Python - reading lines of text, exploring html, and writing to a document within a single for loop [closed]

【Question title】: Python - reading lines of text, exploring html, and writing to a document within a single for loop [closed]
【Posted】: 2014-04-09 16:11:21
【Question】:

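I was going to ask this question in a general way, but I realized it is too complicated for me to describe in general terms. So here are the specifics:

I am not a programmer. I am a master's candidate in experimental psychology, and as a side project for a statistics class I am building a model to predict game purchases on Steam. I started learning to program so that I could collect data for this project.

My program so far is as follows: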

import urllib2

# The first line opens a list of random Steam IDs I already created,
# the second assigns them to a variable
list = open(r'd:\python\SteamUserIDs.txt').read().splitlines()
SteamID = str(list)

#For the purposes of figuring things out, I'm using just the first 10 entries in my list
#The next four lines are the URL requests and assigning the output to a variable "response"
for SteamID in list[0:10:1]:
    request = urllib2.Request('http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key=MYSTEAMAPIKEY&steamid=%s' %SteamID, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36"})
    response = urllib2.urlopen(request)
    request2 = urllib2.Request('http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0001/?key=MYSTEAMAPIKEY&steamids=%s' %SteamID, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36"})
    response2 = urllib2.urlopen(request2)

    f1 = open('D:/Python/Steam/UserData/%s PlayerInfo.txt'% (SteamID, ), 'w')
    for lines in response.readlines():
        f1.write(lines)
    for lines in response2.readlines():
        f1.write(lines)
    f1.close()

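So far, the program works well. It does what it needs to do, but I need more information. Unfortunately, I have not found any way to access the other variables I am interested in through the Steam API. However, the rest of the information I want is available in the HTML source of a user's profile page on Steam. This is where I run into problems. I can get the profile URL from a line in the second request.

In the second URL request, there is a line that reads: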

"profileurl": "http://steamcommunity.com/id/PLAYERID/"

where "PLAYERID" is a string the user created themselves

or

"profileurl": "http://steamcommunity.com/profiles/STEAMNUMBER/"

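where "STEAMNUMBER" is a number generated by Steam (the same number used in my SteamID variable). I think this form is used when the user has not created a custom name for their profile.

Question 1: I am having trouble printing the player URL above. I have been trying to use "profileurl": as a target and then line.split() to capture the URL, but I always end up with funky characters indicating tabs and returns, and I cannot figure out how to get rid of the quotation marks.

Question 2: Once I am on the HTML page, I can find the data manually, but I do not know how to tell Python to look for it. One piece of information I am interested in is the number of reviews a person has written. You can find it in this part of the HTML: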

<div class="profile_count_link">
                <a     href="http://steamcommunity.com/id/STEAMUSER/recommended/">
                    <span     class="count_link_label">Reviews</span>&nbsp;
                        <span     class="profile_count_link_total">

                                                              3

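For these sections, all I am interested in is the number, but I really do not know how to capture it when it is on a different line from the text I am using as a reference.

Question 3: Is it possible to keep this code within my current program and for loop, so that the numbers appear in the same document? I tried appending a piece of code to find the profile URL, but after doing so I started losing parts of the earlier responses.

Sorry for the long post.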

【Question comments】:

I don't understand. Do you receive HTML or JSON from the server?
I was told it is JSON by default, but I don't know what JSON is.

Tags: python html for-loop steam

【Solution 1】:

When you call the Steam API, append &format=json to your URLs, i.e. to the URLs below:

http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key=MYSTEAMAPIKEY&steamid=%s
http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0001/?key=MYSTEAMAPIKEY&steamids=%s

I think JSON is the default format it returns, but this makes it explicit.

Once you have the result, use Python's json module to load the data as a JSON object:
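As a rough sketch of the suggestion above, the query string can be assembled from a dictionary instead of being glued together by hand, which makes it harder to forget the format parameter. This example is written for Python 3, where the URL helpers live in `urllib.parse` rather than `urllib2`. The function name `summaries_url` and the ID `12345` are made up for illustration; `MYSTEAMAPIKEY` is the placeholder from the question.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; only the query string is built here.
BASE = "http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0001/"


def summaries_url(api_key, steam_id):
    """Build the request URL with format=json spelled out (hypothetical helper)."""
    query = urlencode({"key": api_key, "steamids": steam_id, "format": "json"})
    return BASE + "?" + query


url = summaries_url("MYSTEAMAPIKEY", "12345")  # "12345" is a made-up ID
print(url)
```

`urlencode` also escapes any characters that are not safe in a URL, so the same helper keeps working if a value ever contains spaces or symbols.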

import json

data = json.load(response)

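Question 1:
You can then access the profile URL with data["profileurl"]. You do not need any string-splitting functions for this.

Note: you will need to adjust how you access profileurl according to the structure of the JSON response the Steam API returns. Once you understand how the JSON data is laid out, you will see how to reach the values inside it.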
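To illustrate the note about nesting, here is a minimal sketch that pulls `profileurl` out of a parsed response without knowing exactly how deep it sits. The sample body below is an assumption made up for this example, not a captured Steam response, and `find_key` is a hypothetical helper; in practice you would print the real response once and index into it directly. The example is written for Python 3.

```python
import json

# Hypothetical sample body; the real nesting depends on the API version.
sample_body = '''
{
  "response": {
    "players": {
      "player": [
        {"steamid": "0000",
         "profileurl": "http://steamcommunity.com/id/PLAYERID/"}
      ]
    }
  }
}
'''


def find_key(node, key):
    """Return the first value stored under `key` in nested dicts/lists, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


data = json.loads(sample_body)
profile_url = find_key(data, "profileurl")
print(profile_url)
```

Because the JSON parser returns an ordinary Python string, the value comes back without the surrounding quotation marks, tabs, or line breaks that `line.split()` leaves behind.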

Question 2:
To extract content from a specific piece of HTML, you can use the BeautifulSoup library. To get the review count from the HTML above with BeautifulSoup, you can do:

from bs4 import BeautifulSoup
html = '''
<div class="profile_count_link">
<a     href="http://steamcommunity.com/id/STEAMUSER/recommended/">
<span     class="count_link_label">Reviews</span>&nbsp;
<span     class="profile_count_link_total">
3
</span>
</div>
'''
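# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
review_count = soup.find('span', attrs={'class': 'profile_count_link_total'})
print(review_count.text.strip())  # prints 3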
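If BeautifulSoup is not installed, roughly the same lookup can be sketched with the standard library's `html.parser` module. This is a swapped-in alternative, not the approach recommended above, and it is more fragile: it assumes the span carries exactly the class `profile_count_link_total` and contains only the number. The class name `ReviewCountParser` is made up for this example, which is written for Python 3.

```python
from html.parser import HTMLParser


class ReviewCountParser(HTMLParser):
    """Remember the number inside <span class="profile_count_link_total">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.count = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "profile_count_link_total") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        text = data.strip()
        if self._inside and text and self.count is None:
            # Drop thousands separators such as "1,234" before converting
            self.count = int(text.replace(",", ""))


html = '''
<div class="profile_count_link">
  <a href="http://steamcommunity.com/id/STEAMUSER/recommended/">
    <span class="count_link_label">Reviews</span>&nbsp;
    <span class="profile_count_link_total">
        3
    </span>
  </a>
</div>
'''

parser = ReviewCountParser()
parser.feed(html)
print(parser.count)
```

The parser does not care that the number sits on its own line: whitespace around the text is stripped before it is converted to an integer.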

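Question 3:
I am not entirely sure what you are asking here. But if you start with the suggestions above, you will get a clearer picture of how to solve the problem.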
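One possible reading of Question 3, offered as a guess: a response body can only be read once, so if a response is read to search for the profile URL and then read again to write it to the file, the second read comes back empty. The sketch below (Python 3) reads each body once into a string and then writes everything for one player in a single place. `write_player_report`, the fake response, and all sample values are hypothetical stand-ins for the real network calls.

```python
import io


def write_player_report(out, steam_id, games_body, summary_body, review_count):
    """Write everything known about one player to the open file `out`."""
    out.write("steamid: %s\n" % steam_id)
    out.write("owned_games: %s\n" % games_body.strip())
    out.write("player_summary: %s\n" % summary_body.strip())
    out.write("review_count: %s\n" % review_count)


# Stand-in for a network response: like a real response object,
# a BytesIO stream is empty once it has been read.
fake_response = io.BytesIO(b'{"response": {"game_count": 2}}')
games_body = fake_response.read().decode("utf-8")  # read once, keep the string
leftover = fake_response.read()                     # a second read returns b""

# Made-up summary body and review count for illustration
summary_body = '{"profileurl": "http://steamcommunity.com/id/PLAYERID/"}'

report = io.StringIO()  # swap in open(path, "w") to write a real file
write_player_report(report, "12345", games_body, summary_body, 3)
text = report.getvalue()
print(text)
```

Inside the existing for loop, the same idea would mean storing each response body in a variable first, then parsing the stored strings for the profile URL and review count, and only then writing all of it to the per-player file.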

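【Comments】:

Thanks for the guidance. This is a very basic question, but beautifulsoup was recommended to me before and I downloaded it, but I am not sure how to "install" it... how do I make it available?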