Python Beautiful Soup showing the same result for each page

【Question title】: Python beautiful soup showing same result for each page
【Posted】: 2015-12-29 05:34:19
【Question】:

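I am trying to use Python 3.5 and Beautiful Soup to scrape the titles from every page of Steam's free-to-play search results. However, the results returned are only the titles from the first page, not those from the subsequent pages: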

import requests
from bs4 import BeautifulSoup

titles_list = []

for i in range(3):      # Number of pages plus one
    print(i)
    url = 'http://store.steampowered.com/genre/Free%20to%20Play/?tab=MostPlayed#p' + str(i)
    print(url)

    # Name the parser explicitly so Beautiful Soup does not have to guess one
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    titles = soup.find_all("div", {"class": "tab_item_name"})

    for item in titles:
        try:
            name = item.text
        except AttributeError:
            name = 'sdfsd'

        print(name)
        titles_list.append(name)

Console output (I know 0 and 1 are the same, but i=2 should show a different set of games):

0
http://store.steampowered.com/genre/Free%20to%20Play/?tab=MostPlayed#p0
Dota 2
Team Fortress 2
Warframe
Clicker Heroes
Unturned
Path of Exile
War Thunder
SMITE
Trove
AdVenture Capitalist
1
http://store.steampowered.com/genre/Free%20to%20Play/?tab=MostPlayed#p1
Dota 2
Team Fortress 2
Warframe
Clicker Heroes
Unturned
Path of Exile
War Thunder
SMITE
Trove
AdVenture Capitalist
2
http://store.steampowered.com/genre/Free%20to%20Play/?tab=MostPlayed#p2
Dota 2
Team Fortress 2
Warframe
Clicker Heroes
Unturned
Path of Exile
War Thunder
SMITE
Trove
AdVenture Capitalist

Does anyone know what is happening here?

【Comments】:

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

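The page URL alone will not give you the content that JavaScript renders, so it is more reliable to call the underlying GET request that fetches the data and parse its response.

Using Firebug, I found that the underlying GET request is (for page 2):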

http://store.steampowered.com/search/tabpaginated/render/?query=&start=10&count=10&genre=37&tab=MostPlayed&cc=IN&l=english

Then use the following script to fetch the titles from all 32 pages.

import requests
from bs4 import BeautifulSoup

URL = "http://store.steampowered.com/search/tabpaginated/render/"

for i in range(32):  # 32 pages of 10 results each
    params = {"query": "", "start": i * 10, "count": 10, "genre": 37,
              "tab": "MostPlayed", "cc": "IN", "l": "english"}
    # The response is JSON; the result list is an HTML fragment under "results_html"
    data = requests.get(URL, params=params).json()
    soup = BeautifulSoup(data["results_html"], "html.parser")
    for title in soup.find_all(class_="tab_item_name"):
        print(title.text)
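
The relation between a page number and the start offset in that request can be sketched as below. This is only an illustration: page_url is a hypothetical helper name, and it assumes pages are counted from 1, so that "page 2" maps to start=10 as in the request quoted above. The parameter values are copied from that request and may differ for other regions or genres.

```python
from urllib.parse import urlencode

# Endpoint taken from the request quoted in this answer
BASE = "http://store.steampowered.com/search/tabpaginated/render/"


def page_url(page, count=10):
    """Build the request URL for a 1-based page number (hypothetical helper)."""
    params = {
        "query": "",
        "start": (page - 1) * count,  # page 1 -> 0, page 2 -> 10, ...
        "count": count,
        "genre": 37,
        "tab": "MostPlayed",
        "cc": "IN",
        "l": "english",
    }
    return BASE + "?" + urlencode(params)


# Page 2 reproduces the URL found with Firebug
print(page_url(2))
```

Unlike the #p fragment in the question, the start value is part of the query string, so the server sees a different request for every page.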

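【Comments】:

【Solution 2】:

I think what is happening here is that the list starting with "Dota 2" is always loaded by the original request to http://store.steampowered.com/genre/Free%20to%20Play/?tab=MostPlayed.

You can check in your browser's developer tools that this list is always loaded first, regardless of the number after #p. It is also what your requests.get(url).content receives.

If you refresh the URL ending in #p2 in your browser, you will sometimes see the first list before the other list is displayed.

After a quick look I am not sure what loads the other list, but it must happen after the initial request has been made.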

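【Comments】:

【Solution 3】:

Traditionally, the # fragment is a client-side anchor within the page and is not actually sent to the server. Client-side JavaScript often uses it for various purposes, so that is probably what is happening here. You would need to run or emulate the page's client-side JavaScript to get the subsequent results.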
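
A minimal sketch of why the three URLs in the question fetch the same document, using only the standard library. request_target is a hypothetical helper that approximates what an HTTP client puts on the request line (path plus query, without the fragment); it is not part of requests or Beautiful Soup.

```python
from urllib.parse import urlsplit

# URL prefix from the question; only the "#p<i>" fragment varies
BASE = "http://store.steampowered.com/genre/Free%20to%20Play/?tab=MostPlayed#p"


def request_target(url):
    """Return path and query of a URL, dropping the fragment (hypothetical helper)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Collect the distinct request targets for #p0, #p1 and #p2
targets = {request_target(BASE + str(i)) for i in range(3)}
print(targets)
print(urlsplit(BASE + "2").fragment)
```

All three URLs collapse to a single request target, which is consistent with the identical output in the question: the fragment is available only to code running in the browser.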

【Comments】: