Activate button to get to next page while scraping (Python, BeautifulSoup): answers

【Question title】: Activate button to get to next page while scraping (Python, BeautifulSoup)
【Posted】: 2020-10-27 07:59:30
【Question】:

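I am trying to build a dataset of the FIFA 2020 players. I have just started web scraping with Python and BeautifulSoup, and I want to scrape this site: https://sofifa.com/?r=200061&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo&showCol%5B%5D=pi So far I can get the content I want. My problem is that the site shows the first 60 players and then has a "Next" button, and I don't know how to activate it to continue scraping on the next page. I want to get the data for all players.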

This is what I have so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# create dataframe to store data
column_names = ["Name", "Age", "Overall Rating", "Potential", "Team", "Contract expiry", "Height", "Weight", "Strong foot", "Value"] 
df = pd.DataFrame(columns = column_names)


headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://sofifa.com/?r=200054&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("a", {"class": "tooltip"})
Age = pageSoup.find_all("td", {"class": "col col-ae"})
OR = pageSoup.find_all("td", {"class": "col col-oa col-sort"})
PR = pageSoup.find_all("td", {"class": "col col-pt"})
Team = pageSoup.find_all("div", {"class": "bp3-text-overflow-ellipsis"})
contract = pageSoup.find_all("div", {"class": "sub"})
height = pageSoup.find_all("td", {"class": "col col-hi"})
weight = pageSoup.find_all("td", {"class": "col col-wi"})
PF = pageSoup.find_all("td", {"class": "col col-pf"})
Value = pageSoup.find_all("td", {"class": "col col-vl"})


Players_List = []
Age_List = []
OR_List = []
PR_List = []
Team_List = []
contract_List = []
height_List = []
weight_List = []
PF_List = []
Value_List = []

j = 1

for i in range(0,60):
    Players_List.append(Players[i].text)
    Age_List.append(Age[i].text)
    OR_List.append(OR[i].text)
    PR_List.append(PR[i].text)
    Team_List.append(Team[i+j].text)
    contract_List.append(contract[i].text)
    height_List.append(height[i].text)
    weight_List.append(weight[i].text)
    PF_List.append(PF[i].text)
    Value_List.append(Value[i].text)
    j=j+1
df = pd.DataFrame({"Name":Players_List, "Age": Age_List, "Overall Rating":OR_List, "Potential":PR_List, "Team":Team_List, "Contract expiry":contract_List, "Height":height_List,"Weight":weight_List, "Strong foot":PF_List, "Value":Value_List})

I hope someone can help me.

【Comments】:

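I suggest you pick one of two approaches: a. Use a library like Selenium, which lets you simulate user input. b. If you inspect the requests made to the server (or even just the URL), there is a parameter at the end called offset. It determines which players are shown, so you can increment it to get the players you want.
BeautifulSoup cannot do this. It only parses the page; it cannot click buttons or anything like that, which is automation. For that you need Selenium.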
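The offset approach from the first comment can be sketched as a small URL builder. This is only an illustration: the function name `build_page_url` and the constants are made up here, the column codes and the `r` value are copied from the URL in the question, and the page size of 60 is taken from the asker's description rather than from any site documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://sofifa.com/"
PAGE_SIZE = 60  # the question says each page lists 60 players

# Column codes copied from the URL in the question.
COLUMNS = ["ae", "oa", "pt", "vl", "hi", "wi", "pf", "bo", "pi"]


def build_page_url(page_index, page_size=PAGE_SIZE):
    """Return the listing URL for a zero-based page index (hypothetical helper)."""
    params = [("r", "200061"), ("set", "true")]
    # Repeating the same key produces showCol[]=ae&showCol[]=oa&... (percent-encoded).
    params += [("showCol[]", col) for col in COLUMNS]
    # Page 0 starts at player 0, page 1 at player 60, and so on.
    params.append(("offset", page_index * page_size))
    return BASE_URL + "?" + urlencode(params)


first_three = [build_page_url(i) for i in range(3)]
```

Each URL in `first_three` can then be passed to `requests.get` together with the `headers` dictionary from the question.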

Tags: python html button web-scraping beautifulsoup

【Solution 1】:

I noticed there is an offset parameter at the end of the link, so you can edit your code like this, without using Selenium:

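number_of_pages = 10  # how many result pages to fetch
page_size = 60        # players shown per page
page = "https://sofifa.com/?r=200061&set=true&showCol[]=ae&showCol[]=oa&showCol[]=pt&showCol[]=vl&showCol[]=hi&showCol[]=wi&showCol[]=pf&showCol[]=bo&showCol[]=pi&offset="
for num_page in range(number_of_pages):
    # offset 0, 60, 120, ... selects the first player of each page
    pageTree = requests.get(page + str(num_page * page_size), headers=headers)
    # ... rest of the code (parse pageTree as in the question) ...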
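Since the asker wants all players and the page count is not known in advance, one option is to keep increasing the offset until a page comes back empty or short. The sketch below shows only that loop. `iter_all_rows`, `fetch_html`, and `parse_rows` are hypothetical names, and the stopping rule assumes the site returns fewer than 60 rows (or none) past the last page, which is not confirmed in this thread. The fake fetch and parse functions are there only so the example runs offline.

```python
from typing import Callable, Iterator, List

PAGE_SIZE = 60  # players per page, per the question


def iter_all_rows(
    fetch_html: Callable[[int], str],
    parse_rows: Callable[[str], List[dict]],
    page_size: int = PAGE_SIZE,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield parsed rows page by page until a page is empty or short."""
    for page_index in range(max_pages):
        html = fetch_html(page_index * page_size)  # argument is the offset
        rows = parse_rows(html)
        if not rows:
            break  # assumption: an empty page means we ran past the end
        yield from rows
        if len(rows) < page_size:
            break  # assumption: a short page is the last one


# Stand-in data so the sketch runs without network access: 130 fake players.
fake_players = [{"Name": f"Player {n}"} for n in range(130)]


def fake_fetch(offset: int) -> str:
    return str(offset)  # pretend the "HTML" is just the offset


def fake_parse(html: str) -> List[dict]:
    start = int(html)
    return fake_players[start:start + PAGE_SIZE]


all_rows = list(iter_all_rows(fake_fetch, fake_parse))
```

In real use, `fetch_html` would wrap `requests.get(page + str(offset), headers=headers).content` and `parse_rows` would wrap the BeautifulSoup extraction from the question. The `max_pages` cap is a safety net in case the stopping assumption does not hold.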

【Comments】:

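I see. With the information you provided I was able to make it work. Thank you very much.
I was just wondering whether you know how to collect the players' full names. Right now I get "L. Messi" instead of "Lionel Messi". This is the HTML structure: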
::before
cdn.sofifa.com/flags/ar.png" data-src="cdn.sofifa.com/flags/ar.png" data-srcset="cdn.sofifa.com/flags/ar@2x.png 2x, cdn.sofifa.com/flags/ar@3x.png 3x" class="flag loaded" srcset="cdn.sofifa.com/flags/ar@2x.png 2x, cdn.sofifa.com/flags/ar@3x.png 3x" data-was-processed="true"> "L.Messi"
Once you have the "a" element, extract the information from its "data-tooltip" attribute.
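A minimal sketch of that suggestion follows. The `sample_html` markup is a simplified, hypothetical stand-in for the real page (the full structure was cut off in the comment above), and it assumes the full name is stored in `data-tooltip` on the same `a` elements with class `tooltip` that the question already selects.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the comment above.
sample_html = """
<table>
  <tr><td><a class="tooltip" data-tooltip="Lionel Messi" href="#">L. Messi</a></td></tr>
  <tr><td><a class="tooltip" href="#">No Tooltip</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

full_names = []
for link in soup.find_all("a", {"class": "tooltip"}):
    # Tag.get() returns None when the attribute is missing,
    # so fall back to the visible (abbreviated) text in that case.
    full_names.append(link.get("data-tooltip") or link.get_text(strip=True))
```

In the question's loop this would mean appending `Players[i].get("data-tooltip")` instead of `Players[i].text`.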