【Question title】: Activate button to get to next page while scraping (Python, BeautifulSoup)
【Posted】: 2020-10-27 07:59:30
【Question】:

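I am trying to build a dataset of FIFA 2020 players. I have only just started web scraping with Python and BeautifulSoup. I want to scrape this site: https://sofifa.com/?r=200061&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo&showCol%5B%5D=pi So far I can get the content I want. My problem is that the site shows only the first 60 players, followed by a "Next" button, and I don't know how to activate it so that I can continue scraping the next page. I want to get the data for all players.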

Here is what I have so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# create dataframe to store data
column_names = ["Name", "Age", "Overall Rating", "Potential", "Team", "Contract expiry", "Height", "Weight", "Strong foot", "Value"] 
df = pd.DataFrame(columns = column_names)


headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://sofifa.com/?r=200054&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("a", {"class": "tooltip"})
Age = pageSoup.find_all("td", {"class": "col col-ae"})
OR = pageSoup.find_all("td", {"class": "col col-oa col-sort"})
PR = pageSoup.find_all("td", {"class": "col col-pt"})
Team = pageSoup.find_all("div", {"class": "bp3-text-overflow-ellipsis"})
contract = pageSoup.find_all("div", {"class": "sub"})
height = pageSoup.find_all("td", {"class": "col col-hi"})
weight = pageSoup.find_all("td", {"class": "col col-wi"})
PF = pageSoup.find_all("td", {"class": "col col-pf"})
Value = pageSoup.find_all("td", {"class": "col col-vl"})


Players_List = []
Age_List = []
OR_List = []
PR_List = []
Team_List = []
contract_List = []
height_List = []
weight_List = []
PF_List = []
Value_List = []

j = 1

for i in range(0,60):
    Players_List.append(Players[i].text)
    Age_List.append(Age[i].text)
    OR_List.append(OR[i].text)
    PR_List.append(PR[i].text)
    Team_List.append(Team[i+j].text)
    contract_List.append(contract[i].text)
    height_List.append(height[i].text)
    weight_List.append(weight[i].text)
    PF_List.append(PF[i].text)
    Value_List.append(Value[i].text)
    j=j+1
df = pd.DataFrame({"Name":Players_List, "Age": Age_List, "Overall Rating":OR_List, "Potential":PR_List, "Team":Team_List, "Contract expiry":contract_List, "Height":height_List,"Weight":weight_List, "Strong foot":PF_List, "Value":Value_List})

I hope someone can help me.

【Comments】:

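  • I suggest one of these two approaches: a. Use a library such as Selenium, which lets you simulate user input. b. If you inspect the request sent to the server (or even just the URL), there is a parameter at the end called offset. It determines which players are shown, so you can increase it to get the players you want.
  • BeautifulSoup cannot do this. It only parses the page; it cannot click buttons or anything like that, which is an automation task. For that you need selenium.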
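
As a rough illustration of the offset approach from the comment above, the page URLs can be generated rather than typed by hand. This is a minimal sketch: the helper name `build_page_url` is hypothetical, the query parameters are copied from the URL in the question, and the page size of 60 is taken from the question's description. It only builds the URLs and does not contact the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://sofifa.com/"
PAGE_SIZE = 60  # the question reports 60 players per page

# Column codes copied from the URL in the question.
COLUMNS = ("ae", "oa", "pt", "vl", "hi", "wi", "pf", "bo", "pi")


def build_page_url(offset, columns=COLUMNS):
    """Return the listing URL for the given player offset (hypothetical helper)."""
    params = [("r", "200061"), ("set", "true")]
    # Repeated keys are passed as separate tuples; urlencode percent-encodes
    # the brackets, giving showCol%5B%5D=... as in the original URL.
    params += [("showCol[]", col) for col in columns]
    params.append(("offset", str(offset)))
    return BASE_URL + "?" + urlencode(params)


# URLs for the first three pages: offsets 0, 60 and 120.
page_urls = [build_page_url(page * PAGE_SIZE) for page in range(3)]
for url in page_urls:
    print(url)
```

Each generated URL can then be passed to `requests.get` together with the `headers` dictionary from the question.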

Tags: python html button web-scraping beautifulsoup


【Solution 1】:

I noticed that there is an offset at the end of the link, so you can edit your code like this without using selenium:

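number_of_pages = 10  # number of 60-player pages to fetch
page = "https://sofifa.com/?r=200061&set=true&showCol[]=ae&showCol[]=oa&showCol[]=pt&showCol[]=vl&showCol[]=hi&showCol[]=wi&showCol[]=pf&showCol[]=bo&showCol[]=pi&offset="
for num_page in range(number_of_pages):
    # The offset advances by 60 per page: 0, 60, 120, ...
    # `requests` and `headers` are the ones defined in the question's code.
    pageTree = requests.get(page + str(num_page * 60), headers=headers)
    # ... rest of the parsing code from the question goes here ...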
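
Since the question asks for all players, a fixed page count may be too small or too large. One possible variation, sketched below, keeps requesting pages until one comes back with no rows. The names `scrape_all_pages` and `fetch_page` are hypothetical, and whether the site actually returns an empty table past the last offset is an assumption that would need checking. The fetching step is passed in as a function so the loop can be tried without network access; here a fake fetcher stands in for the `requests.get` plus BeautifulSoup parsing from the question.

```python
def scrape_all_pages(fetch_page, page_size=60, max_pages=500):
    """Collect rows page by page until a page comes back empty.

    fetch_page(offset) is a caller-supplied function returning a list of
    row dictionaries for the page that starts at `offset`.
    max_pages is a safety cap so the loop always terminates.
    """
    rows = []
    for page_number in range(max_pages):
        batch = fetch_page(page_number * page_size)
        if not batch:  # assumed signal that the last page has been passed
            break
        rows.extend(batch)
    return rows


# Stand-in for the real fetcher: pretends the site has 130 players.
def fake_fetch_page(offset):
    players = [{"Name": f"Player {i}"} for i in range(130)]
    return players[offset:offset + 60]


all_rows = scrape_all_pages(fake_fetch_page)
print(len(all_rows))
```

In real use, `fetch_page` would request the URL ending in `offset=<offset>`, parse the response, and return one dictionary per player; the resulting list can be handed to `pd.DataFrame(all_rows)`.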
