【Posted】:2020-10-27 07:59:30
【Question】:
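I'm trying to build a dataset of FIFA 2020 players. I've only just started web scraping with Python and BeautifulSoup. I want to scrape this site: https://sofifa.com/?r=200061&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo&showCol%5B%5D=pi So far I've been able to get the content I want. My problem is that the site shows the first 60 players and then has a "Next" button, and I don't know how to activate it so I can keep scraping on the next page. I want to get the data for all players.
This is what I have so far: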
import requests
from bs4 import BeautifulSoup
import pandas as pd
# create dataframe to store data
column_names = ["Name", "Age", "Overall Rating", "Potential", "Team", "Contract expiry", "Height", "Weight", "Strong foot", "Value"]
df = pd.DataFrame(columns = column_names)
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://sofifa.com/?r=200054&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Players = pageSoup.find_all("a", {"class": "tooltip"})
Age = pageSoup.find_all("td", {"class": "col col-ae"})
OR = pageSoup.find_all("td", {"class": "col col-oa col-sort"})
PR = pageSoup.find_all("td", {"class": "col col-pt"})
Team = pageSoup.find_all("div", {"class": "bp3-text-overflow-ellipsis"})
contract = pageSoup.find_all("div", {"class": "sub"})
height = pageSoup.find_all("td", {"class": "col col-hi"})
weight = pageSoup.find_all("td", {"class": "col col-wi"})
PF = pageSoup.find_all("td", {"class": "col col-pf"})
Value = pageSoup.find_all("td", {"class": "col col-vl"})
Players_List = []
Age_List = []
OR_List = []
PR_List = []
Team_List = []
contract_List = []
height_List = []
weight_List = []
PF_List = []
Value_List = []
j = 1
# the page lists 60 players, so copy 60 rows into the lists
for i in range(0, 60):
    Players_List.append(Players[i].text)
    Age_List.append(Age[i].text)
    OR_List.append(OR[i].text)
    PR_List.append(PR[i].text)
    # i + j walks the odd indices 1, 3, 5, ... of Team
    Team_List.append(Team[i + j].text)
    contract_List.append(contract[i].text)
    height_List.append(height[i].text)
    weight_List.append(weight[i].text)
    PF_List.append(PF[i].text)
    Value_List.append(Value[i].text)
    j = j + 1
df = pd.DataFrame({"Name":Players_List, "Age": Age_List, "Overall Rating":OR_List, "Potential":PR_List, "Team":Team_List, "Contract expiry":contract_List, "Height":height_List,"Weight":weight_List, "Strong foot":PF_List, "Value":Value_List})
I hope someone can help me.
【Comments】:
-
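I suggest one of two approaches: a. Use a library like Selenium, which lets you simulate user input. b. If you inspect the requests made to the server (or even just the URL), there is a parameter at the end called offset. It tells the server which players to show, so you can increase it to get the players you want.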
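Option b might look roughly like the sketch below. It assumes the parameter is literally named `offset` and that it advances in steps of 60 (the page size mentioned in the question); neither has been checked against the live site. `build_page_url` and `scrape_all` are hypothetical helper names, and the `a.tooltip` selector is simply the one from the question's code.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://sofifa.com/"
# Column codes copied from the URL in the question's code.
SHOW_COLS = ["ae", "oa", "pt", "vl", "hi", "wi", "pf", "bo"]
PAGE_SIZE = 60  # assumption: one page holds 60 players, as the question reports


def build_page_url(offset, roster="200054"):
    """Return the listing URL for the page that starts at `offset`."""
    params = [("r", roster), ("set", "true")]
    params += [("showCol[]", col) for col in SHOW_COLS]
    params.append(("offset", offset))  # assumption: the parameter is named "offset"
    return BASE_URL + "?" + urlencode(params)


def scrape_all(max_pages=3):
    """Collect player names page by page until a page is empty or max_pages is hit."""
    # Third-party imports live here so build_page_url works without them installed.
    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": "Mozilla/5.0"}
    names = []
    for page_index in range(max_pages):
        url = build_page_url(page_index * PAGE_SIZE)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        players = soup.find_all("a", {"class": "tooltip"})
        if not players:
            break  # no player links on this page, so stop
        names.extend(player.text for player in players)
        time.sleep(1)  # pause between requests to go easy on the server
    return names
```

Calling `scrape_all(max_pages=5)` would request offsets 0, 60, 120, 180 and 240. The other columns can be collected inside the same loop with the `find_all` calls from the question.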
-
BeautifulSoup can't do that. It can only scrape the page. It can't click buttons or anything like that... that is part of automation. For that you need to use selenium.
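A rough sketch of the Selenium route, in case it helps. It assumes the button is an ordinary link whose visible text contains "Next", which has not been verified against the page markup; `collect_pages` and `run_with_chrome` are hypothetical helper names. The HTML of each page is returned so it can be handed to BeautifulSoup as before.

```python
def collect_pages(driver, next_locator, max_pages=5):
    """Return the HTML of up to max_pages pages, clicking "Next" between them.

    driver is a Selenium WebDriver that has already loaded the first page.
    next_locator is a (strategy, value) pair such as (By.PARTIAL_LINK_TEXT, "Next").
    """
    pages = [driver.page_source]
    for _ in range(max_pages - 1):
        # Look the link up again on every page; old element handles go stale.
        matches = driver.find_elements(*next_locator)
        if not matches:
            break  # no "Next" link found, so this is taken to be the last page
        matches[0].click()
        pages.append(driver.page_source)
    return pages


def run_with_chrome(url, max_pages=5):
    """Open Chrome, load url and collect the HTML of the first few pages."""
    # Imported here so collect_pages can be exercised without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: the button is a link whose text contains "Next".
        return collect_pages(driver, (By.PARTIAL_LINK_TEXT, "Next"), max_pages)
    finally:
        driver.quit()
```

Each string returned by `run_with_chrome(page)` can be parsed with `BeautifulSoup(html, 'html.parser')` and run through the same `find_all` calls as in the question. If the next page loads asynchronously, an explicit wait after the click would probably be needed.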
Tags: python html button web-scraping beautifulsoup