【Posted】: 2020-06-06 19:35:46
【Question】:
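I'm trying to adapt code from FCPython, which uses requests and BeautifulSoup, to scrape player data. I can scrape and follow the team links and then get a truncated href for each player link, but I run into a problem when I join the site's base URL to the player hrefs. I can't figure out why the base URL is often repeated several times in front of the truncated player link.
Any help or guidance would be much appreciated.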
Please see the code and sample output below.
import requests
from bs4 import BeautifulSoup
from os.path import basename

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

teamLinks = []
links = soup.select("td.hauptlink.no-border-links.show-for-small.show-for-pad a")

for i in range(0,20):
    teamLinks.append(links[i].get("href"))

for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]

playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):
    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')
    links = soup.select("span.show-for-small a")

    for j in range(len(links)):
        playerLinks.append(links[j].get("href"))

    for j in range(len(playerLinks)):
        playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]

playerLinks = list(set(playerLinks))

print(playerLinks)

Sample output:
['https://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.uk/ryan-fraser/profil/spieler/146795',
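The repeated prefix in the output is consistent with the base URL being prepended to the whole `playerLinks` list once per team, so links collected early get prefixed again on every later pass. The sketch below is only an illustration of that suspected cause, not a confirmed diagnosis: it makes no network requests, `TEAM_HREFS` and both function names are hypothetical, and every href except the `ryan-fraser` one is made up. The second function shows one possible alternative, prefixing each href exactly once with `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.transfermarkt.co.uk"

# Hypothetical stand-in for the hrefs scraped from three team pages.
# Only the first href appears in the question; the others are invented.
TEAM_HREFS = [
    ["/ryan-fraser/profil/spieler/146795"],
    ["/player-b/profil/spieler/2"],
    ["/player-c/profil/spieler/3"],
]


def prefix_whole_list_each_time(team_hrefs):
    """Mimic the suspected bug: each pass re-prefixes every link so far."""
    player_links = []
    for hrefs in team_hrefs:
        player_links.extend(hrefs)
        # This loop covers links from earlier teams too, so they
        # receive the base URL again on every iteration.
        for j in range(len(player_links)):
            player_links[j] = BASE_URL + player_links[j]
    return player_links


def prefix_each_href_once(team_hrefs):
    """Build the absolute URL at the moment each href is collected."""
    player_links = set()  # a set also removes duplicate links
    for hrefs in team_hrefs:
        for href in hrefs:
            player_links.add(urljoin(BASE_URL, href))
    return sorted(player_links)


buggy = prefix_whole_list_each_time(TEAM_HREFS)
fixed = prefix_each_href_once(TEAM_HREFS)

# How many times the base URL appears in each link.
print([link.count(BASE_URL) for link in buggy])
print([link.count(BASE_URL) for link in fixed])
```

In this sketch the first team's link ends up with one prefix per team processed, which matches the shape of the sample output. If that is what happens in the real script, moving the prefixing loop out of the per-team loop, or prefixing each href as it is appended, should avoid the repetition.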
【Comments】:
Tags: python web-scraping beautifulsoup python-requests