Python BeautifulSoup unable to join base URL and scraped truncated links (answers)

【Question title】: Python BeautifulSoup unable to join base URL and scraped truncated links
【Posted】: 2020-06-06 19:35:46
【Question】:

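I am adapting code from FCPython that uses requests and BeautifulSoup to scrape player data. I can scrape and follow the team links, and for each player link I get a truncated (relative) href. The problem appears when I try to join the site's base URL to that player href: I can't work out why the base URL is often repeated several times in front of the truncated player link.

Any help or guidance would be much appreciated.

Please see the code and sample output below.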

import requests
from bs4 import BeautifulSoup
from os.path  import basename

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

teamLinks = []

links = soup.select("td.hauptlink.no-border-links.show-for-small.show-for-pad a")

for i in range(0,20):
    teamLinks.append(links[i].get("href"))

for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]

playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):

    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    links = soup.select("span.show-for-small a")

    for j in range(len(links)):
        playerLinks.append(links[j].get("href"))

    for j in range(len(playerLinks)):
        playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]

    playerLinks = list(set(playerLinks))

print(playerLinks)

Sample output:

['https://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.uk/ryan-fraser/profil/spieler/146795',

Tags: python web-scraping beautifulsoup python-requests

【Solution 1】:

In your code:

for j in range(len(playerLinks)):
    playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]

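this loop sits inside the outer team loop, so on every team it prepends "https://www.transfermarkt.co.uk" again to every string already in the list. Links collected from earlier teams therefore receive the base URL once more on each later pass. Remove this loop and add the base URL only once, at the point where the link is appended: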

playerLinks.append("https://www.transfermarkt.co.uk" + links[j].get("href"))
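To see the effect without hitting the site, here is a minimal offline sketch. The two hrefs stand in for the links scraped from two team pages, and the names `BASE`, `team_pages`, `buggy` and `fixed` are made up for illustration; only the loop structure mirrors the question and the fix.

```python
BASE = "https://www.transfermarkt.co.uk"

# Stand-in data: one relative player href per "team page" (assumed for this demo).
team_pages = [
    ["/ryan-fraser/profil/spieler/146795"],
    ["/joshua-king/profil/spieler/91059"],
]


def buggy(pages):
    """Mirror the question's loops: re-prefix the whole list once per team."""
    player_links = []
    for hrefs in pages:
        for href in hrefs:
            player_links.append(href)
        # This loop covers links from earlier teams too, so they are prefixed again.
        for j in range(len(player_links)):
            player_links[j] = BASE + player_links[j]
    return player_links


def fixed(pages):
    """Prefix each href exactly once, at the moment it is appended."""
    player_links = []
    for hrefs in pages:
        for href in hrefs:
            player_links.append(BASE + href)
    return player_links


print(buggy(team_pages))
print(fixed(team_pages))
```

With two teams, the first team's link ends up with the base URL twice in the buggy version; with 20 teams the early links accumulate many more copies, which matches the sample output in the question.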

Final code:

import requests
from bs4 import BeautifulSoup
from os.path  import basename

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

teamLinks = []

links = soup.select("td.hauptlink.no-border-links.show-for-small.show-for-pad a")

for i in range(0,20):
    teamLinks.append(links[i].get("href"))

for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]

playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):

    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    links = soup.select("span.show-for-small a")

    for j in range(len(links)):
        playerLinks.append("https://www.transfermarkt.co.uk" + links[j].get("href"))

    playerLinks = list(set(playerLinks))

print(playerLinks)

Prints:

['https://www.transfermarkt.co.uk/joshua-king/profil/spieler/91059', 'https://www.transfermarkt.co.uk/michael-verrips/profil/spieler/288259', 'https://www.transfermarkt.co.uk/teemu-pukki/profil/spieler/46972', 'https://www.transfermarkt.co.uk/sander-berge/profil/spieler/333014', 'https://www.transfermarkt.co.uk/dwight-mcneil/profil/spieler/584769', 'https://www.transfermarkt.co.uk/sam-byram/profil/spieler/236953', 'https://www.transfermarkt.co.uk/carlos-sanchez/profil/spieler/51226',

...
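As a possible variation (not part of the answer above), the standard library's `urllib.parse.urljoin` can build the absolute URLs instead of string concatenation. It leaves an href that is already absolute unchanged, so applying it twice does not stack the base URL. The helper name `absolutize` and the sample list are assumptions for this sketch.

```python
from urllib.parse import urljoin

BASE = "https://www.transfermarkt.co.uk"


def absolutize(hrefs, base=BASE):
    """Return sorted, de-duplicated absolute URLs for the given hrefs.

    Entries that are None or empty (e.g. an <a> tag without an href) are skipped.
    """
    return sorted({urljoin(base, href) for href in hrefs if href})


# Hypothetical mix of relative, already-absolute and missing hrefs.
sample = [
    "/joshua-king/profil/spieler/91059",
    "https://www.transfermarkt.co.uk/joshua-king/profil/spieler/91059",
    "/teemu-pukki/profil/spieler/46972",
    None,
]
print(absolutize(sample))
```

In the scraper, the hrefs passed in would be the values of `links[j].get("href")` for each team page. Collecting them in a set also removes the need to call `list(set(...))` inside the loop.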
