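【Question title】: Python BeautifulSoup unable to join base URL and scraped truncated links
【Posted】: 2020-06-06 19:35:46
【Question description】:

I'm trying to adapt code from FCPython that uses requests and BeautifulSoup to scrape player data. I can scrape and follow the team links, and I then get a truncated (relative) href for each player link, but I run into a problem when I try to join the site's base URL to those hrefs. I can't figure out why the base URL is often repeated several times in front of the truncated player link.

Any help or guidance would be much appreciated.

  • See the code and sample output below.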

    import requests
    from bs4 import BeautifulSoup
    from os.path  import basename
    
    headers = {'User-Agent': 
               'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    
    page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')
    
    teamLinks = []
    
    links = soup.select("td.hauptlink.no-border-links.show-for-small.show-for-pad a")
    
    for i in range(0,20):
        teamLinks.append(links[i].get("href"))
    
    for i in range(len(teamLinks)):
        teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
    
    playerLinks = []
    
    #Run the scraper through each of our 20 team links
    for i in range(len(teamLinks)):
    
        page = teamLinks[i]
        tree = requests.get(page, headers = headers)
        soup = BeautifulSoup(tree.content, 'html.parser')
    
        links = soup.select("span.show-for-small a")
    
        for j in range(len(links)):
            playerLinks.append(links[j].get("href"))
    
        for j in range(len(playerLinks)):
            playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]
    
        playerLinks = list(set(playerLinks))
    
    print(playerLinks)
    

    Sample output:

    ['https://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.uk/ryan-fraser/profil/spieler/146795',
    

【Comments】:

    Tags: python web-scraping beautifulsoup python-requests


    【Solution 1】:

    In your code:

    for j in range(len(playerLinks)):
        playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]
    

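    You are prepending "https://www.transfermarkt.co.uk" again and again to the strings already in the list. This loop sits inside the loop over teams and runs over the whole of playerLinks each time, so links collected from earlier teams get the prefix added once more on every pass. Remove this loop and add the base URL only once, here: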

    playerLinks.append("https://www.transfermarkt.co.uk" + links[j].get("href"))
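
To see why the prefix piles up, here is a small offline sketch. The hrefs in `fake_team_pages` and the helper names `collect_buggy` and `collect_fixed` are made up for illustration; no request is sent, and the `set()` de-duplication is left out so the order stays predictable.

```python
BASE = "https://www.transfermarkt.co.uk"

# Hypothetical hrefs standing in for what each team page would return.
fake_team_pages = [
    ["/ryan-fraser/profil/spieler/146795"],
    ["/joshua-king/profil/spieler/91059"],
    ["/teemu-pukki/profil/spieler/46972"],
]


def collect_buggy(pages):
    """Mirror the question's loop: re-prefix the whole list for every team."""
    player_links = []
    for hrefs in pages:
        for href in hrefs:
            player_links.append(href)
        # This also touches links from earlier teams, so they gain another prefix.
        for j in range(len(player_links)):
            player_links[j] = BASE + player_links[j]
    return player_links


def collect_fixed(pages):
    """Prefix each href exactly once, at the moment it is appended."""
    player_links = []
    for hrefs in pages:
        for href in hrefs:
            player_links.append(BASE + href)
    return player_links


buggy = collect_buggy(fake_team_pages)
fixed = collect_fixed(fake_team_pages)

# Count how many times the base URL appears in each collected link.
print([link.count(BASE) for link in buggy])  # [3, 2, 1]
print([link.count(BASE) for link in fixed])  # [1, 1, 1]
```

Links from the first team are re-prefixed on every later pass, which matches the long chain of repeated base URLs in the question's output.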
    

    Final code:

    import requests
    from bs4 import BeautifulSoup
    
    headers = {'User-Agent':
               'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    
    page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')
    
    teamLinks = []
    
    links = soup.select("td.hauptlink.no-border-links.show-for-small.show-for-pad a")
    
    for i in range(0,20):
        teamLinks.append(links[i].get("href"))
    
    for i in range(len(teamLinks)):
        teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
    
    playerLinks = []
    
    #Run the scraper through each of our 20 team links
    for i in range(len(teamLinks)):
    
        page = teamLinks[i]
        tree = requests.get(page, headers = headers)
        soup = BeautifulSoup(tree.content, 'html.parser')
    
        links = soup.select("span.show-for-small a")
    
        for j in range(len(links)):
            playerLinks.append("https://www.transfermarkt.co.uk" + links[j].get("href"))
    
        playerLinks = list(set(playerLinks))
    
    print(playerLinks)
    

    Prints:

    ['https://www.transfermarkt.co.uk/joshua-king/profil/spieler/91059', 'https://www.transfermarkt.co.uk/michael-verrips/profil/spieler/288259', 'https://www.transfermarkt.co.uk/teemu-pukki/profil/spieler/46972', 'https://www.transfermarkt.co.uk/sander-berge/profil/spieler/333014', 'https://www.transfermarkt.co.uk/dwight-mcneil/profil/spieler/584769', 'https://www.transfermarkt.co.uk/sam-byram/profil/spieler/236953', 'https://www.transfermarkt.co.uk/carlos-sanchez/profil/spieler/51226',
    
    ...
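
As a possible alternative to string concatenation, the standard library's `urllib.parse.urljoin` can build the absolute URL. It leaves an href that is already absolute unchanged, so joining twice by accident does no harm. This is only a sketch: the `absolutize` helper and the sample `hrefs` list are assumptions standing in for the values returned by `links[j].get("href")`, not part of the original answer.

```python
from urllib.parse import urljoin

BASE = "https://www.transfermarkt.co.uk"


def absolutize(hrefs, base=BASE):
    """Return a set of absolute URLs; already-absolute hrefs stay unchanged."""
    return {urljoin(base, href) for href in hrefs if href}


# Hypothetical hrefs, standing in for [a.get("href") for a in links].
hrefs = [
    "/joshua-king/profil/spieler/91059",
    "/teemu-pukki/profil/spieler/46972",
    "https://www.transfermarkt.co.uk/teemu-pukki/profil/spieler/46972",  # already absolute
    None,  # .get("href") returns None when the attribute is missing
]

player_links = set()
player_links |= absolutize(hrefs)

# Joining a second time is harmless because the URLs are already absolute.
player_links = absolutize(player_links)

for url in sorted(player_links):
    print(url)
```

Collecting into a set from the start also removes the need to call `list(set(playerLinks))` on every pass through the team loop.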
    

    【Comments】:
