【Question Title】: How to get text from next pages using BeautifulSoup in Python 3?
【Posted】: 2016-10-14 02:50:46
【Question】:

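I'm trying to get all match results from every page for a team. So far I can get every opponent 1 vs. opponent 2 result along with the scores, but I don't know how to get the next page to collect the rest of the data. Should I find the next-page link and put it in a while loop? Here is the link to the team I want: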

http://www.gosugamers.net/counterstrike/teams/7397-natus-vincere/matches

Here is what I have so far; it only records each team's matches and scores.

def all_match_outcomes():

    for match_outcomes in match_history_url():
        rest_server(True)
        page = requests.get(match_outcomes).content
        soup = BeautifulSoup(page, 'html.parser')

        team_name_element = soup.select_one('div.teamNameHolder')
        team_name = team_name_element.find('h1').text.replace('- Team Overview', '')

        for match_outcome in soup.select('table.simple.gamelist.profilelist tr'):
            opp1 = match_outcome.find('span', {'class': 'opp1'}).text
            opp2 = match_outcome.find('span', {'class': 'opp2'}).text

            opp1_score = match_outcome.find('span', {'class': 'hscore'}).text
            opp2_score = match_outcome.find('span', {'class': 'ascore'}).text

            if match_outcome(True):  # If teams have past matches
                print(team_name, '%s %s:%s %s' % (opp1, opp1_score, opp2_score, opp2))

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup html-parsing


    【Solution 1】:

    Get the last page number and iterate page by page until you reach the last page.

    Complete working code:

    import re
    
    import requests
    from bs4 import BeautifulSoup
    
    url = "http://www.gosugamers.net/counterstrike/teams/7397-natus-vincere/matches"
    
    with requests.Session() as session:
        response = session.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
    
        # locate the last page link
        last_page_link = soup.find("span", text="Last").parent["href"]
        # extract the last page number
        last_page_number = int(re.search(r"page=(\d+)$", last_page_link).group(1))
    
        print("Processing page number 1")
        # TODO: extract data
    
        # iterate over all pages starting from page 2 (since we are already on the page 1)
        for page_number in range(2, last_page_number+1):
            print("Processing page number %d" % page_number)
    
            link = "http://www.gosugamers.net/counterstrike/teams/7397-natus-vincere/matches?page=%d" % page_number
            response = session.get(link)
    
            soup = BeautifulSoup(response.content, "html.parser")
    
            # TODO: extract data
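
One way to fill in the `# TODO: extract data` placeholders is sketched below. It reuses the CSS selector and span class names from the question (`opp1`, `opp2`, `hscore`, `ascore`); the function name `extract_matches` is hypothetical, and the assumption that some table rows lack these spans (for example header rows) has not been checked against the live site.

```python
def extract_matches(soup):
    """Collect (opp1, opp1_score, opp2_score, opp2) tuples from one parsed page.

    `soup` is expected to be a BeautifulSoup object for a team's matches page.
    """
    matches = []
    for row in soup.select("table.simple.gamelist.profilelist tr"):
        cells = [
            row.find("span", {"class": "opp1"}),
            row.find("span", {"class": "hscore"}),
            row.find("span", {"class": "ascore"}),
            row.find("span", {"class": "opp2"}),
        ]
        # Assumption: rows without all four spans are not finished matches,
        # so skip them instead of calling .text on None.
        if any(cell is None for cell in cells):
            continue
        matches.append(tuple(cell.text.strip() for cell in cells))
    return matches
```

Calling `extract_matches(soup)` once for page 1 and once inside the loop for every later page, then extending a single list with the results, would gather the whole match history.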
    

    【讨论】:

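    • So what happens when there are no more pages to go through? Will it crash?
    • @DJRodrigue No, the for page_number in range(2, last_page_number+1) loop limits the iteration to the range from the first page to the last page.
    • It seems to give me an error: last_page_link = soup.find("span", text="Last").parent['href'] AttributeError: 'NoneType' object has no attribute 'parent'
    • @DJRodrigue Okay, updated with the complete working code.
    • I think the reason I'm getting the error is that I'm parsing more than just that one team link, and some teams don't have more than one page, so there is no .parent['href']. Could that be why it gives me an error?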
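
The explanation in the last comment is plausible: when `soup.find(...)` matches nothing it returns `None`, and accessing `.parent` on `None` raises exactly that `AttributeError`. Below is a minimal sketch of a guard, assuming (unverified) that single-page teams simply have no "Last" link. The helper names `find_last_page_href`, `last_page_number`, and `page_urls` are hypothetical. It uses `string=`, the name Beautiful Soup 4.4.0 and later give to the older `text=` argument.

```python
import re


def find_last_page_href(soup):
    """Return the href of the 'Last' pagination link, or None if it is absent."""
    last_span = soup.find("span", string="Last")
    if last_span is None or last_span.parent is None:
        return None
    return last_span.parent.get("href")


def last_page_number(last_page_href):
    """Return the page number at the end of the href, or 1 if there is none."""
    if not last_page_href:
        return 1
    match = re.search(r"page=(\d+)$", last_page_href)
    return int(match.group(1)) if match else 1


def page_urls(base_url, last_page_href):
    """List the URL of every page, starting with the unpaginated first page."""
    last = last_page_number(last_page_href)
    # range(2, last + 1) is empty when last == 1, so single-page teams
    # yield only base_url.
    return [base_url] + ["%s?page=%d" % (base_url, n) for n in range(2, last + 1)]
```

With these helpers, `page_urls(url, find_last_page_href(soup))` gives the list of links to request for each team, whether the team has one page of matches or many.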