Bs4 fails when trying to get the next url: answers

【Question title】: Bs4 fails when trying to get the next url
【Posted】: 2023-01-04 06:50:25
【Question description】:

Here is my code:

def parser():
    flag = True
    url = 'https://quotes.toscrape.com'
    while flag:
        responce = requests.get(url)
        soup = BeautifulSoup(responce.text, 'html.parser')
        quote_l = soup.find_all('span', {'class': 'text'})
        q_count = 0
        for i in range(len(quote_l)):
            if q_count >= 5:
                flag = False
                break
            quote = soup.find_all('span', {'class': 'text'})[i]
            if not Quote.objects.filter(quote=quote.string).exists():
                author = soup.find_all('small', {'class': 'author'})[i]
                if not Author.objects.filter(name=author.string).exists():
                    a = Author.objects.create(name=author.string)
                    Quote.objects.create(quote=quote.string, author_id=a.id)
                    q_count += 1
                else:
                    a = Author.objects.get(name=author.string)
                    Quote.objects.create(quote=quote.string, author_id=a.id)
                    q_count += 1


        url += soup.find('li', {'class': 'next'}).a['href']

I need to get the next page, but I get this exception: 'NoneType' object has no attribute 'a'

How can I fix this, and is there a way to optimize my code? Thanks.

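【Question comments】:

Tags: python parsing beautifulsoup html-parsing

【Solution 1】:

Once you reach the last page there is no Next button, so soup.find returns None. You need to check for that exit condition before trying to access the href of the next page. One possibility is to add the following lines before the current last line: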

next_page = soup.find('li', {'class': 'next'})
if not next_page: flag = False  # or return

Or simply return at that point.
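
To see why the guard matters, here is a small sketch that runs the same lookup on two made-up HTML fragments, one with a Next button and one without. The fragments are illustrative only and are not copied from the real site, and `next_href` is a hypothetical helper name, not part of the code above.

```python
from bs4 import BeautifulSoup

# Made-up fragments for illustration; the real page markup may differ.
PAGE_WITH_NEXT = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
LAST_PAGE = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'


def next_href(html):
    """Return the href of the Next link, or None when there is no Next button."""
    soup = BeautifulSoup(html, 'html.parser')
    next_page = soup.find('li', {'class': 'next'})
    # find() returns None when nothing matches, so guard before touching .a
    if next_page is None or next_page.a is None:
        return None
    return next_page.a['href']


print(next_href(PAGE_WITH_NEXT))  # /page/2/
print(next_href(LAST_PAGE))       # None
```

Without the guard, the second call would raise the same 'NoneType' object has no attribute 'a' error as in the question.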

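Of course, you also need to update the last line to use that variable, and make sure you don't keep extending the url with the next-page suffix on every iteration. For example, you could add the suffix during the request call: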

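import requests
from bs4 import BeautifulSoup


def parser():
    flag = True
    url = 'https://quotes.toscrape.com'
    suffix = ''  # path of the page to fetch; empty for the first page

    while flag:
        # Append the suffix only for this request, so the base url never grows
        responce = requests.get(url + suffix)
        soup = BeautifulSoup(responce.text, 'html.parser')
        # other code

        next_page = soup.find('li', {'class': 'next'})
        if not next_page:
            return  # last page: there is no Next button
        suffix = next_page.a['href']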
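
The reason for keeping the base url separate from the suffix can be shown without any network access. This is a minimal sketch, assuming the Next link's href is a site-relative path such as /page/2/ (the hrefs listed below are illustrative values, not fetched from the site). `urljoin` from the standard library is shown as one alternative to string concatenation; it is not what the answer above uses.

```python
from urllib.parse import urljoin

BASE = 'https://quotes.toscrape.com'
hrefs = ['/page/2/', '/page/3/']  # assumed values of the Next link's href

# Appending with += keeps the previous suffix, so the url grows on every page.
url = BASE
accumulated = []
for href in hrefs:
    url += href
    accumulated.append(url)

# Resolving each href against the base yields a fresh, valid url each time.
resolved = [urljoin(BASE, href) for href in hrefs]

print(accumulated[-1])  # https://quotes.toscrape.com/page/2//page/3/
print(resolved[-1])     # https://quotes.toscrape.com/page/3/
```

The malformed url may not return the expected page, which would explain why the Next button lookup fails earlier than the real last page.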

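【Discussion】:

I changed it. Same exception: 'NoneType' object has no attribute 'a'
You are building incorrect next-page urls with +=, because the url is defined before the loop but modified inside it.
It works. Thanks. When my script runs I only need 5 quotes, but I get 10. Can you help me?
It's not clear to me whether you need to visit multiple pages and get 5 quotes from each, or visit one page and get 5 quotes. I also don't know what Quote.objects or the Author one are. Are these custom classes? It helps to make sure your code is runnable as shown and produces the expected error message. If there are lines of code that aren't needed, remove them while keeping the overall logic. Whatever you decide, the error should be reproducible.
I'd suggest opening a new question that includes the minimum code needed to reproduce the current problem, and explains what should happen, what is happening, and what you have tried. If necessary, include the import statements and any other class references in the minimal reproducible example.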