Pagination not working on basic webscraper: answers

【Question title】: Pagination not working on basic webscraper
【Posted】: 2021-08-20 21:49:48
【Question description】:

import scrapy


class BestBooksSpider(scrapy.Spider):
    name = 'best_books'
    page_num = 2
    allowed_domains = [
        'www.goodreads.com/list/show/1.Best_Books_Ever?page=1']
    start_urls = [
        'https://www.goodreads.com/list/show/1.Best_Books_Ever?page=1']

    def parse(self, response):
        page_num = 2
        for books in response.xpath('//tr'):
            yield {
                'Title': books.css('a.bookTitle span::text').get(),
                'Author': books.css('a.authorName *::text').get(),
                'Rating': books.css('span.minirating::text').get(),
            }

        # this part is not working, won't read past page 1

        next_page = 'https://www.goodreads.com/list/show/1.Best_Books_Ever?page=' + \
            str(BestBooksSpider.page_num)
        if BestBooksSpider.page_num < 3:
            BestBooksSpider.page_num += 1
            yield response.follow(next_page, callback=self.parse)

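The first page is scraped fine, but the spider never reads the following pages. I have tried many variations of this code from other tutorials, all without success. Scrapy does not report any errors; it simply says it has finished.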

【Comments】:

What does the log say? Your allowed_domains is wrong to begin with…
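
To illustrate the comment's point about allowed_domains, here is a rough standard-library sketch of a hostname-based offsite check. It is a simplified stand-in written for this page, not Scrapy's actual source; the helper name host_is_allowed is hypothetical. The assumption is that the filter compares only the request's hostname against each allowed entry, so an entry that contains a path and query string can never match.

```python
import re
from urllib.parse import urlparse


def host_is_allowed(url, allowed_domains):
    """Simplified stand-in for an offsite check (hypothetical helper).

    Only the hostname of the URL is compared against each allowed
    domain and its subdomains; the path and query are ignored.
    """
    host = urlparse(url).hostname or ''
    pattern = r'^(.*\.)?(%s)$' % '|'.join(
        re.escape(domain) for domain in allowed_domains)
    return re.search(pattern, host) is not None


next_page = 'https://www.goodreads.com/list/show/1.Best_Books_Ever?page=2'

# The value from the question includes a path and a query string, so it
# can never equal a bare hostname such as 'www.goodreads.com'.
bad = ['www.goodreads.com/list/show/1.Best_Books_Ever?page=1']
good = ['goodreads.com']

print(host_is_allowed(next_page, bad))   # False
print(host_is_allowed(next_page, good))  # True
```

As far as I understand, requests built from start_urls bypass this kind of filtering, which would explain why page 1 loads while every follow-up request is dropped without an error. Running the spider with debug logging should show whether requests are being filtered as offsite.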

Tags: python pagination scrapy

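【Solution 1】:

Your allowed_domains does look like the likely reason pagination is not working. allowed_domains expects bare domain names (for example goodreads.com), not URLs or paths.
With allowed_domains = ['www.goodreads.com/list/show/1.Best_Books_Ever?page=1'], the scraper ends up restricted to the first page, so go ahead and remove that line and try your spider again.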

【Discussion】: