Scrapy pagination fails after page 2

【Question title】: Scrapy pagination fails after page 2
【Posted】: 2019-09-05 15:28:34
【Question】:

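I am building a spider that crawls every page here: http://web.archive.org/web/20141217173753/http://www.docstoc.com/documents/legal/ and returns only the card titles. The way I expect it to work, it should collect all items from the start page, then follow the "Next" pagination link (class "BookEnd") and repeat until there is no such link.

What do I need to change to make the pagination work?

I am new to web scraping. I already got this spider working by entering every page into start_urls by hand, but I would like to make it more automatic.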

#!/usr/bin/env python3

import scrapy
from scrapy.http import Request

class TypeSpider(scrapy.Spider):
    name = "types"
    start_urls = ["https://web.archive.org/web/20141217173745/http://www.docstoc.com/documents/legal"]

    def parse(self, response):
        for card1 in response.xpath("//*[@class='doc-title']"):
            text = card1.xpath(".//a/text()").extract_first()
            yield{"Title": text}
        for card2 in response.xpath("//*[@class='col-sm-10']"):
            text = card2.xpath(".//h3/text()").extract_first()
            yield{"Title": text}
        next_page = response.css("li.BookEnd > a::attr(href)").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)

I expect the spider to crawl all 34 pages, but it quits after page 2:

DEBUG: Filtered duplicate request: <GET https://web.archive.org/web/20141217173750/http://www.docstoc.com/documents/legal/2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

Setting dont_filter did not work for me.

P.S. I use both XPath and CSS here only because I could not extract the pagination link with XPath; I do not know why.

Tags: python python-3.x callback scrapy

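【Solution 1】:

The CSS selector you use for the next page actually points to the previous page whenever you are not on the first page. extract_first() returns the first matching link, so from page 2 onward the spider goes backward instead of forward, and the repeated request is then dropped by the duplicate filter. The fix is to take the last match instead: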

next_page = response.css("li.BookEnd > a::attr(href)").extract()[-1]
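One caveat with indexing by [-1] is that it raises IndexError if the selector matches nothing. The following is a minimal sketch of a guarded variant, not part of the original answer: pick_next_href is a hypothetical helper name, the href lists are made-up examples shaped like the output of .extract(), and it assumes the "Next" link, when present, is the last li.BookEnd anchor on the page.

```python
from typing import List, Optional


def pick_next_href(hrefs: List[str]) -> Optional[str]:
    """Return the last href in the list, or None if the list is empty.

    Assumption: when a page has both "Previous" and "Next" links,
    "Next" is the last one matched by the selector.
    """
    return hrefs[-1] if hrefs else None


# Hypothetical lists, shaped like the result of
# response.css("li.BookEnd > a::attr(href)").extract()
first_page = ["/documents/legal/2"]                          # only "Next"
middle_page = ["/documents/legal/1", "/documents/legal/3"]   # "Previous", "Next"
no_links = []                                                # selector matched nothing

print(pick_next_href(first_page))   # /documents/legal/2
print(pick_next_href(middle_page))  # /documents/legal/3
print(pick_next_href(no_links))     # None
```

Inside parse, the helper would take the place of the indexing, for example next_page = pick_next_href(response.css("li.BookEnd > a::attr(href)").extract()), and the existing "if next_page is not None" check can stay as it is. Note that on a last page carrying only a "Previous" link this still returns that link, so the crawl then depends on Scrapy's duplicate filter to stop.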
