【Question title】: Getting all instances of a 404 error with scrapy
【Posted】: 2018-03-22 10:01:35
【Question description】:

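I have Scrapy crawling my site, finding links that return a 404 response, and writing those links to a JSON file. This works really well.

However, I can't figure out how to get every instance of a bad link, because the duplicate filter catches these links and they are never requested again.

Since our site has thousands of pages and its sections are managed by several teams, I need to be able to create a bad-link report for each section, rather than finding one bad link and doing a search-and-replace across the whole site.

Any help would be greatly appreciated.

My current crawler: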

from datetime import datetime

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

# Add Items for exporting to JSON
class DevelopersLinkItem(Item):
    url = Field()
    referer = Field()
    link_text = Field()
    status = Field()
    time = Field()

class DevelopersSpider(CrawlSpider):
    """Subclasses CrawlSpider to crawl the given site and export each broken link to JSON"""

    # Spider name to be used when calling from the terminal
    name = "developers_prod"

    # Allow only the given host name(s)
    allowed_domains = ["example.com"]

    # Start crawling from this URL
    start_urls = ["https://example.com"]

    # Which status should be reported
    handle_httpstatus_list = [404]

    # Rules on how to extract links from the DOM, which URLs to deny, and which callback to use
    rules = (Rule(LxmlLinkExtractor(deny=([
        '/android/'])), callback='parse_item', follow=True),)

    # Called for each requested page and used for parsing the response
    def parse_item(self, response):
        if response.status == 404:
            item = DevelopersLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['link_text'] = response.meta.get('link_text')
            item['status'] = response.status
            # self.now is not defined anywhere in the posted code; use the current time instead
            item['time'] = datetime.now().strftime("%Y-%m-%d %H:%M")

            return item

I've tried a few custom dupe filters, but in the end none of them worked.

【Comments】:

    Tags: python hyperlink scrapy duplicates http-status-code-404


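    【Solution 1】:

    If I understand your question correctly, the requests generated by your CrawlSpider are being dropped by the duplicate filter, which is enabled by default. You can use the process_request argument of the Rule class to set dont_filter=True on each request (https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Rule)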

    【Discussion】:
