【Question title】: Why does running multiple Scrapy spiders through CrawlerProcess cause the spider_idle signal to fail?
【Posted】: 2019-06-17 22:43:19
【Question】:

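I need to make thousands of requests that require a session token for authorization.

Queuing all the requests at once causes thousands of them to fail, because the session token expires before the later requests are sent.

So I issue a reasonable number of requests at a time, few enough to reliably complete before the session token expires.

When a batch of requests completes, the spider_idle signal is triggered.

If further requests are needed, the signal handler requests a new session token to use for the next batch.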
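
The batching idea described above could be sketched roughly as follows. This is only an illustration, not the code from the post: the helper name `batches`, the example.com URLs, and the batch size of 3 are all hypothetical, and a real spider would pick a batch size that fits inside the token lifetime.

```python
from itertools import islice


def batches(items, size):
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder URLs; the real request list is not shown in the post.
urls = ["https://example.com/item/%d" % i for i in range(7)]
for batch in batches(urls, 3):
    # In a spider, each batch would be queued from the spider_idle handler
    # after a fresh session token has been obtained.
    print(len(batch), "requests in this batch")
```

Because `batches` is a generator, the spider_idle handler can simply call `next()` on it each time the spider goes idle and let the spider close once it is exhausted.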

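This works when running a single spider normally, or a single spider through CrawlerProcess.

However, the spider_idle signal fails when multiple spiders are run through CrawlerProcess.

One spider executes the spider_idle signal as expected, but the others fail with the following exception: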

2019-06-14 10:41:22 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_idle of <SpideIdleTest None at 0x7f514b33c550>>
Traceback (most recent call last):
  File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "fails_with_multiple_spiders.py", line 25, in spider_idle
    spider)
  File "/home/loren/.virtualenv/spider_idle_test/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 209, in crawl
    "Spider %r not opened when crawling: %s" % (spider.name, request)

I created a repo showing that spider_idle behaves as expected with a single spider and fails with multiple spiders.

https://github.com/loren-magnuson/scrapy_spider_idle_test

Here is the version that shows the failure:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher


class SpiderIdleTest(scrapy.Spider):
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    def __init__(self):
        dispatcher.connect(self.spider_idle, signals.spider_idle)
        self.idle_retries = 0

    def spider_idle(self, spider):
        self.idle_retries += 1
        if self.idle_retries < 3:
            self.crawler.engine.crawl(
                Request('https://www.google.com',
                        self.parse,
                        dont_filter=True),
                spider)
            raise DontCloseSpider("Stayin' alive")

    def start_requests(self):
        yield Request('https://www.google.com', self.parse)

    def parse(self, response):
        print(response.css('title::text').extract_first())


process = CrawlerProcess()
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.crawl(SpiderIdleTest)
process.start()

【Comments】:

    Tags: scrapy


    【Solution 1】:

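    I tried using billiard as an alternative way to run multiple spiders at the same time.

    After using billiard's Process to run the spiders concurrently, the spider_idle signal still failed, but with a different exception.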

    Traceback (most recent call last):
      File "/home/louis_powersports/.virtualenv/spider_idle_test/lib/python3.6/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
        *arguments, **named)
      File "/home/louis_powersports/.virtualenv/spider_idle_test/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
        return receiver(*arguments, **named)
      File "test_with_billiard_process.py", line 25, in spider_idle
        self.crawler.engine.crawl(
    AttributeError: 'SpiderIdleTest' object has no attribute 'crawler'
    

    This led me to try changing:

    self.crawler.engine.crawl(
    Request('https://www.google.com',
            self.parse,
            dont_filter=True),
    spider)

    to:
    

    spider.crawler.engine.crawl(
    Request('https://www.google.com',
            self.parse,
            dont_filter=True),
    spider)
    

    This worked.

    billiard is not required. With the change above, the original attempt based on the Scrapy documentation works.
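
A likely explanation for why the change helps: `dispatcher.connect` registers the handler process-wide, so every spider's handler is called for every spider's idle signal, and `self` is not necessarily the spider that went idle. The toy model below is not Scrapy code. `ToyEngine`, `ToySpider`, and `run_scenario` are hypothetical stand-ins that only mimic the "spider not opened" check from the traceback.

```python
class ToyEngine:
    """Stand-in for one crawler's engine; it only knows its own spider."""

    def __init__(self):
        self.open_spiders = set()
        self.scheduled = []

    def crawl(self, request, spider):
        # Mimics the check behind the error message in the question.
        if spider not in self.open_spiders:
            raise AssertionError(
                "Spider %r not opened when crawling: %s" % (spider.name, request)
            )
        self.scheduled.append(request)


class ToySpider:
    def __init__(self, name, handlers, use_spider_arg):
        self.name = name
        self.use_spider_arg = use_spider_arg
        self.engine = ToyEngine()
        self.engine.open_spiders.add(self)
        # Process-wide registration: this handler sees every spider's signal.
        handlers.append(self.spider_idle)

    def spider_idle(self, spider):
        # Either trust the `spider` argument (the fix) or assume self is idle.
        owner = spider if self.use_spider_arg else self
        owner.engine.crawl("next-batch", spider)


def run_scenario(use_spider_arg):
    """Send one idle signal for the first of three spiders to all handlers."""
    handlers = []
    spiders = [ToySpider("spider-%d" % i, handlers, use_spider_arg) for i in range(3)]
    idle = spiders[0]
    errors = []
    for handler in handlers:
        try:
            handler(idle)
        except AssertionError as exc:
            errors.append(str(exc))
    return errors, len(idle.engine.scheduled)


print(run_scenario(use_spider_arg=False))
print(run_scenario(use_spider_arg=True))
```

In this toy model, going through `self` makes two of the three handlers fail, while going through the `spider` argument makes none fail. It also shows a side effect worth keeping in mind: with process-wide registration, every handler schedules a request on the idle spider's engine, so one idle event produces one request per registered handler.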

    Working version of the original:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    from scrapy import Request, signals
    from scrapy.exceptions import DontCloseSpider
    from scrapy.xlib.pydispatch import dispatcher
    
    
    class SpiderIdleTest(scrapy.Spider):
        custom_settings = {
            'CONCURRENT_REQUESTS': 1,
            'DOWNLOAD_DELAY': 2,
        }
    
        def __init__(self):
            dispatcher.connect(self.spider_idle, signals.spider_idle)
            self.idle_retries = 0
    
        def spider_idle(self, spider):
            self.idle_retries += 1
            if self.idle_retries < 3:
                spider.crawler.engine.crawl(
                    Request('https://www.google.com',
                            self.parse,
                            dont_filter=True),
                    spider)
                raise DontCloseSpider("Stayin' alive")
    
        def start_requests(self):
            yield Request('https://www.google.com', self.parse)
    
        def parse(self, response):
            print(response.css('title::text').extract_first())
    
    
    process = CrawlerProcess()
    process.crawl(SpiderIdleTest)
    process.crawl(SpiderIdleTest)
    process.crawl(SpiderIdleTest)
    process.start()
    

