【Question title】: Scrapy 'twisted.internet.error.ReactorNotRestartable' error after first run
【Posted】: 2017-12-21 14:10:44
【Question】:

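I am running Scrapy (version 1.4.0) from a script using CrawlerProcess. The URLs come from user input. The first run works fine, but the second run raises twisted.internet.error.ReactorNotRestartable, and the program gets stuck there.

The CrawlerProcess part: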

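from scrapy.crawler import CrawlerProcess

# GeneralSpider is the spider class defined elsewhere in the project.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(GeneralSpider)

print('~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~')
process.start()  # blocks until the crawl finishes; this is the call that fails on the second run
print('~~~~~~~~~~~~ Processing ended ~~~~~~~~~~')
process.stop()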

This is the output of the first run:

~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~
2017-07-17 05:58:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.some-url.com/content.php> (referer: None)
2017-07-17 05:58:46 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'HtmlResponse' in <GET http://www.some-url.com/content.php>
2017-07-17 05:58:46 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-17 05:58:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 14223,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 17, 5, 58, 46, 760661),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'memusage/max': 49983488,
 'memusage/startup': 49983488,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 7, 17, 5, 58, 45, 162155)}
2017-07-17 05:58:46 [scrapy.core.engine] INFO: Spider closed (finished)
~~~~~~~~~~~~ Processing ended ~~~~~~~~~~

When I try to run it a second time, it raises this error:

~~~~~~~~~~~~ Processing is going to be started ~~~~~~~~~~
[2017-07-17 06:03:18,075] ERROR in app: Exception on /scripts/1/process [GET]
Traceback (most recent call last):
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "api.py", line 13, in process_crawler
    processor.process()
  File "/var/www/python/crawlerapp/application/scripts/general_spider.py", line 124, in process
    process.start()
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "/var/www/python/crawlerapp/appenv/lib/python3.5/site-packages/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

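How can I restart the reactor, or stop it, after each crawl finishes?

There are similar questions on Stack Overflow, but their solutions target older versions of Scrapy and do not work for me.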

【Comments】:

Tags: python python-3.x scrapy twisted scrapy-spider


【Answer 1】:

Try running your crawl function in a separate process:

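from multiprocessing import Process

from scrapy.crawler import CrawlerProcess

def crawl():
    # settings and MySpider come from the surrounding project.
    crawler = CrawlerProcess(settings)
    crawler.crawl(MySpider)
    crawler.start()  # the reactor starts and stops inside the child process

# Each crawl gets its own process, and therefore a fresh reactor.
process = Process(target=crawl)
process.start()
process.join()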
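
Since the question runs the crawl from a Flask view that is called repeatedly, the same idea can be wrapped in a small reusable helper. The following is only a sketch, not part of the original answer: the names run_in_child and crawl_once are hypothetical, and it assumes the spider class and settings can be handed to the child process.

```python
from multiprocessing import Process


def run_in_child(func, *args):
    """Run func(*args) in a new process and return that process's exit code."""
    child = Process(target=func, args=args)
    child.start()
    child.join()  # wait for the child to finish
    return child.exitcode  # 0 on success, 1 if func raised an exception


def crawl_once(spider_cls, settings):
    """Runs inside the child, so the reactor it starts is new every time."""
    # Imported here so that only the child process loads Scrapy and Twisted.
    from scrapy.crawler import CrawlerProcess

    crawler = CrawlerProcess(settings)
    crawler.crawl(spider_cls)
    crawler.start()  # blocks until the crawl finishes
```

A view could then call run_in_child(crawl_once, GeneralSpider, {'USER_AGENT': '...'}) on every request and treat a nonzero return value as a failed run. Two caveats: with the "spawn" start method (the default on Windows and macOS) the function and its arguments must be picklable, so the spider class should live at module level; and errors that Scrapy only logs inside the spider, like the one in the first run's output, will most likely not change the exit code.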

【Comments】:

  • I was stuck for hours. I was trying to run Scrapy spiders on lambda and tried almost everything. Nothing worked. Then I tried your solution and it works great. Thanks a lot!
【Answer 2】:

You can change the start call to this line:

process.start(stop_after_crawl=False)

Hope this solves your problem.

Thanks

【Comments】:

  • Tried it, but it gets stuck there. The process does not stop and keeps running.