Can I run a Scrapy spider with different settings in different processes (in parallel)? Answers

【Question title】: Can I run a Scrapy spider with different settings in different processes (in parallel)?
【Posted】: 2017-04-27 10:57:05
【Question description】:

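I have defined a spider with name='myspider' whose behavior varies depending on the settings. I want to run different instances of this spider in different processes. Is that possible?

I checked the source code, and it seems that SpiderLoader simply walks the spiders module, so I can only run one spider with a given name at a time.

The code I run looks like this: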

for item in items:
    settings = get_project_settings()
    settings.set('item', item)
    settings.set('DEFAULT_REQUEST_HEADERS', item.get('request_header'))
    process = CrawlerProcess(settings)
    process.crawl("myspider")
    process.start()

Unsurprisingly, it fails with this error:

Traceback (most recent call last):
  File "/home/xuanqi/workspace/github/foolcage/fospider/fospider/main.py", line 44, in <module>
    process.start()  # the script will block here until the crawling is finished
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

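Thanks in advance for your help!

【Discussion】:

Tags: scrapy

【Solution 1】:

Settings cannot be changed at runtime. I suggest using spider arguments to pass the values that differ between runs to the spider.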

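from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("myspider", request_headers='specified headers...')
process.start()  # blocks until the crawl has finished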

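For this to work, you have to override the spider's __init__ method so that it accepts these variables, and then pass request_headers to every Request object the spider creates.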

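import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    def __init__(self, **kw):
        super(MySpider, self).__init__(**kw)
        self.headers = kw.get('request_headers')  # value given to process.crawl()
        ...
    def start_requests(self):
        yield scrapy.Request(url='http://www.example.com', headers=self.headers)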
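
As for the ReactorNotRestartable error in the question: process.start() runs the Twisted reactor, and a reactor cannot be started a second time in the same process, so calling it inside the loop fails on the second iteration. A rough sketch of two ways around that is below. It assumes each item is a dict with a 'request_header' key, as in the question. The helper names spider_kwargs, crawl_all and crawl_in_separate_processes are made up for illustration and are not part of Scrapy.

```python
import multiprocessing


def spider_kwargs(item):
    """Map one work item (assumed to be a dict) to spider arguments."""
    return {'request_headers': item.get('request_header') or {}}


def crawl_all(items, spider_name='myspider'):
    """Schedule one crawl per item, then start the reactor exactly once."""
    # Imported lazily so that each process sets up its own Twisted reactor.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for item in items:
        # Keyword arguments are forwarded to the spider's __init__.
        process.crawl(spider_name, **spider_kwargs(item))
    process.start()  # blocks until every scheduled crawl has finished


def crawl_in_separate_processes(items, spider_name='myspider'):
    """Alternative: one OS process per item, each with its own reactor."""
    workers = [
        multiprocessing.Process(target=crawl_all, args=([item], spider_name))
        for item in items
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

crawl_all(items) should be enough when only spider arguments differ, since all crawls then share one reactor. crawl_in_separate_processes(items) is closer to what the question asks for. Because every child builds its own CrawlerProcess, it could presumably also adjust its settings object before the crawl starts. On platforms that spawn rather than fork, call it from under an if __name__ == '__main__': guard.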

【Discussion】: