【Question title】: Can I run a Scrapy spider with different settings in different processes (in parallel)?
【Posted】: 2017-04-27 10:57:05
【Question description】:

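I have defined a spider with name='myspider' whose behavior varies depending on the settings. I would like to run different instances of this spider in different processes. Is that possible?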

I checked the source code, and it seems that SpiderLoader simply walks the spiders module, so I can only run one spider with a given name at a time.

The code I run looks roughly like this:

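from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

for item in items:
    settings = get_project_settings()
    settings.set('item', item)
    settings.set('DEFAULT_REQUEST_HEADERS', item.get('request_header'))
    process = CrawlerProcess(settings)
    process.crawl("myspider")
    process.start()  # blocks until the crawl finishes; the second iteration fails here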

As expected, it fails with this error:

Traceback (most recent call last):
  File "/home/xuanqi/workspace/github/foolcage/fospider/fospider/main.py", line 44, in <module>
    process.start()  # the script will block here until the crawling is finished
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Thanks in advance for your help!

【Question comments】:

    Tags: scrapy


    【Solution 1】:

    Settings cannot be changed at runtime. I suggest using spider arguments to pass the different values to the spider.

    process = CrawlerProcess(settings)
    process.crawl("myspider", request_headers='specified headers...')
    process.start()
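
For the loop in the question, one way to avoid ReactorNotRestartable is to schedule one crawl per item and start the process only once, because the Twisted reactor cannot be restarted within the same Python process. The following is a minimal sketch, not a definitive implementation. It assumes that `items` is a list of dicts with a `request_header` key, as in the question. The helper names `crawl_kwargs_for` and `run_all` are hypothetical. Scrapy is imported inside `run_all` so the helper can be checked on its own.

```python
def crawl_kwargs_for(item):
    """Map one item (a dict, as in the question's loop) to spider arguments."""
    return {"request_headers": item.get("request_header")}


def run_all(items, spider_name="myspider"):
    # Imported lazily so the helper above works without loading Scrapy.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for item in items:
        # Each call schedules a separate spider instance with its own arguments.
        process.crawl(spider_name, **crawl_kwargs_for(item))
    # Start the Twisted reactor exactly once; blocks until all crawls finish.
    process.start()
```

Calling `run_all(items)` from a script inside the Scrapy project runs all spider instances concurrently in a single process, each with its own `request_headers`.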
    

    To use spider arguments, override the spider's __init__ method so that it accepts these variables, then pass request_headers to every Request object the spider creates.

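    class MySpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, **kw):
            super(MySpider, self).__init__(**kw)
            self.headers = kw.get('request_headers')
            ...

        def start_requests(self):
            yield scrapy.Request(url='http://www.example.com', headers=self.headers)  # a URL scheme is required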
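
If each run really needs its own settings and its own operating-system process, as the question title asks, a possible alternative is to start one child process per item with the standard `multiprocessing` module. Each child then gets a fresh Twisted reactor, and the settings are applied before the crawler is created, not changed while it runs. This is a sketch under assumptions: the helper names `settings_overrides_for`, `crawl_in_child` and `run_in_parallel` are hypothetical, `items` is assumed to be a list of picklable dicts, and Scrapy is imported only inside the child.

```python
from multiprocessing import Process


def settings_overrides_for(item):
    """Per-process setting overrides derived from one item (hypothetical helper)."""
    headers = item.get("request_header")
    return {"DEFAULT_REQUEST_HEADERS": headers} if headers else {}


def crawl_in_child(item, spider_name="myspider"):
    # Runs inside the child process, which has its own Twisted reactor.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    for key, value in settings_overrides_for(item).items():
        settings.set(key, value)  # applied before the crawler is created
    process = CrawlerProcess(settings)
    process.crawl(spider_name)
    process.start()  # blocks this child until its crawl finishes


def run_in_parallel(items):
    # One child process per item; wait for all of them to exit.
    workers = [Process(target=crawl_in_child, args=(item,)) for item in items]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

On platforms where `multiprocessing` spawns new interpreters by default, `run_in_parallel(items)` should be called under an `if __name__ == "__main__":` guard.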
    

    【Comments】:
