【Question title】: Restart scrapy for each url in list
【Posted】: 2021-02-13 23:56:51
【Question】:

I am trying to run a Scrapy bot that runs the spider once for each URL in a given list. The code I have written so far is as follows:

def run_spider(url_list,allowed_list):
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl('scraper',start_urls=url_list, allowed_domains=allowed_list)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()



for start, allowed in zip(start_url,allowedUrl):
    url_list = []
    allowed_list = []
    url_list.append(start)
    allowed_list.append(allowed)
    print(type(url_list),type(allowed_list))
    run_spider(url_list,allowed_list) 

The spider itself runs fine on the first URL, but as soon as the loop comes around again it raises twisted.internet.error.ReactorNotRestartable. The full traceback is here:

Traceback (most recent call last):
  File "C:\brox\Crawler\main.py", line 34, in <module>
    run_spider(url_list,allowed_list)
  File "C:\brox\Crawler\main.py", line 24, in run_spider
    reactor.run()
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

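I followed the approach described in the documentation, but how do I restart the spider for each item in the loop? Any suggestions would be very helpful.

P.S.: The spider itself works fine when I simply pass it the allowed domains and start URLs.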

【Comments】:

    Tags: python scrapy twisted


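    【Answer 1】:

    To make your code work, you have to rearrange the reactor.run() / reactor.stop() logic so that the reactor is started and stopped only once. Here is an example of how to solve the problem: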

    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet.defer import gatherResults
    from twisted.internet import reactor
    
    def run_spider(url_list, allowed_list):
        """
        Schedule one crawl without starting the reactor.

        :returns: Deferred
        """
        runner = CrawlerRunner(get_project_settings())
        return runner.crawl('scraper', start_urls=url_list, allowed_domains=allowed_list)
    
    d_list = []
    for start, allowed in zip(start_url, allowedUrl):
        # ... your logic ...
        # Wrap each value in a one-element list and append the Deferred to a list.
        d_list.append(run_spider([start], [allowed]))
    
    # "Join": fires once every crawl has finished, or fails on the first error
    results = gatherResults(d_list)
    # Stop the reactor after all the sites are scraped or a failure occurs
    results.addBoth(lambda _: reactor.stop())
    
    # Start the reactor exactly once; this blocks until reactor.stop() is called
    reactor.run()
    

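    run_spider() returns a Deferred. In the loop, append each Deferred to a list, then "join" them all, or stop processing as soon as a failure occurs (read up on gatherResults). Once all the sites have been scraped, the reactor stops.

    Search the web for ReactorNotRestartable; this has been explained many times.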
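The gatherResults approach schedules every crawl on the same reactor at once, so the crawls overlap. If the crawls should run one after another instead, a minimal sketch using Twisted's inlineCallbacks could look like the following. The helper names build_jobs and crawl_sequentially are hypothetical, the spider name 'scraper' is taken from the question, and the Scrapy and Twisted imports are kept inside the function as an assumption so that the pure helper can be used without them. Depending on the Scrapy version and its TWISTED_REACTOR setting, the reactor may need to be installed differently.

```python
def build_jobs(start_urls, allowed_domains):
    """Pair each start URL with its allowed domain as crawl() keyword arguments."""
    return [
        {"start_urls": [start], "allowed_domains": [allowed]}
        for start, allowed in zip(start_urls, allowed_domains)
    ]


def crawl_sequentially(jobs, spider_name="scraper"):
    """Run one crawl per job, one after another, on a single reactor run."""
    if not jobs:
        # Nothing to crawl; never start the reactor.
        return None

    # Imported lazily (an assumption of this sketch) so build_jobs()
    # stays usable even where Scrapy is not installed.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import defer, reactor

    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl_all():
        for kwargs in jobs:
            # Each yield waits for the previous crawl's Deferred to fire.
            yield runner.crawl(spider_name, **kwargs)

    d = crawl_all()
    # Stop the reactor when the last crawl finishes or any crawl fails.
    d.addBoth(lambda _: reactor.stop())
    # Started exactly once, so ReactorNotRestartable cannot occur here.
    reactor.run()
    return None
```

With the question's variables this would be called as crawl_sequentially(build_jobs(start_url, allowedUrl)). The reactor still runs only once; only the scheduling of the crawls changes.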

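    【Comments】:

      【Answer 2】:

      I am new to Scrapy too. But I would write this crawler without restarting it (restarting is problematic because it requires restarting the reactor). Something like this: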

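      import scrapy
      from scrapy.crawler import CrawlerProcess

      class MySpider(scrapy.Spider):
          name = 'spidery'
          # Pass the complete lists from the question, not one item at a time
          allowed_domains = allowedUrl
          start_urls = start_url
          
          def parse(self, response):
              # here is what you want to scrape
              pass
      
      crawler = CrawlerProcess()
      crawler.crawl(MySpider)
      # Starts the reactor once and blocks until the crawl is finished
      crawler.start()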
      

      I hope this does it!

      【Comments】:

      • As I described in the question, the spider itself works fine; what I need is a way to stop Scrapy and re-run it for each URL.