【Question title】: How to schedule Scrapy crawl execution programmatically
【Posted】: 2018-05-13 03:29:54
【Question】:

I want to create a scheduler script that runs the same spider multiple times in sequence.

So far I have the following:

#!/usr/bin/python3
"""Scheduler for spiders."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """Job to start spiders."""
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(DealsSpider)
    process.start() # the script will block here until the end of the crawl


if __name__ == '__main__':

    while True:
        crawl_job()
        time.sleep(30) # wait 30 seconds then crawl again

The spider runs correctly the first time. After the delay it starts again, but before it begins scraping I get the following error:

Traceback (most recent call last):
  File "scheduler.py", line 27, in <module>
    crawl_job()
  File "scheduler.py", line 17, in crawl_job
    process.start() # the script will block here until the end of the crawl
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Unfortunately, I'm not familiar with the Twisted framework and its reactors, so any help would be appreciated!

【Question discussion】:

    Tags: python-3.x web-scraping scrapy twisted


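    【Solution 1】:

    You get the ReactorNotRestartable error because a Twisted reactor cannot be started more than once in the same process. Every call to process.start() tries to start the reactor, so the second call fails. There is plenty of information about this online. Here is a simple solution: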

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    
    from my_project.spiders.deals import DealsSpider
    
    
    def crawl_job():
        """
        Job to start spiders.
        Return Deferred, which will execute after crawl has completed.
        """
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        return runner.crawl(DealsSpider)
    
    def schedule_next_crawl(_result, sleep_time):
        """
        Schedule the next crawl.
        _result is the value the crawl Deferred fired with; it is ignored.
        """
        reactor.callLater(sleep_time, crawl)
    
    def crawl():
        """
        A "recursive" function that schedules a crawl 30 seconds after
        each successful crawl.
        """
        # crawl_job() returns a Deferred
        d = crawl_job()
        # call schedule_next_crawl(<result of the crawl Deferred>, 30) once the crawl job is complete
        d.addCallback(schedule_next_crawl, 30)
        d.addErrback(catch_error)
    
    def catch_error(failure):
        print(failure.value)
    
    if __name__=="__main__":
        crawl()
        reactor.run()
    

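    There are a few notable differences from your snippet. The reactor is run directly, CrawlerProcess is replaced with CrawlerRunner, time.sleep has been removed so that the reactor is not blocked, and the while loop has been replaced with repeated calls to the crawl function via callLater. It is short and should do what you want. If any part confuses you, let me know and I will elaborate.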
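Since the restriction is one reactor start per process, another way to sidestep it is to run every crawl in a fresh child process, which is essentially what cron does. The following is only a rough sketch of that idea and not part of the answer's method: the helper name run_repeatedly, the spider name "deals", and the assumption that the script is started from the Scrapy project directory are all illustrative guesses.

```python
import subprocess
import sys
import time

# Assumption: the spider is registered under the name "deals" and this
# script is launched from the Scrapy project directory.
CRAWL_COMMAND = [sys.executable, "-m", "scrapy", "crawl", "deals"]


def run_repeatedly(command, runs, delay_seconds, sleep=time.sleep):
    """Run `command` in a new child process `runs` times, one after another.

    Each child process gets its own Twisted reactor, so nothing is ever
    restarted. Returns the list of exit codes, one per run.
    """
    return_codes = []
    for i in range(runs):
        completed = subprocess.run(command)  # blocks until the child exits
        return_codes.append(completed.returncode)
        if i < runs - 1:
            sleep(delay_seconds)  # pause only between runs, not after the last
    return return_codes


# Example call (not executed here):
# run_repeatedly(CRAWL_COMMAND, runs=3, delay_seconds=30)
```

The trade-off is the cost of starting a new interpreter for each crawl, in exchange for not having to touch the reactor at all.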

    Update - crawling at a specific time

    import datetime as dt
    
    def schedule_next_crawl(_result, hour, minute):
        tomorrow = (
            dt.datetime.now() + dt.timedelta(days=1)
            ).replace(hour=hour, minute=minute, second=0, microsecond=0)
        sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
        reactor.callLater(sleep_time, crawl)
    
    def crawl():
        d = crawl_job()
        # crawl every day at 1:30 pm (13:30)
        d.addCallback(schedule_next_crawl, hour=13, minute=30)
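The version above always waits until the given time on the following day. If the crawl should also run later the same day when that time has not passed yet, the delay could be computed along these lines. This is a hedged variation, not the answer's code: the helper name seconds_until and its now parameter are made up for illustration, and it works on naive local time, so a daylight-saving change could shift a run by an hour.

```python
import datetime as dt


def seconds_until(hour, minute, now=None):
    """Return the seconds from `now` until the next occurrence of hour:minute.

    `now` defaults to the current local time; passing it explicitly makes
    the function easy to check with fixed timestamps.
    """
    if now is None:
        now = dt.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The time has already passed today, so aim for tomorrow.
        target += dt.timedelta(days=1)
    return (target - now).total_seconds()


# Early morning: the 13:30 target is still ahead on the same day.
print(seconds_until(13, 30, now=dt.datetime(2018, 5, 13, 3, 29, 54)))  # 36006.0
```

The result could then be passed to reactor.callLater in place of sleep_time.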
    

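    【Discussion】:

    • I ran the script with python3 scheduler.py, but it just sits idle and does nothing. What could be the problem?
    • Without digging into the code, it is hard to know where the problem is. Put print statements or breakpoints in the functions to see where it idles.
    • I re-ran the script from home, and now it raises an exception like this: Unhandled error in Deferred: Traceback (most recent call last): File "scheduler.py", line 20, in crawl_job return runner.crawl(DealsSpider)
    • Added an example of how to crawl at a specific time. In the example, it will crawl at 1:30 pm (13:30) the next day. However, consider using cron to schedule the task.
    • Thanks for the advice and the update! I currently use cron, by the way. I wanted to use CrawlerRunner together with the schedule package, but it seems I have to fall back to a home-made scheduler to avoid conflicts with Twisted.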