Running a Scrapy spider in the background of a Flask app: answers

【Question title】: Running a scrapy spider in the background in a Flask app
【Posted】: 2014-03-20 17:17:01
【Question】:

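I'm building an application that uses Flask and Scrapy. When the root URL of my application is accessed, it processes some data and displays it. In addition, I want to (re)start my spider if it isn't already running. Since my spider takes about 1.5 hours to finish, I run it as a background process using threading. Here is a minimal example (you'll also need testspiders):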

import os
from flask import Flask, render_template
import threading
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings    
from testspiders.spiders.followall import FollowAllSpider

def crawl():
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()

app = Flask(__name__)

@app.route('/')
def main():
    run_in_bg = threading.Thread(target=crawl, name='crawler')
    thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]

    if 'crawler' not in thread_names:
        run_in_bg.start()

    return 'hello world'

if __name__ == "__main__":
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)

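As a side note, the following lines are my ad hoc way of determining whether my crawler thread is still running. If there's a more idiomatic approach, I'd appreciate some guidance.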

run_in_bg = threading.Thread(target=crawl, name='crawler')
thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]

if 'crawler' not in thread_names:
    run_in_bg.start()

Moving on to the problem: if I save the above script as crawler.py, run python crawler.py and visit localhost:5000, I get the following error (ignore scrapy's HtmlXPathSelector deprecation errors):

exceptions.ValueError: signal only works in main thread

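Although the spider runs, it doesn't stop, because the signals.spider_closed signal only works in the main thread (according to this error). As expected, subsequent requests to the root URL produce a flood of errors.

How can I design my app to start my spider if it isn't already crawling, while immediately returning control to my app (i.e. I don't want to wait for the crawler to finish) so it can do other things?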

【Comments】:

I've added an answer to a similar question here: stackoverflow.com/questions/36384286/…

Tags: python flask scrapy python-multithreading

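【Solution 1】:

Having Flask start long-running threads like this isn't the best idea.

I'd recommend using a queue system such as celery or rabbitmq. Your Flask application can put the tasks you want done in the background on the queue and then return immediately.

You can then have workers outside your main app process those tasks and do all of your scraping.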

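【Comments】:

Thanks for the tip about Celery. I implemented it, and although it seems to work (and looks better/cleaner than my threading solution), I still run into the same problem: the reactor can't be restarted or sent a stop signal. I'll keep fiddling with it, and if I can get it to work I'll accept.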