【Posted】: 2021-02-13 23:56:51
【Question】:
I am trying to run a Scrapy bot that runs the spider once for each URL in a given list. The code I have written so far is as follows:
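from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

def run_spider(url_list, allowed_list):
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl('scraper', start_urls=url_list, allowed_domains=allowed_list)
    # Stop the reactor once the crawl finishes, whether it succeeded or failed.
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until reactor.stop() is called

# start_url and allowedUrl are lists of equal length (definitions not shown).
for start, allowed in zip(start_url, allowedUrl):
    url_list = []
    allowed_list = []
    url_list.append(start)
    allowed_list.append(allowed)
    print(type(url_list), type(allowed_list))
    run_spider(url_list, allowed_list)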
The spider itself runs fine on the first URL, but as soon as the loop moves on to the next one it raises twisted.internet.error.ReactorNotRestartable. The full traceback is:
Traceback (most recent call last):
  File "C:\brox\Crawler\main.py", line 34, in <module>
    run_spider(url_list,allowed_list)
  File "C:\brox\Crawler\main.py", line 24, in run_spider
    reactor.run()
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 1282, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 1262, in startRunning
    ReactorBase.startRunning(self)
  File "C:\brox\Crawler\venv\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
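I followed the approach described in the documentation, but how can I restart the spider for each item in the loop? Any suggestions would be very helpful.
P.S.: The spider itself works fine when it is simply given the allowed domain and the start URL directly.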
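For reference, the traceback shows that the second call to reactor.run() is what fails: a Twisted reactor cannot be started again after it has been stopped. One possible direction, shown here only as a minimal sketch, is to start the reactor once and chain the crawls inside it. The helper names build_crawl_jobs and run_all are hypothetical and not part of Scrapy; the sketch assumes the spider is registered as 'scraper' in the project, and that the default Twisted reactor matches the project's TWISTED_REACTOR setting, which may not hold on newer Scrapy versions.

```python
from typing import Dict, Iterable, List


def build_crawl_jobs(
    start_urls: Iterable[str], allowed_domains: Iterable[str]
) -> List[Dict[str, List[str]]]:
    """Pair each start URL with its allowed domain as spider keyword arguments.

    Hypothetical helper: it mirrors the one-element lists built in the loop above.
    """
    return [
        {"start_urls": [start], "allowed_domains": [allowed]}
        for start, allowed in zip(start_urls, allowed_domains)
    ]


def run_all(jobs, spider_name="scraper"):
    """Run one crawl per job, one after another, inside a single reactor run."""
    # Imported here because importing twisted.internet.reactor installs the
    # default reactor as a side effect.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import defer, reactor

    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl_sequentially():
        try:
            for job in jobs:
                # Each yield waits for the previous crawl to finish.
                yield runner.crawl(spider_name, **job)
        finally:
            # Stop only after every crawl is done (or one of them failed).
            reactor.stop()

    # Start crawling only once the reactor is actually running.
    reactor.callWhenRunning(crawl_sequentially)
    reactor.run()  # called exactly once; blocks until reactor.stop()
```

With the lists from the question, this would be called once as run_all(build_crawl_jobs(start_url, allowedUrl)) in place of the for loop, so reactor.run() is never reached a second time.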
【Discussion】: