Scrapy - Running Multiple Spiders at Once - CrawlerProcess - File Structure: Answers

【Question title】: Scrapy - Running Multiple Spiders at Once - CrawlerProcess - File Structure
【Posted】: 2020-08-12 17:07:05
【Question description】:

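I am trying to run multiple Scrapy spiders at once with CrawlerProcess, but I am unsure about the file structure. Both spiders work when run individually with scrapy crawl indeed and scrapy crawl monster (the names assigned in my spider classes).

My current file structure is: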

- scrapy
  - tutorial
    - spiders
      - __init__.py
      - indeed_spider.py
      - monster_spider.py
    - __init__.py
    - crawler.py
    - functions.py
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
  - scrapy.cfg

As you can see, my crawler.py sits in the main tutorial directory.

The code in crawler.py is:

from scrapy.crawler import CrawlerProcess
from tutorial.spiders.indeed_spider import IndeedSpider
from tutorial.spiders.monster_spider import MonsterSpider
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(IndeedSpider)
process.crawl(MonsterSpider)
process.start()

When I cd into the tutorial directory and run python crawler.py, I get the following error:

Traceback (most recent call last):
  File "crawler.py", line 3, in <module>
    from tutorial.spiders.indeed_spider import IndeedSpider
ModuleNotFoundError: No module named 'tutorial'

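This is strange, because there clearly is a tutorial module. The Scrapy documentation says nothing about file structure when running multiple spiders at once; it only gives a basic example that is not much help (crawler doc).

My questions are:

How do I run multiple spiders through CrawlerProcess from the command line? It is not scrapy crawl {spider_name}. I assumed it would be python crawler.py, but that does not work with my current structure.
Where in the project directory should crawler.py be stored?
Do pipelines.py or settings.py need any further changes before CrawlerProcess can be started?

Thank you very much for your help!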

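【Question comments】:

Don't cd into the tutorial directory; doing so leaves the folder that contains tutorial off Python's import path. From that parent folder, run python -m tutorial.crawler (note that python tutorial/crawler.py puts tutorial/ itself on sys.path rather than its parent, so the import would still fail), or add the path of the folder containing tutorial to the PYTHONPATH environment variable.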
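The behavior described in the comment above can be reproduced without Scrapy. The following is a minimal sketch, assuming a throwaway package named demo_pkg that mirrors the tutorial layout; demo_pkg, demo_spider, and the helper functions are hypothetical names used only for illustration. It runs the same crawler.py once as a script from inside the package directory and once as a module from the parent directory.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_demo_project(root: Path) -> Path:
    """Create root/demo_pkg with a spiders subpackage and a crawler.py script."""
    pkg = root / "demo_pkg"
    (pkg / "spiders").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "spiders" / "__init__.py").write_text("")
    (pkg / "spiders" / "demo_spider.py").write_text("NAME = 'demo'\n")
    # Same absolute-import style as the crawler.py in the question.
    (pkg / "crawler.py").write_text(
        "from demo_pkg.spiders.demo_spider import NAME\nprint(NAME)\n"
    )
    return pkg


def run_python(args, cwd):
    """Run a fresh interpreter in cwd; return (exit code, stdout, stderr)."""
    # Drop variables that would alter sys.path and hide the effect.
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
    result = subprocess.run(
        [sys.executable, *args], cwd=str(cwd), env=env,
        capture_output=True, text=True,
    )
    return result.returncode, result.stdout.strip(), result.stderr


def compare_invocations():
    """Run crawler.py as a script inside the package, then as a module from its parent."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        pkg = build_demo_project(root)
        # Equivalent to: cd demo_pkg && python crawler.py
        as_script = run_python(["crawler.py"], cwd=pkg)
        # Equivalent to: python -m demo_pkg.crawler (from the parent folder)
        as_module = run_python(["-m", "demo_pkg.crawler"], cwd=root)
    return as_script, as_module


if __name__ == "__main__":
    script_result, module_result = compare_invocations()
    print("as script, exit code:", script_result[0])
    print("as module, exit code:", module_result[0])
```

When a file is run as a script, Python places the script's own directory at the front of sys.path, so the package that contains the script cannot be imported by name. With -m, the current working directory is used instead, which is why running from the parent folder works.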

Tags: python scrapy file-structure

【Solution 1】:

You are doing it right; the error comes from how Python resolves imports.

Traceback (most recent call last):
  File "crawler.py", line 3, in <module>
    from tutorial.spiders.indeed_spider import IndeedSpider
ModuleNotFoundError: No module named 'tutorial'

This is a well-known stumbling block in Python. I suggest creating a setup.py in the scrapy directory with the following code:

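from setuptools import setup, find_packages

# name and version are placeholders; recent setuptools releases expect
# a valid version string such as '0.1.0'.
setup(name='nameofproject', version='0.1.0', packages=find_packages())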

So your structure should be:

> - scrapy
>   - setup.py
>   - tutorial
>     - spiders
>       - __init__.py
>       - indeed_spider.py
>       - monster_spider.py
>     - __init__.py
>     - crawler.py
>     - functions.py
>     - items.py
>     - middlewares.py
>     - pipelines.py
>     - settings.py
>   - scrapy.cfg

Then run the following command in your shell from the scrapy directory:

pip install -e .

The Python interpreter should now be able to recognize tutorial as a module.
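One way to confirm that the editable install took effect is to ask Python whether it can locate the package before launching the crawl. This is a small sketch; is_importable is a hypothetical helper, and the module names in the usage section are the ones from the question.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if module_name can be located on the current sys.path."""
    try:
        # For dotted names, find_spec imports the parent packages first.
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return False


if __name__ == "__main__":
    for name in ("tutorial", "tutorial.spiders.indeed_spider"):
        print(name, "importable:", is_importable(name))
```

If both names are reported as importable, python crawler.py should get past the import lines regardless of the directory it is started from.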
