Scrapy - Output to Multiple JSON Files

[Question title]: Scrapy - Output to Multiple JSON files
[Posted]: 2015-12-28 11:59:48
[Question description]:

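I'm still quite new to Scrapy. I'm looking into using it to crawl an entire website for links, and I'd like to output the items to multiple JSON files so that I can upload them to Amazon Cloud Search for indexing. Is it possible to split the items across multiple files rather than ending up with one huge file? From what I've read, the item exporters can only output to one file per spider, but I'm using only a single CrawlSpider for this task. It would be nice if I could set a limit on the number of items in each file, such as 500 or 1000.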

Here is the code I have set up so far (based on the Dmoz.org example used in the tutorial):

dmoz_spider.py

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        # Build one item per list entry found on the crawled page.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

items.py

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Thanks for your help.

[Question discussion]:

Tags: python json scrapy

[Solution 1]:

I don't think the built-in feed exporters support writing to multiple files.

One option is to export to a single file in the jsonlines format, which puts one JSON object on each line and is therefore convenient to pipe and split.
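As a rough illustration of the jsonlines export, the settings fragment below is a sketch for Scrapy releases from around the time of this question. The output file name `items.jl` is an assumption chosen for the example. Newer Scrapy releases replace `FEED_FORMAT` and `FEED_URI` with the `FEEDS` setting, so check the documentation for the version you run.

```python
# settings.py (fragment)
# Write every scraped item as one JSON object per line.
# "items.jl" is a hypothetical output path used only for illustration.
FEED_FORMAT = "jsonlines"
FEED_URI = "items.jl"

# Roughly equivalent from the command line, where the .jl extension
# selects the jsonlines exporter:
#   scrapy crawl dmoz -o items.jl
```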

Then, as a separate step after the crawl finishes, you can read the file in chunks of the desired size and write each chunk to its own JSON file.
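A minimal sketch of that post-processing step might look like the following. The function name `split_jsonlines`, the output file naming scheme, and the default chunk size of 500 are assumptions made for this example; they are not part of Scrapy.

```python
import json
from itertools import islice
from pathlib import Path


def split_jsonlines(src_path, out_dir, chunk_size=500, prefix="items"):
    """Split a JSON Lines file into JSON array files of at most chunk_size items.

    Returns the list of paths that were written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src_path, encoding="utf-8") as src:
        # Skip blank lines so a trailing newline does not become an item.
        lines = (line for line in src if line.strip())
        index = 0
        while True:
            # Pull the next chunk_size lines lazily; the file is never
            # loaded into memory all at once.
            chunk = [json.loads(line) for line in islice(lines, chunk_size)]
            if not chunk:
                break
            index += 1
            out_path = out_dir / f"{prefix}_{index:04d}.json"
            with open(out_path, "w", encoding="utf-8") as out:
                json.dump(chunk, out, ensure_ascii=False)
            written.append(out_path)
    return written
```

Calling `split_jsonlines("items.jl", "chunks", chunk_size=1000)` would then write `chunks/items_0001.json`, `chunks/items_0002.json`, and so on, each holding a JSON array of up to 1000 items. Whether a JSON array is the right shape for the upload target is something to verify against that service's batch format.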

Regarding "So I can upload them to Amazon Cloud Search for indexing":

Note that there is a direct Amazon S3 exporter (not sure if it helps, just FYI).
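For reference, pointing the feed at S3 is a matter of configuration. The fragment below is only a sketch: the bucket name, key prefix, and credential values are placeholders, and the setting names are those used by Scrapy releases from around the time of this question (newer releases configure feeds through `FEEDS`). The S3 storage backend also needs an AWS client library installed alongside Scrapy.

```python
# settings.py (fragment)
# Placeholder bucket and prefix; %(name)s and %(time)s are filled in by
# Scrapy with the spider name and the crawl start timestamp.
FEED_FORMAT = "jsonlines"
FEED_URI = "s3://my-example-bucket/scrapes/%(name)s/%(time)s.jl"

# Placeholder credentials; supply real values through your own
# configuration mechanism rather than committing them.
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY"
```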

[Discussion]:

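I was also thinking about splitting it into separate JSON files after the crawl finishes. That sounds like the best option. Thanks for the suggestion.
I didn't even know there was an Amazon S3 exporter. I'll definitely look into that as well. Thanks again!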