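Scrapy yield items as sub-items in JSON (answers)

【Question title】: Scrapy yield items as sub-items in JSON
【Posted】: 2017-07-25 10:45:59
【Question】:

How can I tell Scrapy to split all yielded items into two lists? For example, say I have two main types of items: article and author. I want them in two separate lists. Right now the output JSON looks like this: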

[
  {
    "article_title":"foo",
    "article_published":"1.1.1972",
    "author": "John Doe"
  },
  {
    "name": "John Doe",
    "age": 42,
    "email": "foo@example.com"
  }
]

How can I turn it into something like this?

{
  "articles": [
    {
      "article_title": "foo",
      "article_published": "1.1.1972",
      "author": "John Doe"
    }
  ],
  "authors": [
    {
      "name": "John Doe",
      "age": 42,
      "email": "foo@example.com"
    }
  ]
}

The functions that yield these items are simple, along the lines of:

def parse_author(self, response):
        name = response.css('div.author-info a::text').extract_first()
        print("Parsing author: {}".format(name))

        yield {
            'author_name': name
        }

【Comments】:

Tags: python json scrapy

【Solution 1】:

With this setup, the items reach the pipeline one at a time, and each one can be handled according to its type:

items.py

import scrapy


class Article(scrapy.Item):
    title = scrapy.Field()
    published = scrapy.Field()
    author = scrapy.Field()


class Author(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()

spider.py

def parse(self, response):

    author = items.Author()
    author['name'] = response.css('div.author-info a::text').extract_first()
    print("Parsing author: {}".format(author['name']))
    yield author

    article = items.Article()
    article['title'] = response.css('article css').extract_first()
    print("Parsing article: {}".format(article['title']))

    yield article

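pipelines.py

def process_item(self, item, spider):
    # Method of the item pipeline class; `items` is the project's items module.
    if isinstance(item, items.Author):
        pass  # Do something with authors
    elif isinstance(item, items.Article):
        pass  # Do something with articles
    return item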
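To actually group the items under two keys, one option is to accumulate them inside the pipeline and write the file once the spider closes. The following is only a sketch under a few assumptions: the class name `GroupedJsonPipeline` and the file name `grouped.json` are made up for illustration, and articles are recognised by the `article_title` key from the question rather than by item class.

```python
import json


class GroupedJsonPipeline:
    """Collect items into two lists and write them as one JSON object."""

    # Hypothetical output file name; adjust as needed.
    output_path = "grouped.json"

    def open_spider(self, spider):
        # Start every crawl with empty buckets.
        self.data = {"articles": [], "authors": []}

    def process_item(self, item, spider):
        record = dict(item)  # works for plain dicts and scrapy.Item objects
        # Assumption: only article items carry an "article_title" key.
        key = "articles" if "article_title" in record else "authors"
        self.data[key].append(record)
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Write everything once, after the last item has been processed.
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f, indent=2)
```

The pipeline would still need to be enabled through the `ITEM_PIPELINES` setting in settings.py, for example `{"myproject.pipelines.GroupedJsonPipeline": 300}`, where `myproject` stands in for the real project package.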

I would suggest this schema instead:

[{
    "title": "foo",
    "published": "1.1.1972",
    "authors": [
        {
            "name": "John Doe",
            "age": 42,
            "email": "foo@example.com"
        },
        {
            "name": "Jane Doe",
            "age": 21,
            "email": "bar@example.com"
        }
    ]
}]

This turns everything into a single item.

items.py

class Article(scrapy.Item):
    title = scrapy.Field()
    published = scrapy.Field()
    authors = scrapy.Field()

spider.py

def parse(self, response):

    authors = []
    author = {}
    author['name'] = "John Doe"
    author['age'] = 42
    author['email'] = "foo@example.com"
    print("Parsing author: {}".format(author['name']))
    authors.append(author)

    article = items.Article()
    article['title'] = "foo"
    article['published'] = "1.1.1972"
    print("Parsing article: {}".format(article['title']))
    article['authors'] = authors
    yield article
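If a flat list of authors is needed from this nested schema, it can be rebuilt after the crawl. A minimal sketch, assuming the exported articles have been loaded into a Python list and that `email` uniquely identifies an author; the helper name `list_authors` and the second sample article are made up for illustration:

```python
def list_authors(articles):
    """Return each author once, in first-seen order."""
    seen = set()
    authors = []
    for article in articles:
        for author in article.get("authors", []):
            # Assumption: "email" uniquely identifies an author.
            key = author.get("email")
            if key not in seen:
                seen.add(key)
                authors.append(author)
    return authors


# Sample data in the nested schema; the second article is invented.
articles = [
    {"title": "foo", "published": "1.1.1972", "authors": [
        {"name": "John Doe", "age": 42, "email": "foo@example.com"},
        {"name": "Jane Doe", "age": 21, "email": "bar@example.com"},
    ]},
    {"title": "bar", "published": "2.2.1972", "authors": [
        {"name": "John Doe", "age": 42, "email": "foo@example.com"},
    ]},
]
print([a["name"] for a in list_authors(articles)])
```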

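【Comments】:

I'm still not sure how to group all items of a given type under one JSON key. Changing the pipeline to return {'author': item} still creates an author key for every single item. I suppose I need to accumulate all items in a list of my own somewhere and output them as JSON at the end, but I don't know where.
The schema you suggest is fine if I mostly want to iterate over articles. Listing all authors, for example, becomes harder.

【Solution 2】: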

raw = [
    {
        "article_title":"foo",
        "article_published":"1.1.1972",
        "author": "John Doe"
    },
    {
        "name": "John Doe",
        "age": 42,
        "email": "foo@example.com"
    }
]

data = {'articles': [], 'authors': []}

for a in raw:
    # Items with an 'article_title' key are articles; everything else is an author.
    if 'article_title' in a:
        data['articles'].append(a)
    else:
        data['authors'].append(a)
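One place this loop could run is after the crawl, on the file written by Scrapy's feed export (for example via the `-o items.json` option of `scrapy crawl`). A minimal sketch, assuming that file holds the flat list shown in the question; the function name `group_feed` and both file names are hypothetical:

```python
import json


def group_feed(in_path, out_path):
    """Regroup a flat JSON feed into {"articles": [...], "authors": [...]}."""
    with open(in_path, encoding="utf-8") as f:
        raw = json.load(f)

    data = {"articles": [], "authors": []}
    for item in raw:
        # Assumption: only articles have an "article_title" key.
        key = "articles" if "article_title" in item else "authors"
        data[key].append(item)

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data
```

Calling `group_feed("items.json", "grouped.json")` would then leave the original export untouched and write the regrouped structure to a second file.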

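【Comments】:

I don't know how to handle a dictionary like this in Scrapy. Yielding from the parse function hands the dictionary straight to Scrapy, and I have no way to process it at the end. Could you expand your answer?
@MartinMelka Where do you mean by "process"? Sorry, I didn't get your question... My understanding is that your data should be accessible in the pipeline via item['articles'].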