【Question title】: Scrapy yield items as sub-items in JSON
【Posted】: 2017-07-25 10:45:59
【Question】:

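How can I tell Scrapy to split all yielded items into two lists? For example, say I have two main types of items: article and author. I want each type in its own list. Right now the output JSON looks like this: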

[
  {
    "article_title":"foo",
    "article_published":"1.1.1972",
    "author": "John Doe"
  },
  {
    "name": "John Doe",
    "age": 42,
    "email": "foo@example.com"
  }
]

How can I turn it into something like this?

{
  "articles": [
    {
      "article_title": "foo",
      "article_published": "1.1.1972",
      "author": "John Doe"
    }
  ],
  "authors": [
    {
      "name": "John Doe",
      "age": 42,
      "email": "foo@example.com"
    }
  ]
}

The functions that yield these items are simple, along these lines:

def parse_author(self, response):
    name = response.css('div.author-info a::text').extract_first()
    print("Parsing author: {}".format(name))

    yield {
        'author_name': name
    }

【Comments】:

    Tags: python json scrapy


    【Solution 1】:

    With this setup, the items reach the pipeline separately, and each one can be handled according to its type:

    items.py

    import scrapy
    
    class Article(scrapy.Item):
        title = scrapy.Field()
        published = scrapy.Field()
        author = scrapy.Field()
    
    class Author(scrapy.Item):
        name = scrapy.Field()
        age = scrapy.Field()
    

    spider.py

    def parse(self, response):
    
        author = items.Author()
        author['name'] = response.css('div.author-info a::text').extract_first()
        print("Parsing author: {}".format(author['name']))
        yield author
    
        article = items.Article()
        article['title'] = response.css('article css').extract_first()
        print("Parsing article: {}".format(article['title']))
    
        yield article
    

    pipeline.py

    def process_item(self, item, spider):
        if isinstance(item, items.Author):
            pass  # Do something to authors
        elif isinstance(item, items.Article):
            pass  # Do something to articles
        return item  # pass the item on to any later pipeline stage
    

    I would suggest this schema instead:

    [{
        "title": "foo",
        "published": "1.1.1972",
        "authors": [
            {
            "name": "John Doe",
            "age": 42,
            "email": "foo@example.com"
            },
            {
            "name": "Jane Doe",
            "age": 21,
            "email": "bar@example.com"
            }
        ]
    }]
    

    That makes it a single item.

    items.py

    class Article(scrapy.Item):
        title = scrapy.Field()
        published = scrapy.Field()
        authors = scrapy.Field()
    

    spider.py

    def parse(self, response):
    
        authors = []
        author = {}
        author['name'] = "John Doe"
        author['age'] = 42
        author['email'] = "foo@example.com"
        print("Parsing author: {}".format(author['name']))
        authors.append(author)
    
        article = items.Article()
        article['title'] = "foo"
        article['published'] = "1.1.1972"
        print("Parsing article: {}".format(article['title']))
        article['authors'] = authors
        yield article
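
With the nested schema above, a flat list of all authors is not stored directly and has to be derived from the articles. A minimal sketch of that step, assuming the scraped articles are available as a list of dicts and that an email address identifies an author uniquely (both are assumptions for illustration; `list_authors` is a hypothetical helper, not part of Scrapy):

```python
def list_authors(articles):
    """Return each distinct author once, in first-seen order."""
    seen = set()
    authors = []
    for article in articles:
        for author in article.get("authors", []):
            # Assumption: the email address uniquely identifies an author.
            ident = author.get("email")
            if ident not in seen:
                seen.add(ident)
                authors.append(author)
    return authors
```

This keeps the per-article nesting in the export while still allowing an author listing when one is needed.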
    

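    【Comments】:

    • I'm still not sure how to group all items of a given type under a single JSON key. Changing the pipeline to return {'author': item} still creates an author key for every item. I think I need to accumulate all the items in a list of my own somewhere and then output them as JSON at the end, but I don't know where.
    • The schema you suggest works well if I mainly want to iterate over articles. Listing all authors, for example, becomes harder.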
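
One place to accumulate items and write them out at the end is an item pipeline, since Scrapy calls a pipeline's `open_spider` and `close_spider` methods once per crawl and `process_item` once per item. The following is only a sketch under stated assumptions: the class name `GroupedJsonPipeline` and the output path `grouped.json` are hypothetical, and items are told apart by the presence of an `article_title` key, as in the question's sample output.

```python
import json


class GroupedJsonPipeline:
    """Collect yielded items into two lists and write one JSON object at the end."""

    # Hypothetical output path; not part of the original question.
    output_path = "grouped.json"

    def open_spider(self, spider):
        # Called once when the spider starts: prepare empty buckets.
        self.data = {"articles": [], "authors": []}

    def process_item(self, item, spider):
        record = dict(item)  # works for plain dicts and scrapy.Item objects
        # Assumption: only article items carry an "article_title" key.
        key = "articles" if "article_title" in record else "authors"
        self.data[key].append(record)
        return item  # keep passing the item down the pipeline chain

    def close_spider(self, spider):
        # Called once when the spider finishes: dump everything in one go.
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f, indent=2, ensure_ascii=False)
```

To try it, the class would be registered in the project's `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.GroupedJsonPipeline": 300}`, where `myproject` stands in for the actual project package. Note that everything is held in memory until the crawl ends, which may not suit very large crawls.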
    【Solution 2】:
    raw = [
        {
            "article_title":"foo",
            "article_published":"1.1.1972",
            "author": "John Doe"
        },
        {
            "name": "John Doe",
            "age": 42,
            "email": "foo@example.com"
        }
    ]
    
    data = {'articles':[], "authors":[]}
    
    for a in raw:
    
        if 'article_title' in a:
            data['articles'].append(a)
    
        else:
            # everything that is not an article is treated as an author
            data['authors'].append(a)
    

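    【Comments】:

    • I don't know how to process a dictionary like this in Scrapy. Yielding from the parse function hands the dictionary straight to Scrapy, and I never get a chance to process it at the end. Could you expand your answer?
    • @MartinMelka Process it where, exactly? Sorry, I didn't get your question... My understanding is that your data should be accessible in the pipeline via item['articles'].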
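
One way to apply the regrouping from this answer without touching the spider is to run it as a separate step after the crawl, on the flat JSON file that Scrapy exports. This is a hedged sketch, not the answerer's own method: `group_items` and `regroup_file` are hypothetical helper names, the file names are placeholders, and entries are again assumed to be articles exactly when they have an `article_title` key.

```python
import json


def group_items(raw):
    """Split a flat list of scraped dicts into articles and authors."""
    data = {"articles": [], "authors": []}
    for entry in raw:
        # Assumption: only article entries have an "article_title" key.
        key = "articles" if "article_title" in entry else "authors"
        data[key].append(entry)
    return data


def regroup_file(src_path, dst_path):
    """Read a flat JSON export and write the grouped version."""
    with open(src_path, encoding="utf-8") as f:
        raw = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(group_items(raw), f, indent=2, ensure_ascii=False)
```

For example, after exporting with `scrapy crawl myspider -o items.json` (the spider name is a placeholder), calling `regroup_file("items.json", "grouped.json")` would produce the grouped layout from the question.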