【Question title】: Scrapy's JSON output forms an array of JSON objects
【Posted】: 2021-05-12 11:49:20
【Question】:

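I am trying to scrape a game information website with Scrapy. The crawl proceeds as follows: scrape the categories -> scrape each category's game list (every category spans several pages) -> scrape each game's details. The scraped data should end up in a single JSON file. This is the result I get: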

[
    {"category": "cat1", "games": [...]},
    {"category": "cat2", "games": [...]},
    ...
]

But this is the result I want:

{ "categories":
    [
        {"category": "cat1", "games": [...]},
        {"category": "cat2", "games": [...]},
        ...
    ]
}

I tried following the steps in this post, but without success. I could not find any other related questions.

I would appreciate any help.

My spider:

import scrapy
from ..items import Category, Game

class GamesSpider(scrapy.Spider):
    name = 'games'
    start_urls = ['https://www.example.com/categories']
    base_url = 'https://www.example.com'

    def parse(self, response):
        categories = response.xpath("...")

        for category in categories:
            cat_name = category.xpath(".//text()").get()
            url = self.base_url + category.xpath(".//@href").get()    
            
            cat = Category()
            cat['category'] = cat_name
            
            yield response.follow(url=url, 
                                  callback=self.parse_category, 
                                  meta={ 'category': cat })

    def parse_category(self, response):
        games_url_list = response.xpath('//.../a/@href').getall()

        cat = response.meta['category']
        url = self.base_url + games_url_list.pop()
        next_page = response.xpath('//a[...]/@href').get()
        
        if next_page:
            next_page = self.base_url + response.xpath('//a[...]/@href').get()

        yield response.follow(url=url, 
                              callback=self.parse_game, 
                              meta={'category': cat, 
                                    'games_url_list': games_url_list, 
                                    'next_page': next_page})
            
    def parse_game(self, response):
        cat = response.meta['category']
        game = Game()

        # Create the games list the first time this category is seen
        try:
            cat['games_list']
        except KeyError:
            cat['games_list'] = []
        
        game['title_en'] = response.xpath('...')
        game['os'] = response.xpath('...')
        game['users_rating'] = response.xpath('...')
 
        cat['games_list'].append(game)

        games_url_list = response.meta['games_url_list']
        next_page = response.meta['next_page']
        
        if games_url_list: 
            url = self.base_url + games_url_list.pop()
            yield response.follow(url=url, 
                                  callback=self.parse_game, 
                                  meta={'category': cat, 
                                        'games_url_list': games_url_list, 
                                        'next_page': next_page})

        else:
            if next_page:
                yield response.follow(url=next_page, 
                                      callback=self.parse_category, 
                                      meta={'category': cat})
            else:
                yield cat

My items.py file:

import scrapy

class Category(scrapy.Item):
    category = scrapy.Field()
    games_list = scrapy.Field()

class Game(scrapy.Item):
    title_en = scrapy.Field()
    os = scrapy.Field()
    users_rating = scrapy.Field()

【Comments】:

    Tags: python json web-scraping scrapy


    【Solution 1】:

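    You need to either write a custom item exporter or post-process the file Scrapy generates as a separate step, for example with a standalone Python script that converts the output into the desired format.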
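
The post-processing route can be sketched roughly as follows. This is a minimal illustration, not part of the original answer: the helper names `wrap_items` and `convert_file` and the default key `"categories"` are assumptions chosen to match the output shape requested in the question. It only assumes that Scrapy's JSON feed is a single top-level array.

```python
import json
from pathlib import Path


def wrap_items(raw_json: str, key: str = "categories") -> str:
    """Wrap a top-level JSON array in an object under `key` (hypothetical helper)."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a top-level JSON array")
    # ensure_ascii=False keeps non-ASCII game titles readable in the output
    return json.dumps({key: items}, ensure_ascii=False, indent=4)


def convert_file(src: str, dst: str, key: str = "categories") -> None:
    """Read Scrapy's JSON feed from `src` and write the wrapped object to `dst`."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(wrap_items(text, key), encoding="utf-8")
```

After a crawl that writes its feed to a file, calling something like `convert_file("games.json", "games_wrapped.json")` (both file names are placeholders) would produce the `{"categories": [...]}` layout. The custom exporter route avoids the second pass over the file, but it requires subclassing one of Scrapy's exporters and registering it in the project settings.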

    【Comments】:
