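【Posted】: 2021-05-12 11:49:20
【Question】:
I am trying to scrape a game information website with Scrapy. The crawl proceeds as follows: scrape the categories -> scrape the list of games in each category (each category spans multiple pages) -> scrape the details of each game. The scraped data should be written to a JSON file. I currently get the following result: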
[
{"category": "cat1", "games": [...]},
{"category": "cat2", "games": [...]},
...
]
But I want to get this result instead:
{ "categories":
[
{"category": "cat1", "games": [...]},
{"category": "cat2", "games": [...]},
...
]
}
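The difference between the two shapes is only the outer wrapper: the desired output is the current list placed under a single top-level key. As a minimal sketch of that transformation (the helper name `wrap_categories` and the inline sample data are illustrative assumptions, not part of the project), the list could be wrapped after the crawl:

```python
import json


def wrap_categories(items, key="categories"):
    """Place a list of category objects under one top-level key."""
    return {key: list(items)}


# Stand-in for the list that the crawl currently writes to the JSON file.
exported = json.loads(
    '[{"category": "cat1", "games": []}, {"category": "cat2", "games": []}]'
)

wrapped = wrap_categories(exported)
print(json.dumps(wrapped, indent=2))
```

In practice `exported` would be loaded from the file the crawl produced and `wrapped` written back with `json.dump`.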
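I tried following the steps in this post and this post, but without success, and I could not find any other related questions.
I would appreciate any help.
My spider: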
import scrapy

from ..items import Category, Game


class GamesSpider(scrapy.Spider):
    name = 'games'
    start_urls = ['https://www.example.com/categories']
    base_url = 'https://www.example.com'

    def parse(self, response):
        # One request per category; the Category item travels along in meta.
        categories = response.xpath("...")
        for category in categories:
            cat_name = category.xpath(".//text()").get()
            url = self.base_url + category.xpath(".//@href").get()
            cat = Category()
            cat['category'] = cat_name
            yield response.follow(url=url,
                                  callback=self.parse_category,
                                  meta={'category': cat})

    def parse_category(self, response):
        # Collect the game URLs on this page and start with the last one.
        games_url_list = response.xpath('//.../a/@href').getall()
        cat = response.meta['category']
        url = self.base_url + games_url_list.pop()
        next_page = response.xpath('//a[...]/@href').get()
        if next_page:
            next_page = self.base_url + next_page
        yield response.follow(url=url,
                              callback=self.parse_game,
                              meta={'category': cat,
                                    'games_url_list': games_url_list,
                                    'next_page': next_page})

    def parse_game(self, response):
        cat = response.meta['category']
        game = Game()
        # Create the list when the first game of a category is parsed.
        try:
            cat['games_list']
        except KeyError:
            cat['games_list'] = []
        game['title_en'] = response.xpath('...')
        game['os'] = response.xpath('...')
        game['users_rating'] = response.xpath('...')
        cat['games_list'].append(game)
        games_url_list = response.meta['games_url_list']
        next_page = response.meta['next_page']
        if games_url_list:
            # More games remain on this page: follow the next one.
            url = self.base_url + games_url_list.pop()
            yield response.follow(url=url,
                                  callback=self.parse_game,
                                  meta={'category': cat,
                                        'games_url_list': games_url_list,
                                        'next_page': next_page})
        else:
            if next_page:
                # Page exhausted: continue with the next page of the category.
                yield response.follow(url=next_page,
                                      callback=self.parse_category,
                                      meta={'category': cat})
            else:
                # Last page done: emit the completed category.
                yield cat
My items.py file:
import scrapy


class Category(scrapy.Item):
    category = scrapy.Field()
    games_list = scrapy.Field()


class Game(scrapy.Item):
    title_en = scrapy.Field()
    os = scrapy.Field()
    users_rating = scrapy.Field()
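One possible way to get the wrapped output is an item pipeline that buffers every yielded `Category` and writes a single object when the spider closes. The following is only a sketch under assumptions: the class name `WrappedJsonPipeline`, the output file name `games.json`, and the helper `to_plain` are hypothetical and do not come from the project above. Scrapy item pipelines are plain Python classes, so the sketch needs no Scrapy import; it relies on `scrapy.Item` behaving like a mapping.

```python
import json
from collections.abc import Mapping


def to_plain(value):
    """Recursively turn mappings (such as scrapy.Item) and lists into plain JSON types."""
    if isinstance(value, Mapping):
        return {key: to_plain(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(val) for val in value]
    return value


class WrappedJsonPipeline:
    """Hypothetical pipeline: collects items and writes them under one top-level key."""

    output_path = "games.json"  # assumed output file name
    root_key = "categories"

    def open_spider(self, spider):
        # Start with an empty buffer for each crawl.
        self.items = []

    def process_item(self, item, spider):
        # Store a plain-dict copy so nested Game items serialize cleanly.
        self.items.append(to_plain(item))
        return item

    def close_spider(self, spider):
        # Write all buffered categories as {"categories": [...]}.
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump({self.root_key: self.items}, f, ensure_ascii=False, indent=2)
```

To try it, the class would need to be registered in the project's `ITEM_PIPELINES` setting under whatever module path it is saved in. Buffering everything in memory is a deliberate simplification; it suits a crawl whose full result fits in RAM.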
【Discussion】:
Tags: python json web-scraping scrapy