【Question title】: How to scrape JSON web pages
【Posted】: 2019-11-10 16:56:31
【Question】:

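Hey, so I have some experience scraping HTML but never JSON. I need to use Scrapy to scrape the following page: http://www.starcitygames.com/buylist/search?search-type=category&id=5061. I found a tutorial online that uses Scrapy and JMESPath to scrape JSON data from the web. I got the tutorial working, but when I tried to adapt it to my site I had no success. There are no errors, but it does not return any data. Any help would be appreciated!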

items.py

import scrapy


class NameItem(scrapy.Item):
    """User item definition for jsonplaceholder /LoginSpider endpoint."""
    name = scrapy.Field()
    condition = scrapy.Field()
    price = scrapy.Field()
    rarity = scrapy.Field()

LoginSpider.py

import scrapy
import json
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import NameItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose, SelectJmes


class UserSpider(scrapy.Spider):
    """Spider to scrape `http://www.starcitygames.com/buylist/search?search-type=category&id=5061`."""
    name = 'LoginSpider'
    allowed_domains = ['http://www.starcitygames.com/buylist/search?search-type=category&id=5061']
    start_urls = ['http://www.starcitygames.com/buylist/search?search-type=category&id=5061']
    # dictionary to map UserItem fields to Jmes query paths
    jmes_paths = {
            'name': 'name',
            'condition': 'condition',
            'price': 'price',
            'rarity': 'rarity',
            }

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        for user in jsonresponse:
            loader = ItemLoader(item=NameItem())  # create an ItemLoader to populate a NameItem
            loader.default_input_processor = MapCompose(str)  # apply str conversion on each value
            loader.default_output_processor = Join(' ')
            for (field, path) in self.jmes_paths.items():
                loader.add_value(field, SelectJmes(path)(user))
            yield loader.load_item()

【Comments】:

    Tags: python json scrapy


    【Solution 1】:

    The response from this URL, http://www.starcitygames.com/buylist/search?search-type=category&id=5061, has three top-level keys:

    1. 'ok'
    2. 'search'
    3. 'results'  ## this one contains the data

    The results key holds multiple values, and you should iterate over them; the data is inside those values. Try this code, I hope it helps.
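To make the nesting concrete before the Scrapy version, here is a minimal plain-Python sketch of the same iteration. The payload is made up for illustration: only the key names (`ok`, `search`, `results`, and the four card fields) come from this thread, and the helper name `iter_cards` is hypothetical.

```python
# Hypothetical sample payload: key names follow the answer, values are invented.
sample_payload = {
    "ok": True,
    "search": "category",
    "results": [
        [
            {"name": "Card A", "condition": "NM/M", "price": "1.00", "rarity": "R"},
            {"name": "Card A", "condition": "PL", "price": "0.75", "rarity": "R"},
        ],
        [
            {"name": "Card B", "condition": "NM/M", "price": "0.10", "rarity": "C"},
        ],
    ],
}

FIELDS = ("name", "condition", "price", "rarity")


def iter_cards(payload):
    """Yield one flat dict per card from the nested 'results' list."""
    for group in payload["results"]:      # each value under 'results' is a list
        for card in group:                # each element of that list is a card dict
            yield {field: card[field] for field in FIELDS}


cards = list(iter_cards(sample_payload))
print(len(cards))  # 3 cards across the two groups
```

The two nested loops are the whole idea; the spider only has to wrap each card in an item.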

    This is the items.py module

    import scrapy


    class SoResponseItem(scrapy.Item):
        # One field per value extracted from each card entry.
        name = scrapy.Field()
        condition = scrapy.Field()
        price = scrapy.Field()
        rarity = scrapy.Field()
    

    This is the spider

    import scrapy
    import json
    from SO_response.items import SoResponseItem
    
    class LoginspiderSpider(scrapy.Spider):
        name = 'LoginSpider'
        allowed_domains = ['www.starcitygames.com']
        url = 'http://www.starcitygames.com/'
    
        def start_requests(self):
            yield scrapy.Request(url=self.url, callback=self.parse)
    
        def parse(self, response):
            url = response.urljoin('buylist/search?search-type=category&id=5061')
            yield scrapy.Request(url=url, callback=self.parse_data)
    
        def parse_data(self, response):
            # The body is JSON; each entry under 'results' is itself a list of card dicts.
            jsonresponse = json.loads(response.text)
            for result in jsonresponse['results']:
                for card in result:
                    item = SoResponseItem()
                    item['name'] = card['name']
                    item['condition'] = card['condition']
                    item['price'] = card['price']
                    item['rarity'] = card['rarity']
                    yield item
    

    Then try this in your shell: scrapy crawl LoginSpider -o jmes.json
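As for why the spider in the question ran without errors yet returned nothing, one plausible cause (a guess based on the response shape described above, not something confirmed in this thread) is that it loops over the top-level JSON object directly. Looping over a dict yields its keys, so each `user` would be a plain string with no `name` or `price` to select. The body below is made up and only mirrors the three key names.

```python
import json

# Hypothetical body with the same three top-level keys; the values are invented.
body = '{"ok": true, "search": "category", "results": [[{"name": "Card A"}]]}'
payload = json.loads(body)

# Looping over a dict walks its keys, so each "user" here is just a string.
top_level = [user for user in payload]
print(top_level)

# The card dicts only appear two levels down, under 'results'.
names = [card["name"] for group in payload["results"] for card in group]
print(names)
```

If that is what happened, pointing the loop at `payload["results"]` and then at each inner list, as the spider above does, is the fix.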

    【Comments】:
