【Question title】: Using Scrapy to scrape nested JSON data?
【Posted】: 2016-04-01 22:17:34
【Question】:

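I'm trying to write a web application that scrapes information from Sony's PlayStation Store. I found the JSON file that contains the data I want, but how can I use Scrapy to store only certain elements of that JSON file?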

Here is part of the JSON data:

{
  "age_limit": 0,
  "attributes": {
    "facets": {
      "platform": [
        {"name": "PS4™", "count": 96, "key": "ps4"},
        {"name": "PS3™", "count": 5, "key": "ps3"},
        {"name": "PS Vita", "count": 7, "key": "vita"}
      ]
    }
  }
}

I only want the "count" value of the entry whose "name" is PS4. How do I get this in Scrapy? Here is my Scrapy code so far:

import json

import scrapy
from crossbuy.items import PS4Vita


class PS4VitaSpider(scrapy.Spider):
    name = "ps4vita"  # Name of the spider, to be used when crawling
    allowed_domains = ["store.playstation.com"]  # Where the spider is allowed to go
    # Scrapy reads the plural attribute start_urls, which must be a list
    start_urls = ["https://store.playstation.com/chihiro-api/viewfinder/US/en/999/STORE-MSF77008-9_PS4PSVCBBUNDLE?size=30&gkb=1&geoCountry=US"]

    def parse(self, response):
        # Decode the response body, not the response object itself
        jsonresponse = json.loads(response.body)

        pass  # To be changed later

Thanks!

【Comments】:

  • Can't you just access the {"name": "PS4™"} entry the usual way? For example: [p["count"] for p in jsonresponse["attributes"]["facets"]["platform"] if p["name"] == "PS4™"]

Tags: python json scrapy


【Solution 1】:
...
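def parse(self, response):
    # response.body holds the raw JSON payload returned by the API
    jsonresponse = json.loads(response.body)
    my_count = None
    # Walk the platform facets and keep the count of the PS4 entry
    for platform in jsonresponse['attributes']['facets']['platform']:
        if 'PS4' in platform['name']:
            my_count = platform['count']

    yield dict(count=my_count)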
...
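
The lookup above can also be tried outside Scrapy. The following is only a minimal sketch, assuming the response has the shape shown in the question; `SAMPLE` and `count_for_platform` are hypothetical names introduced for illustration. It matches on the "key" field ("ps4") instead of the display name, which avoids depending on the "™" character, and returns None when the entry or any intermediate level is missing.

```python
import json

# Sample payload copied from the question, trimmed to the relevant part.
SAMPLE = '''
{
  "age_limit": 0,
  "attributes": {
    "facets": {
      "platform": [
        {"name": "PS4™", "count": 96, "key": "ps4"},
        {"name": "PS3™", "count": 5, "key": "ps3"},
        {"name": "PS Vita", "count": 7, "key": "vita"}
      ]
    }
  }
}
'''


def count_for_platform(data, key):
    """Return the 'count' of the platform whose 'key' matches, or None."""
    # Chained .get() calls tolerate a missing level in the nested structure.
    platforms = data.get("attributes", {}).get("facets", {}).get("platform", [])
    for platform in platforms:
        if platform.get("key") == key:
            return platform.get("count")
    return None


data = json.loads(SAMPLE)
print(count_for_platform(data, "ps4"))  # prints 96
```

Inside a spider, the same helper could be called from parse() on the decoded response and its result yielded as in the snippet above.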

【Comments】:

【Solution 2】:

Access the JSON data just as you would a Python dictionary:

# To get a list of the counts:
counts = [x['count'] for x in jsonresponse['attributes']['facets']['platform']]
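
Since a bare list of counts loses track of which platform each number belongs to, one possible variation is to build a dictionary keyed by platform. This is a sketch under the assumption that every entry carries both "key" and "count", as in the question's sample; `platforms` stands in for jsonresponse['attributes']['facets']['platform'], and `counts_by_key` is a hypothetical name.

```python
# Stand-in for jsonresponse['attributes']['facets']['platform'],
# using the sample values from the question.
platforms = [
    {"name": "PS4™", "count": 96, "key": "ps4"},
    {"name": "PS3™", "count": 5, "key": "ps3"},
    {"name": "PS Vita", "count": 7, "key": "vita"},
]

# Map each platform key to its count so a single platform can be looked up by name.
counts_by_key = {p["key"]: p["count"] for p in platforms}

print(counts_by_key["ps4"])  # prints 96
```

Calling counts_by_key.get("ps4") instead of indexing would return None rather than raise KeyError if the platform is absent.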
    

【Comments】:
