【Question title】: Get values within keys with item loader scrapy
【Posted】: 2022-01-22 07:30:56
【Question】:

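I am trying to extract some values from within the keys of a web page's JSON response. Unfortunately, when I do this it returns only the keys, and I can't seem to get the values. Because the keys form a very long, numbered list, I can't figure out how to get the values for all of them.

For example, here is my working code: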

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.loader import ItemLoader
from scrapy.item import Field
from itemloaders.processors import TakeFirst

class DepopItem(scrapy.Item):
    brands = Field(output_processor=TakeFirst())

class DepopSpider(scrapy.Spider):
    name = 'depop'
    allowed_domains = ["depop.com"]
    start_urls = ['https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance']

    
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }
    
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url, 
                callback=self.parse,
             )

    def parse(self, response):
        resp= response.json()['brands']
        for item in resp:
            loader = ItemLoader(DepopItem(), selector=item)
            loader.add_value('brands', item)
 
            yield loader.load_item()

This returns a list of keys:

{"brands": "1"}
{"brands": "2"}
{"brands": "3"}
{"brands": "4"}
{"brands": "5"}
{"brands": "7"}
{"brands": "9"}

Instead, I want the values that correspond to those keys:

{"brands": 946}
{"brands": 2376}
{"brands": 1286}
{"brands": 2774}
{"brands": 489}
{"brands": 11572}
{"brands": 1212}

【Comments】:

    Tags: python web-scraping scrapy


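    【Solution 1】:

    Iterating over a dict yields only its keys. Use values() to iterate over the values instead, or look each value up with resp[item].

    Example: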

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.item import Field
    from itemloaders.processors import TakeFirst
    
    
    class DepopItem(scrapy.Item):
        brands = Field(output_processor=TakeFirst())
    
    
    class DepopSpider(scrapy.Spider):
        name = 'depop'
        allowed_domains = ["depop.com"]
        start_urls = ['https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance']
    
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        }
    
        def parse(self, response):
            resp = response.json()['brands']
            for item in resp.values():
                # add_value() receives the value directly, so no selector is needed
                loader = ItemLoader(item=DepopItem())
                loader.add_value('brands', item['count'])
                yield loader.load_item()
    

    Output:

    {'brands': 888}
    {'brands': 1}
    {'brands': 52}
    {'brands': 138}
    {'brands': 148}
    ...
    ...
    ...
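
The answer names two options, values() and resp[item], but the spider above only shows the first. The sketch below puts both side by side on a small hand-written dict, so it runs without Scrapy or network access. The sample data is an assumption: it only mirrors the shape implied by the answer (numbered keys mapping to objects with a "count" field), and the real API response may contain other fields.

```python
# Hypothetical sample shaped like response.json()['brands'] in the answer;
# the real payload is assumed, not confirmed, to look like this.
brands = {
    "1": {"count": 946},
    "2": {"count": 2376},
    "3": {"count": 1286},
}

# Iterating a dict directly yields its keys, which is what the
# question's loop was loading into the item.
keys = [item for item in brands]

# Option 1: iterate over the values.
counts_from_values = [entry["count"] for entry in brands.values()]

# Option 2: keep iterating over the keys and index back into the dict.
counts_from_lookup = [brands[key]["count"] for key in brands]

print(keys)
print(counts_from_values)
print(counts_from_lookup)
```

Both options produce the same counts in the same order; values() is usually the tidier choice when the key itself is not needed.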
    

    【Discussion】:

    • Ah, so simple! I never would have figured it out, though. Thanks!
    【Solution 2】:

    I'm not sure how this works in scrapy, but you can do it like this:

    import requests
    import json
    from itertools import starmap
    from requests.models import Response
    from typing import Dict, List
    
    
    url = "https://webapi.depop.com/api/v2/search/filters/aggregates/?brands=1596&itemsPerPage=24&country=gb&currency=GBP&sort=relevance"
    resp: Response = requests.get(url)
    data: Dict = json.loads(resp.text).get("brands")
    values: List[Dict] = list(starmap(lambda k,v: {"brands": v["count"]}, data.items()))
    

    Output:

    [{'brands': 989},
     {'brands': 1838},
     {'brands': 2415},
     {'brands': 1344},
     ...]
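
The starmap call above unpacks each (key, value) pair and then ignores the key. As a rough sketch, the same transformation can be written as a list comprehension over values(). The helper name brand_counts and the sample data are hypothetical, introduced only for illustration; the sketch assumes every entry carries a "count" field, as in the answer.

```python
from itertools import starmap
from typing import Dict, List


def brand_counts(data: Dict[str, dict]) -> List[Dict[str, int]]:
    """Build one {"brands": count} dict per entry (hypothetical helper)."""
    return [{"brands": entry["count"]} for entry in data.values()]


# Hypothetical sample standing in for the "brands" object of the API response.
sample = {"1": {"count": 946}, "2": {"count": 2376}}

# The starmap form from the answer, applied to the same sample for comparison.
via_starmap = list(starmap(lambda k, v: {"brands": v["count"]}, sample.items()))

print(brand_counts(sample))
print(via_starmap)
```

The two forms return the same list; the comprehension simply avoids naming a key that is never used.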
    

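    【Discussion】:

    • I'm aware of this approach, as it's what I'm currently doing, but I specifically want to do it through scrapy to improve my skills. Thanks for the attempt!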