【Question title】:How to store yielded request responses in Scrapy
【Posted】:2017-04-14 08:25:00
【Question】:

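Hello, I'm new to Python and Scrapy, so this is probably a beginner question. I searched but couldn't find anything that answers it directly. I'm trying to crawl the pages of the countries listed below, store each country's population in an array, and then print them all at once. As you can see, the code below prints every time a request completes. How can I print everything in one batch from an array of results? Thanks.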

class CrawlerSpider(scrapy.Spider):
    name = 'wikiCrawler'
    #allowed_domains = ['web']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states']
    #counter = 1
    global i
    i = {}
    global list
    list = []

    def __init__(self):
        self.counter = 1
        pass

    def parse(self, response):

        for resultHref in response.xpath('//table[contains(@class, "wikitable")]//a[preceding-sibling::span[@class="flagicon"]]'):
            href = resultHref.xpath('./@href').extract_first()
            nameC = resultHref.xpath('./text()').extract_first()
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item, meta={'Country': nameC})

    def parse_item(self, response):
        self.counter = self.counter + 1
        i['country'] = response.meta['Country']
        i['population'] = response.xpath('//tr[preceding-sibling::tr/th/a/text()="Population"]/td/text()').extract_first()
        yield i #this is where I would like to store the data instead of printing and then later print all together

【Question comments】:

    Tags: python scrapy


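    【Solution 1】:

    The i variable is now created inside the parse_item function instead of in the class body. The original code kept one shared dict and mutated it in every callback, so every yielded item referred to the same object. I tested this version and it works, although the XPath selector may need some refinement.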
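
To see why creating the dict inside the callback matters, here is a small sketch in plain Python with no Scrapy involved. The generator names and the sample rows are made up for illustration. It only shows the aliasing effect of reusing one dict; whether this causes trouble in a given crawl depends on when the consumer reads each item.

```python
def shared_items(rows):
    # One dict reused for every row, as with a module-level `i`.
    item = {}
    for country, population in rows:
        item['country'] = country
        item['population'] = population
        yield item


def fresh_items(rows):
    # A new dict per row, as with `i = {}` inside the callback.
    for country, population in rows:
        yield {'country': country, 'population': population}


# Hypothetical sample data, not real scraped values.
rows = [('Aland', '100'), ('Borduria', '200')]

shared = list(shared_items(rows))
fresh = list(fresh_items(rows))

# Every entry in `shared` is the same object holding the last row's values.
print(shared[0] is shared[1])
# Entries in `fresh` are independent and keep their own values.
print(fresh[0]['country'], fresh[1]['country'])
```
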

    import scrapy


    class CrawlerSpider(scrapy.Spider):
        name = 'wikiCrawler'
        #allowed_domains = ['web']
        start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states']

        def __init__(self, *args, **kwargs):
            # Let scrapy.Spider finish its own setup before adding state.
            super().__init__(*args, **kwargs)
            self.counter = 1

        def parse(self, response):
            # Follow every country link that sits next to a flag icon.
            for resultHref in response.xpath('//table[contains(@class, "wikitable")]//a[preceding-sibling::span[@class="flagicon"]]'):
                href = resultHref.xpath('./@href').extract_first()
                nameC = resultHref.xpath('./text()').extract_first()
                yield scrapy.Request(response.urljoin(href), callback=self.parse_item, meta={'Country': nameC})

        def parse_item(self, response):
            # Build a fresh dict per response so items do not overwrite each other.
            i = {}
            self.counter = self.counter + 1
            i['country'] = response.meta['Country']
            i['population'] = response.xpath('//tr[preceding-sibling::tr/th/a/text()="Population"]/td/text()').extract_first()
            yield i

    【Comments】:
