【Question title】: Can a server read the Request.meta data sent by Scrapy?
【Posted】: 2018-01-09 06:13:41
【Question】:

The code below is basically an example Amazon spider.
I would like to know whether the Amazon server (or any other server) can see the data we pass to Scrapy's Request.meta. If Request.meta is not sent along with our request, how does that metadata end up in our response.meta?

Can anyone explain how Scrapy's request.meta and response.meta work?

import random
from HTMLParser import HTMLParser

import scrapy
from scrapy.crawler import CrawlerProcess

import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
from amazon.items import AmazonItem
from amazon.user_agents import user_agent_list


class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


# Module-level helper: it takes no `self`, so it cannot be a method of MLStripper.
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


class Amazon(scrapy.Spider):
    allowed_domains = ['amazon.in']
    start_urls = ['http://www.amazon.in']
    name = 'amazon'

    def parse(self, response):
        product_detail = response.xpath('//li[@class="s-result-item  celwidget "]')
        for product in product_detail:
            asin = product.xpath('@data-asin').extract_first().encode('ascii', 'ignore')
            url = 'http://www.amazon.in/dp/' + asin
            brand = product.xpath('div/div/div/span[2]/text()').extract_first()
            if brand != 'Azani':
                request = scrapy.Request(url, callback=self.parse_product)
                request.meta['asin'] = asin
                yield request

        # Follow pagination once per page, after all products have been queued.
        next_page = response.xpath('//a[@id="pagnNextLink"]/@href').extract_first()
        if next_page:
            next_page = 'http://www.amazon.in' + next_page
            request = scrapy.Request(next_page, callback=self.parse)
            yield request

    def offer_page(self, response):
        item = response.meta['item']
        seller = response.xpath('//div[@class="a-row a-spacing-mini olpOffer"]/div/h3/span/a/text()').extract()
        price = response.xpath('//div[@class="a-row a-spacing-mini olpOffer"]/div/span/span/text()').extract()
        seller_price = zip(seller, price)
        item['brand'] = response.xpath(
            '//div[@id="olpProductByline"]/text()').extract_first().strip().replace('by ', '')
        item['price'] = '{}'.format(seller_price)
        item['no_of_seller'] = len(seller_price)
        yield item

    def parse_product(self, response):
        def html_to_text(html):
            s = MLStripper()
            s.feed(html)
            return s.get_data()

        asin = response.meta['asin']
        item = AmazonItem()
        item['asin'] = asin
        item['product_name'] = response.xpath('//*[@id="productTitle"]/text()').extract_first().strip()
        item['bullet_point'] = html_to_text(
            response.xpath('//*[@id="feature-bullets"]').extract_first()).strip()
        item['description'] = html_to_text(response.xpath('//*[@id="productDescription"]').extract_first()).strip()
        child_asins = response.xpath('//*[@class="dropdownAvailable"]/@value').extract()
        child_asins = map(lambda x: x.split(',')[-1], child_asins)
        child_asins = ','.join(child_asins)
        item['child_asin'] = child_asins.encode('utf-8', 'ignore')
        offer_page = 'http://www.amazon.in/gp/offer-listing/' + asin
        request = scrapy.Request(offer_page, callback=self.offer_page)
        request.meta['item'] = item
        yield request

【Comments】:

    Tags: python python-2.7 scrapy web-crawler


    【Answer 1】:

    No.

    You can see exactly what is sent to the server by inspecting the request.body and request.headers attributes:

    $ scrapy shell "http://stackoverflow.com"
    >[1]: request.headers
    <[1]: 
    {b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     b'Accept-Encoding': b'gzip,deflate',
     b'Accept-Language': b'en',
     b'User-Agent': b'scrapy'}
    >[2]: request.body
    <[2]: b''
    >[3]: request.method
    <[3]: 'GET'
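
The same point can be checked without Scrapy. What follows is a minimal stdlib-only sketch, not Scrapy's implementation: WireRequest and to_wire are hypothetical names, and the ASIN B00EXAMPLE is made up. It builds roughly the bytes an HTTP/1.1 client would send from the method, URL, headers and body, and the meta dictionary plays no part in that.

```python
from urllib.parse import urlsplit


class WireRequest:
    """Hypothetical stand-in for a crawler request (not scrapy.Request)."""

    def __init__(self, url, method="GET", headers=None, body=b"", meta=None):
        self.url = url
        self.method = method
        self.headers = dict(headers or {})
        self.body = body
        self.meta = dict(meta or {})  # client-side bookkeeping only


def to_wire(request):
    """Build the bytes an HTTP/1.1 client would send for this request."""
    parts = urlsplit(request.url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        "{} {} HTTP/1.1".format(request.method, path),
        "Host: {}".format(parts.netloc),
    ]
    lines += ["{}: {}".format(k, v) for k, v in request.headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"
    # Note that request.meta is never read here.
    return head.encode("ascii") + request.body


req = WireRequest(
    "http://www.amazon.in/dp/B00EXAMPLE",  # made-up ASIN
    headers={"User-Agent": "scrapy"},
    meta={"source": "search-results"},
)
wire = to_wire(req)
print(wire.decode("ascii"))
```

The server only learns what ends up in the URL, headers or body. In the question's spider the ASIN does reach Amazon, but only because it is part of the product URL, not because it was stored in meta.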
    

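    The meta attribute is used only inside Scrapy, to carry data from a request over to the callback that handles its response; it is never sent to the server.
    For example, you make a request to a site with meta={'name': 'foo'}. Scrapy schedules that request, and once the response is ready it creates a Response object that carries the same meta, plus a few meta keys Scrapy fills in itself, and passes it to your Request.callback function.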
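
The round trip described above can be sketched as a toy model. This is an illustration under stated assumptions, not Scrapy's code: ToyRequest, ToyResponse, fake_server and toy_engine are hypothetical names, the ASIN is made up, and the download_latency value is a placeholder for the kind of key the crawler adds on its own. The idea it mirrors is that the response keeps a reference to the request that produced it, so response.meta is simply the request's meta.

```python
class ToyRequest:
    """Hypothetical stand-in for a crawler request (not scrapy.Request)."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = dict(meta or {})  # lives only in the crawler's memory


class ToyResponse:
    """Hypothetical stand-in for a response that remembers its request."""

    def __init__(self, url, body, request):
        self.url = url
        self.body = body
        self.request = request

    @property
    def meta(self):
        # Same dict object as the request's meta; nothing came from the server.
        return self.request.meta


def fake_server(url):
    """Pretend remote server: it is handed the URL and nothing else."""
    return b"<html>page for " + url.encode("ascii") + b"</html>"


def toy_engine(request):
    """Download, wrap the result in a response, then call the callback."""
    body = fake_server(request.url)         # meta is not passed along
    request.meta["download_latency"] = 0.0  # placeholder for crawler-added keys
    response = ToyResponse(request.url, body, request)
    return request.callback(response)


def parse_product(response):
    return {"asin": response.meta["asin"], "url": response.url}


request = ToyRequest(
    "http://www.amazon.in/dp/B00EXAMPLE",  # made-up ASIN
    callback=parse_product,
    meta={"asin": "B00EXAMPLE"},
)
item = toy_engine(request)
print(item)
```

In this model fake_server never sees the meta dict, yet the callback can still read response.meta['asin'], which is the behaviour the question's parse_product and offer_page callbacks rely on.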

    【Discussion】:
