【Question title】: XPath doesn't correctly select HTML in Scrapy's parse
【Posted】: 2018-08-01 18:15:04
【Question】:

I am trying to parse product names from the Target search page using Scrapy and Splash. I send the request through Splash with yield SplashRequest(url=i, callback=self.parse, headers = {"User-Agent": ua.chrome}) and then use the parse function to extract product_name:

def parse(self, response):

    print("INSIDE PARSE TARGET")
    for product in response.xpath('//div[@data-test="productGridContainer"]/div[2]/ul/li//div[@data-test="product-card"]'):

        print("in PRODUCT")

        print(product)
        product_name = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@aria-label').extract_first()
        print("Product name: " + str(product_name))
        print("ratio: " + str(fuzz.partial_ratio(target_name.lower(), product_name.lower())))

        if fuzz.partial_ratio(target_name.lower(), product_name.lower()) > self.max_score:
            self.max_score = fuzz.partial_ratio(target_name.lower(), product_name.lower())
            self.product_page = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@href').extract_first()
            print("product_page: " + self.product_page)

        print("---------------------------------------")

    print("***********************************")
    print("max_score is: " + str(self.max_score))
    self.product_page = response.urljoin(self.product_page)
    print("FOUND PRODUCT AT PAGE: " + self.product_page)
    yield SplashRequest(url=self.product_page, callback=self.parseProduct, headers = {"User-Agent": ua.chrome})

However, this is all the output I get. It never enters the for loop, and I don't understand why.

2018-08-01 14:08:04 [scrapy.core.engine] INFO: Spider opened
2018-08-01 14:08:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-01 14:08:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6044
2018-08-01 14:08:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.target.com/s?searchTerm=google+home+%2B via http://localhost:8050/render.html> (referer: None)
INSIDE PARSE TARGET
***********************************
max_score is: 0
FOUND PRODUCT AT PAGE: https://www.target.com/s?searchTerm=google+home+%2B
2018-08-01 14:08:07 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.target.com/s?searchTerm=google+home+%2B> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-08-01 14:08:07 [scrapy.core.engine] INFO: Closing spider (finished)

【Comments】:

    Tags: xpath scrapy screen-scraping scrapy-splash


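    【Solution 1】:

    There is no loop in your crawler, as this log line shows:

    DEBUG: Filtered duplicate request: <GET https://www.target.com/s?searchTerm=google+home+%2B> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

    You are trying to crawl the page you have just crawled a second time, and Scrapy's duplicate filter is dropping that request.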
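
To make the filtering behaviour concrete, here is a minimal sketch of how a duplicate filter of this kind can work. It is a deliberate simplification and not Scrapy's actual implementation: the class name SeenRequests is hypothetical, and it fingerprints only the URL, whereas Scrapy's own filter takes more of the request into account.

```python
import hashlib


class SeenRequests:
    """Toy duplicate filter: remembers a hash of every URL it has been shown."""

    def __init__(self):
        self._fingerprints = set()

    def is_duplicate(self, url):
        # Hash the URL so that only fixed-size fingerprints are stored.
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False


seen = SeenRequests()
search_url = "https://www.target.com/s?searchTerm=google+home+%2B"

first_visit = seen.is_duplicate(search_url)   # URL not seen before
second_visit = seen.is_duplicate(search_url)  # same URL requested again
```

The second request for the same URL is the one that gets dropped, which matches the "Filtered duplicate request" line in the log. Scrapy requests also accept dont_filter=True to bypass the filter, but that would only re-fetch the search page here rather than fix the empty selection.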

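    Your self.product_page appears to resolve to the same URL as the page you are already on, rather than to a new one. I refactored your code a little to try to understand the problem: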

    def parse(self, response):
        # Assumes fuzz (e.g. from fuzzywuzzy), SplashRequest (scrapy_splash)
        # and ua are available at module level, as in the question.
        products = response.xpath('//div[@data-test="productGridContainer"]/div[2]/ul/li//div[@data-test="product-card"]')
        max_score = 0
        target_name = '???'  # placeholder for the product name being searched
        product_page = None
        for product in products:
            name = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@aria-label').extract_first()
            url = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@href').extract_first()
            if not name or not url:
                continue  # skip cards without a title link
            if response.urljoin(url) == response.url:
                continue  # avoid crawling the current page
            ratio = fuzz.partial_ratio(target_name.lower(), name.lower())
            if ratio > max_score:  # compare against the local best score
                max_score = ratio
                product_page = url

        if product_page:
            print(f'max_score: {max_score}')
            print(f'product: {product_page}')
            yield SplashRequest(response.urljoin(product_page),
                                callback=self.parse_product,
                                headers={"User-Agent": ua.chrome})
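
The reason the follow-up request points back at the search page can be shown with the standard library alone. This is a sketch under two assumptions: that self.product_page starts out as an empty string in the question's spider (its initial value is not shown), and that response.urljoin resolves links the same way urllib.parse.urljoin does against the response URL. The product path below is hypothetical.

```python
from urllib.parse import urljoin

search_url = "https://www.target.com/s?searchTerm=google+home+%2B"

# Nothing matched in the loop, so the stored link is still empty (assumed).
# Joining an empty link onto the base URL returns the base URL unchanged.
unchanged = urljoin(search_url, "")

# A hypothetical relative product link, for illustration only.
product_url = urljoin(search_url, "/p/example-product")

print(unchanged)
print(product_url)
```

With an empty link the result is the search URL itself, which is what the "FOUND PRODUCT AT PAGE" line in the question's output shows. Guarding the final request with if product_page: avoids issuing that request at all.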
    

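    【Comments】:

    • But your code is no different from mine. I tried your code and it still doesn't enter the for loop for product in products:. My XPath is correct, so I don't know why this isn't working.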
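
The thread does not settle why the loop is never entered. One thing worth checking, offered here as an assumption rather than a confirmed diagnosis, is whether the product cards exist at all in the HTML that Splash returned: an XPath can be correct for the page seen in a browser and still match nothing if the grid had not been rendered yet. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, a shortened selector, and two made-up, well-formed snippets to show the difference.

```python
import xml.etree.ElementTree as ET

# Shortened version of the selector from the question (ElementTree supports
# only a subset of XPath, so the full expression is not used here).
CARD_XPATH = ".//div[@data-test='product-card']"

# Hypothetical markup after JavaScript has filled in the product grid.
rendered = """<html><body><div data-test="productGridContainer"><ul>
<li><div data-test="product-card"><a aria-label="Google Home" href="/p/example-product"/></div></li>
</ul></div></body></html>"""

# Hypothetical markup captured before the grid was populated.
unrendered = """<html><body><div data-test="productGridContainer"/></body></html>"""


def count_cards(markup):
    """Return how many product cards the selector finds in the markup."""
    return len(ET.fromstring(markup).findall(CARD_XPATH))


rendered_count = count_cards(rendered)
unrendered_count = count_cards(unrendered)
print(rendered_count, unrendered_count)
```

In the spider, the equivalent check is to print len(response.xpath(...)) or save response.text to a file and search it for product-card. If the cards are missing, giving Splash time to render, for example through the wait entry in SplashRequest's args, is one option to try.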