Python scrapy returns incomplete data

【Question title】: Python scrapy returns incomplete data
【Posted】: 2020-12-21 10:50:05
【Question description】:

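I am building a scraper to collect data from a website. There are 58 pages with 12 products each, so the spider should return 58 x 12 = 696 product titles, but it only returns data for 404 products. Here is my code: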

import scrapy
from fundrazr.items import FundrazrItem
from datetime import datetime
import re


class Fundrazr(scrapy.Spider):
    name = "my_scraper"

    # First Start Url
    start_urls = ["https://perfumehut.com.pk/shop/"]

    npages = 57

    # This mimics getting the pages using the next button. 
    for i in range(2, npages + 1):
        start_urls.append("https://perfumehut.com.pk/shop/page/"+str(i)+"")
    
    def parse(self, response):
        for href in response.xpath("//h3[contains(@class, 'product-title')]/a/@href"):
            # add the scheme, eg http://
            url  = "" + href.extract() 
            yield scrapy.Request(url, callback=self.parse_dir_contents) 
                    
    def parse_dir_contents(self, response):
        item = FundrazrItem()

        # Getting Campaign Title
        item['campaignTitle'] = response.xpath("//h1[contains(@class, 'entry-title')]/text()").extract()

        yield item

It is a WooCommerce site. The first page is

https://perfumehut.com.pk/shop/

and the other pages are paginated like this:

https://perfumehut.com.pk/shop/page/2/
https://perfumehut.com.pk/shop/page/3/
and so on, up to page 58.

I would like to know what I am doing wrong in the way I build the page list from npages.

Regards

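【Comments】:

You should check whether the server returns every page correctly (look for any response whose status code is not 200). Have you also tried URL parameters? Increasing the per_page parameter can reduce the number of pages considerably, for example: perfumehut.com.pk/shop/?per_page=500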
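
A related check, offered as a sketch rather than a confirmed diagnosis: the question says the shop has 58 pages, but the spider sets npages = 57, so range(2, npages + 1) stops at page 57 and page 58 is never requested. The helper below rebuilds the URL list in plain Python to make the off-by-one visible. build_start_urls is a hypothetical name that is not part of the original spider, and the trailing slash follows the page URLs listed in the question. A single missing page accounts for at most 12 products, so it cannot explain the whole gap on its own.

```python
BASE_URL = "https://perfumehut.com.pk/shop/"


def build_start_urls(last_page):
    """Return the listing URLs for pages 1..last_page (inclusive).

    Hypothetical helper for illustration; it mirrors the loop in the
    question's spider but takes the last page number as an argument.
    """
    urls = [BASE_URL]  # page 1 has no /page/N/ suffix
    for page in range(2, last_page + 1):
        urls.append(f"{BASE_URL}page/{page}/")
    return urls


# What the question's npages = 57 produces: the list ends at page 57.
urls_57 = build_start_urls(57)
print(len(urls_57), urls_57[-1])

# With the page count stated in the question (58), page 58 is included.
urls_58 = build_start_urls(58)
print(len(urls_58), urls_58[-1])
```

Comparing the two lengths (57 versus 58 URLs) shows that the hard-coded value skips the final page.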

Tags: python python-3.x web-scraping scrapy

【Solution 1】:

import scrapy
from fundrazr.items import FundrazrItem


class Fundrazr(scrapy.Spider):
    name = "my_scraper"

    # Start from the first listing page only; later pages are reached through the pagination links
    start_urls = ["https://perfumehut.com.pk/shop/"]

    def parse(self, response):
        # Read each product title directly from the listing page
        for link in response.xpath("//div[contains(@class, 'products elements-grid ')]/div[contains(@class, 'product-grid-item product ')]/h3/a"):
            # Create a fresh item per product instead of reusing one mutable object
            data = FundrazrItem()
            data['campaignTitle'] = link.xpath("./text()").extract_first()

            yield data

        # Take the link in the last pagination entry; it is None when that entry has no link
        next_page = response.xpath("//ul[@class='page-numbers']/li[last()]/a/@href").extract_first()
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)
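
The spider above differs from the one in the question in two ways: it reads the titles from the listing pages instead of requesting every product page, and it follows the pagination link instead of relying on a hard-coded page count. The sketch below illustrates the second point only. It assumes a simplified, hypothetical pagination markup (not copied from the site) and uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors so that it runs without Scrapy. ElementTree cannot select @href directly, so the attribute is read with .get(). next_page_url is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for a middle page: the last <li> holds a link.
MIDDLE_PAGE = """
<ul class="page-numbers">
  <li><a href="https://perfumehut.com.pk/shop/page/1/">1</a></li>
  <li><span class="current">2</span></li>
  <li><a href="https://perfumehut.com.pk/shop/page/3/">3</a></li>
  <li><a href="https://perfumehut.com.pk/shop/page/3/">next</a></li>
</ul>
"""

# Hypothetical markup for the final page: the last <li> is the current page, with no link.
LAST_PAGE = """
<ul class="page-numbers">
  <li><a href="https://perfumehut.com.pk/shop/page/57/">prev</a></li>
  <li><a href="https://perfumehut.com.pk/shop/page/57/">57</a></li>
  <li><span class="current">58</span></li>
</ul>
"""


def next_page_url(pagination_html):
    """Return the href of the link in the last pagination entry, or None."""
    # Wrap the fragment so the <ul> can be matched by its class attribute.
    root = ET.fromstring("<div>" + pagination_html + "</div>")
    link = root.find("./ul[@class='page-numbers']/li[last()]/a")
    return link.get("href") if link is not None else None


print(next_page_url(MIDDLE_PAGE))  # the page 3 URL, so the crawl continues
print(next_page_url(LAST_PAGE))    # None, so the crawl stops
```

Under these assumptions the crawl ends on its own when the last pagination entry no longer contains a link, which is why no page count has to be maintained by hand.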

【Comments】:

Code-only answers are not very useful. Could you add some explanation of how and why this solves the OP's problem?