【Question title】: Scrapy Pagination Fails on Multiple Listing
【Posted】: 2018-09-18 09:11:29
【Question】:

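I am trying to scrape a website with Scrapy. Pagination works when I scrape one specific page, but when I try to scrape all the pages in a single run, pagination does not work.
I tried writing a separate function for the pagination, but that did not fix it. Any help would be appreciated. What am I doing wrong? Here is my code: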

# -*- coding: utf-8 -*-
import scrapy

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from scrapy.http import Request

from avtogumi.items import AvtogumiItem


class BasicSpider(scrapy.Spider):
    name = 'gumi'
    allowed_domains = ['avtogumi.bg']
    start_urls = ['https://bg.avtogumi.bg/oscommerce/index.php' ]

    def parse(self, response):

        urls = response.xpath('//div[@class="brands"]//a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_params)


    def parse_params(self, response):

        l = ItemLoader(item=AvtogumiItem(), response=response)

        l.add_xpath('title', '//h4/a/text()')
        l.add_xpath('subtitle', '//p[@class="ft-darkgray"]/text()')
        l.add_xpath('price', '//span[@class="promo-price"]/text()',
            MapCompose(str.strip, str.title))
        l.add_xpath('stock', '//div[@class="product-box-stock"]//span/text()')
        l.add_xpath('category', '//div[@class="labels hidden-md hidden-lg"][0]//text()')
        l.add_xpath('brand', '//h4[@class="brand-header"][0]//text()', 
            MapCompose(str.strip, str.title))
        l.add_xpath('img_path', '//div/img[@class="prod-imglist"]/@src')

        yield l.load_item()

        next_page_url = response.xpath('//li/a[@class="next"]/@href').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse_params)

【Question comments】:

    Tags: python scrapy


    【Solution 1】:

    The problem is in this part:

    l = ItemLoader(item=AvtogumiItem(), response=response)
    
    l.add_xpath('title', '//h4/a/text()')
    l.add_xpath('subtitle', '//p[@class="ft-darkgray"]/text()')
    l.add_xpath('price', '//span[@class="promo-price"]/text()',
        MapCompose(str.strip, str.title))
    l.add_xpath('stock', '//div[@class="product-box-stock"]//span/text()')
    l.add_xpath('category', '//div[@class="labels hidden-md hidden-lg"][0]//text()')
    l.add_xpath('brand', '//h4[@class="brand-header"][0]//text()', 
        MapCompose(str.strip, str.title))
    l.add_xpath('img_path', '//div/img[@class="prod-imglist"]/@src')
    
    yield l.load_item()
    

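    This snippet parses and loads only one result. If a page contains multiple results, you have to put this code inside a for loop and iterate over every search result you want to parse: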

    # 'my_selector_here' is a placeholder for the XPath that matches one result block
    products = response.xpath('my_selector_here')
    for product in products:
        # Bind the loader to this block so the relative XPaths below stay inside it
        l = ItemLoader(item=AvtogumiItem(), selector=product)

        l.add_xpath('title', './/h4/a/text()')
        l.add_xpath('subtitle', './/p[@class="ft-darkgray"]/text()')
        l.add_xpath('price', './/span[@class="promo-price"]/text()',
            MapCompose(str.strip, str.title))
        l.add_xpath('stock', './/div[@class="product-box-stock"]//span/text()')
        # XPath positions are 1-based, so [1] (not [0]) selects the first match
        l.add_xpath('category', './/div[@class="labels hidden-md hidden-lg"][1]//text()')
        l.add_xpath('brand', './/h4[@class="brand-header"][1]//text()',
            MapCompose(str.strip, str.title))
        l.add_xpath('img_path', './/div/img[@class="prod-imglist"]/@src')

        yield l.load_item()
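
One detail worth checking in a loop like this: each query should be evaluated relative to the current result block, otherwise every iteration reads the same nodes from the whole page. The sketch below illustrates the difference using only the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors. The markup and the product names are invented for illustration; only the `full-product-box search-box` class name comes from this thread.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup with two result blocks (illustration only)
PAGE = """
<html><body>
  <div class="full-product-box search-box"><h4><a>Tyre A</a></h4></div>
  <div class="full-product-box search-box"><h4><a>Tyre B</a></h4></div>
</body></html>
"""

root = ET.fromstring(PAGE)
boxes = root.findall('.//div[@class="full-product-box search-box"]')

# Querying the whole document inside the loop returns the same first match each time
unscoped = [root.find('.//h4/a').text for _ in boxes]

# Querying relative to each block returns that block's own title
scoped = [box.find('.//h4/a').text for box in boxes]

print(unscoped)  # ['Tyre A', 'Tyre A']
print(scoped)    # ['Tyre A', 'Tyre B']
```

In Scrapy terms, the scoped version roughly corresponds to calling `.xpath('.//...')` on each selected block (or passing that block to the loader as `selector=`) rather than querying `response` with `//...`.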
    

    Hope this helps.

    【Comments】:

    • Thanks @Woody1193, that solved my problem :)

    【Solution 2】:

    Use this rewritten code:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy import Request
    
    
    class BasicSpider(scrapy.Spider):
        name = 'gumi'
        allowed_domains = ['avtogumi.bg']
        start_urls = ['https://bg.avtogumi.bg/oscommerce/']
    
        def parse(self, response):   
            urls = response.xpath('//div[@class="brands"]//a/@href').extract()
            for url in urls:
                yield Request(url=response.urljoin(url), callback=self.parse_params)
    
        def parse_params(self, response):
            subjects = response.xpath('//div[@class="full-product-box search-box"]')
            for subject in subjects:
                yield {
                    'title': subject.xpath('.//h4/a/text()').extract_first(),
                    'subtitle': subject.xpath('.//p[@class="ft-darkgray"]/text()').extract_first(),
                    'price': subject.xpath('.//span[@class="promo-price"]/text()').extract_first(),
                    'stock': subject.xpath('.//div[@class="product-box-stock"]//span/text()').extract_first(),
                    # XPath positions are 1-based, so [1] (not [0]) selects the first match
                    'category': subject.xpath('.//div[@class="labels hidden-md hidden-lg"][1]//text()').extract_first(),
                    'brand': subject.xpath('.//h4[@class="brand-header"][1]//text()').extract_first(),
                    'img_path': subject.xpath('.//div/img[@class="prod-imglist"]/@src').extract_first(),
                }
            next_page_url = response.xpath('//li/a[@class="next"]/@href').extract_first()
            if next_page_url:
                # urljoin leaves absolute links unchanged and resolves relative ones
                yield Request(url=response.urljoin(next_page_url), callback=self.parse_params)
    

    13407 items were scraped.
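
Whether the next-page `href` is relative or absolute matters when building the follow-up request: Scrapy's `Request` needs an absolute URL, and `response.urljoin` resolves the link against the URL of the page being parsed. The sketch below shows that resolution with the standard library's `urllib.parse.urljoin`, which `response.urljoin` builds on. The `?page=` query strings are made-up values for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin

# URL of the page being parsed; the query string is a made-up example
page_url = "https://bg.avtogumi.bg/oscommerce/index.php?page=1"

# A relative href, as a "next" link might provide (hypothetical value)
relative_next = urljoin(page_url, "index.php?page=2")

# An absolute href passes through unchanged
absolute_next = urljoin(
    page_url, "https://bg.avtogumi.bg/oscommerce/index.php?page=2"
)

print(relative_next)
print(absolute_next)
```

Both calls produce the same absolute URL, so joining is safe whichever form the site uses for its links.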

    【Comments】:
