【Question title】: Scrapy only gets the data of the last page
【Posted】: 2021-03-14 10:37:22
【Question】:

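I am using Python 3.6 and Scrapy 2.4.1. I wrote a spider to crawl about 5 pages and save the results to Excel with xlsxwriter, but the spider only gets the data from the last page and I do not know why. Here is my spider code: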

import scrapy
from scrapy.selector import Selector
from ebay.items import EbayItem


class EbaySpiderSpider(scrapy.Spider):
    name = 'ebay_spider'
    allowed_domains = ['www.ebay.com.au']
    start_urls = ['https://www.ebay.com.au/sch/auplazaplace/m.html?_nkw=&_armrs=1']

    def parse(self, response):
        item_price_extract = []
        item_title = []
        item_title_list = response.xpath('//h3[@class="lvtitle"]/a')
        item_href = response.xpath('//h3[@class="lvtitle"]/a/@href').getall()
        for title in item_title_list:
            item_title_text = title.xpath('string(.)').get()
            item_title.append(item_title_text)
        item_price = response.xpath('//li[@class="lvprice prc"]//span[@class="bold"]')
        for i in range(len(item_price)):
            item_price_text = item_price[i].xpath('string(.)').get()
            item_price_extract.append(item_price_text.strip())
        item_info = EbayItem(title=item_title, price=item_price_extract, item_href=item_href)
        yield item_info
        next_url_href = response.xpath('//a[@class="gspr next"]/@href').get()
        if not next_url_href:
            return
        else:
            yield scrapy.Request(next_url_href, callback=self.parse)

And the pipeline code:

import xlsxwriter


class EbayPipeline:
    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        col_num = 0
        workbook = xlsxwriter.Workbook(r'C:\Users\Clevo\Desktop\store_spider.xlsx')
        worksheet = workbook.add_worksheet()
        item_source = dict(item)
        # print(item_source)
        for key, values in item_source.items():
            worksheet.write(0, col_num, key)
            worksheet.write_column(1, col_num, values)
            col_num += 1
        workbook.close()
        return item

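Does anyone know the reason? Everything looks fine, but I only get the data from the last page.

By the way, is there a way to pass data to another function? I want to scrape each item's detail page, pass that data along to the process_item function, and yield everything together.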

【Comments】:

Tags: python scrapy


【Solution 1】:

It is better to crawl each listing page first, and then get the data from each product page.

import scrapy
from ebay.items import EbayItem


class EbaySpiderSpider(scrapy.Spider):
    name = "ebay_spider"

    def start_requests(self):
        base_url = 'https://www.ebay.com.au/sch/auplazaplace/m.html?_nkw=&_armrs='
        for i in range(1, 6):
            # i is used as the page number and appended to base_url
            page = base_url + str(i)
            yield scrapy.Request(url=page, callback=self.parse)

    # collect all product links on a listing page and hand each one to parse_contents
    def parse(self, response):
        links = response.xpath('//h3[@class="lvtitle"]/a/@href').getall()
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_contents)

    # scrape the desired data from the product page
    def parse_contents(self, response):
        # .get() returns None when nothing matches, instead of raising IndexError
        title = response.xpath('//h1/text()').get()
        price = response.xpath('//span[@itemprop="price"]/text()').get()
        item = EbayItem()
        item['product_title'] = title
        item['product_price'] = price
        yield item  # handed on to the item pipeline

items.py: make sure every item key is declared as a scrapy.Field()

import scrapy


class EbayItem(scrapy.Item):
    product_title = scrapy.Field()
    product_price = scrapy.Field()

pipelines.py

import xlsxwriter


class EbayPipeline:
    def process_item(self, item, spider):
        title = item['product_title']
        price = item['product_price']
        # process your worksheet here
        return item
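
A likely reason only the last page ends up in the question's file: process_item there builds a new xlsxwriter.Workbook at the same path for every item, so each page's file replaces the one written before it. Below is a minimal sketch of one way to fill in the worksheet step, assuming the product_title and product_price fields shown above. The output_path value and the _new_workbook helper are hypothetical names chosen for illustration, not part of Scrapy or xlsxwriter.

```python
class EbayPipeline:
    """Writes every item into one workbook that lives for the whole crawl."""

    # Hypothetical output location; adjust to your own path.
    output_path = "store_spider.xlsx"
    headers = ["product_title", "product_price"]

    def _new_workbook(self):
        # Imported here so the module still loads where xlsxwriter is absent.
        import xlsxwriter
        return xlsxwriter.Workbook(self.output_path)

    def open_spider(self, spider):
        # Create the workbook once per crawl, not once per item.
        self.workbook = self._new_workbook()
        self.worksheet = self.workbook.add_worksheet()
        self.worksheet.write_row(0, 0, self.headers)
        self.next_row = 1

    def process_item(self, item, spider):
        # Append one spreadsheet row per item, below the rows already written.
        row = [item.get(key, "") for key in self.headers]
        self.worksheet.write_row(self.next_row, 0, row)
        self.next_row += 1
        return item

    def close_spider(self, spider):
        # The file is only written to disk when the workbook is closed.
        self.workbook.close()
```

Scrapy calls open_spider and close_spider once each per crawl, so the file is opened and saved a single time while process_item only appends rows.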

【Comments】:

    【Solution 2】:

    A working version of the code:

    import scrapy
    from scrapy.selector import Selector
    from ebay.items import EbayItem
    
    
    class EbaySpiderSpider(scrapy.Spider):
        name = 'ebay_spider'
        allowed_domains = ['ebay.com.au']
        start_urls = ['https://www.ebay.com.au/sch/auplazaplace/m.html?_nkw=&_armrs=1']
    
        def parse(self, response):
            item_price_extract = []
            item_title = []
            item_title_list = response.xpath('//h3[@class="lvtitle"]/a')
            item_href = response.xpath('//h3[@class="lvtitle"]/a/@href').getall()
            for title in item_title_list:
                item_title_text = title.xpath('string(.)').get()
                item_title.append(item_title_text)
            item_price = response.xpath('//li[@class="lvprice prc"]//span[@class="bold"]')
            for i in range(len(item_price)):
                item_price_text = item_price[i].xpath('string(.)').get()
                item_price_extract.append(item_price_text.strip())
            item_info = EbayItem(title=item_title, price=item_price_extract, item_href=item_href)
            yield item_info
            next_url_href = response.xpath('//a[@class="gspr next"]/@href').get()
            if next_url_href is not None:
                next_url_href = response.urljoin(next_url_href)
                yield scrapy.Request(next_url_href, callback=self.parse)
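
    Compared with the spider in the question, this version widens allowed_domains to 'ebay.com.au' and passes the next-page href through response.urljoin before requesting it. As far as I know, response.urljoin resolves a link against the page URL in the same way as the standard library's urljoin, so the effect can be sketched without Scrapy. Both hrefs below are made-up examples, not values taken from the eBay page.

```python
from urllib.parse import urljoin

page_url = "https://www.ebay.com.au/sch/auplazaplace/m.html?_nkw=&_armrs=1"

# Hypothetical relative href, for illustration only.
relative_href = "/sch/auplazaplace/m.html?_pgn=2"
# Hypothetical absolute href; an absolute link is returned unchanged.
absolute_href = "https://www.ebay.com.au/sch/auplazaplace/m.html?_pgn=3"

next_from_relative = urljoin(page_url, relative_href)
next_from_absolute = urljoin(page_url, absolute_href)

print(next_from_relative)
print(next_from_absolute)
```

    Either way the result is a full URL, which is what scrapy.Request expects.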
    

    You must set ROBOTSTXT_OBEY=False in settings.py (this is not good practice); otherwise your spider will not scrape any data and will log the message: [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.ebay.com.au/sch/auplazaplace/m.html?_nkw=&_armrs=1>
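
    For reference, a small sketch of where that setting can live. The first assignment is the project-wide form for settings.py. The second shows the same override as a custom_settings dictionary, which Scrapy reads when it is defined as a class attribute on the spider; it is written at module level here only to keep the fragment short.

```python
# settings.py: applies to every spider in the project
ROBOTSTXT_OBEY = False

# Alternative: put this attribute inside the spider class instead,
# so the override is limited to that one spider.
custom_settings = {"ROBOTSTXT_OBEY": False}
```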

    【Comments】:
