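【Posted】: 2021-03-14 10:37:22
【Question】:
I'm using Python 3.6 and Scrapy 2.4.1. I wrote a spider that crawls about 5 pages and then saves the results to Excel with xlsxwriter, but the spider only ends up with the last page's data and I don't know why. Here is my spider code: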
import scrapy
from scrapy.selector import Selector
from ebay.items import EbayItem


class EbaySpiderSpider(scrapy.Spider):
    name = 'ebay_spider'
    allowed_domains = ['www.ebay.com.au']
    start_urls = ['https://www.ebay.com.au/sch/auplazaplace/m.html?_nkw=&_armrs=1']

    def parse(self, response):
        item_price_extract = []
        item_title = []
        item_title_list = response.xpath('//h3[@class="lvtitle"]/a')
        item_href = response.xpath('//h3[@class="lvtitle"]/a/@href').getall()
        for title in item_title_list:
            item_title_text = title.xpath('string(.)').get()
            item_title.append(item_title_text)
        item_price = response.xpath('//li[@class="lvprice prc"]//span[@class="bold"]')
        for i in range(len(item_price)):
            item_price_text = item_price[i].xpath('string(.)').get()
            item_price_extract.append(item_price_text.strip())
        item_info = EbayItem(title=item_title, price=item_price_extract, item_href=item_href)
        yield item_info
        next_url_href = response.xpath('//a[@class="gspr next"]/@href').get()
        if not next_url_href:
            return
        else:
            yield scrapy.Request(next_url_href, callback=self.parse)
And the pipeline code:
import xlsxwriter


class EbayPipeline:
    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        col_num = 0
        workbook = xlsxwriter.Workbook(r'C:\Users\Clevo\Desktop\store_spider.xlsx')
        worksheet = workbook.add_worksheet()
        item_source = dict(item)
        # print(item_source)
        for key, values in item_source.items():
            worksheet.write(0, col_num, key)
            worksheet.write_column(1, col_num, values)
            col_num += 1
        workbook.close()
        return item
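Does anyone know why? Everything looks fine, but I only get the last page of data.
By the way, is there a way to pass data to another function? I want to scrape each item's detail page, pass that data to the process_item function, and yield everything together.
【Discussion】: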
-
You don't need to include www in the allowed domains. Change ['www.ebay.com.au'] to ['ebay.com.au'] (reference: docs.scrapy.org/en/latest/topics/…)
-
Wait, I'll refactor your code.
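
One thing that stands out in the pipeline from the question: process_item creates a new xlsxwriter.Workbook at the same path for every item, so each page's file replaces the one written before it, which would leave only the last page on disk. Below is a minimal sketch of one way around that, not the only one: collect the columns in memory and write the workbook once in close_spider. It assumes each item holds lists per field, as in the question. The output_path value is a hypothetical placeholder, and the import inside close_spider is just a choice made for this sketch.

```python
from collections import defaultdict


class EbayPipeline:
    """Collect every page's columns and write the workbook once, at shutdown."""

    # Hypothetical output path; the question used an absolute Desktop path.
    output_path = "store_spider.xlsx"

    def open_spider(self, spider):
        # One list per field, shared by all pages of the crawl.
        self.columns = defaultdict(list)

    def process_item(self, item, spider):
        # Each item carries one page's worth of lists (title, price, item_href),
        # so extend the running columns instead of writing a file here.
        for key, values in dict(item).items():
            self.columns[key].extend(values)
        return item

    def close_spider(self, spider):
        # Third-party dependency; imported here because it is only needed
        # when the file is finally written.
        import xlsxwriter

        workbook = xlsxwriter.Workbook(self.output_path)
        worksheet = workbook.add_worksheet()
        for col_num, (key, values) in enumerate(self.columns.items()):
            worksheet.write(0, col_num, key)            # header in row 0
            worksheet.write_column(1, col_num, values)  # data from row 1 down
        workbook.close()
```

With this shape the spider itself can stay as it is. Scrapy calls open_spider once, process_item once per yielded page item, and close_spider once when the crawl ends, so the file is written a single time with all pages in it.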