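【Posted】: 2021-03-05 04:30:52
【Question】:
I am working on a Google Scholar scraper that should iterate over several queries across a range of years and return the top 30 items for each year, written to a formatted CSV file. However, every time I run the program, there are some requests for which next_page comes back as None from the response.xpath call, even though the URL of every request is identical apart from the year.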
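To make the "identical apart from the year" claim concrete, here is a minimal sketch that builds two first-page URLs with the same parameters the spider uses and compares their query strings. The helper names build_year_url and differing_params are hypothetical and only serve this illustration; the query string is written on one line for brevity.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = 'https://scholar.google.com/scholar?'


def build_year_url(query, year):
    """Return the Scholar search URL restricted to a single year."""
    params = {'hl': 'en', 'q': query,
              'as_ylo': str(year), 'as_yhi': str(year)}
    return BASE_URL + urlencode(params)


def differing_params(url_a, url_b):
    """Return the sorted names of query parameters whose values differ."""
    qa = parse_qs(urlsplit(url_a).query)
    qb = parse_qs(urlsplit(url_b).query)
    return sorted(k for k in qa.keys() | qb.keys() if qa.get(k) != qb.get(k))


query = ('(Extinct OR Extinction) AND '
         '("Loxodonta africana" OR "african elephant")')
url_2019 = build_year_url(query, 2019)
url_2020 = build_year_url(query, 2020)

# Only the two year bounds differ between the requests.
print(differing_params(url_2019, url_2020))
```

This only covers the first page of each year. The URLs of later pages come from the href found in the response, so their parameter set and order are whatever the page supplies.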
Here is the body of the spider:
from urllib.parse import urlencode

import scrapy

# get_url() is defined elsewhere in the project and is not shown here.


class ExampleSpider(scrapy.Spider):
    name = 'worktime'
    allowed_domains = ['api.scraperapi.com']
    years = [2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011,
             2010, 2009, 2008, 2007, 2006, 2005]
    query = ('(Extinct OR Extinction) AND ("Loxodonta africana" OR "african '
             'elephant")')
    start_urls = ['https://scholar.google.com/scholar?']

    def yield_year(self):
        # Pop the next year and build the request for its first results page.
        if self.years:
            year = self.years.pop()
            url = 'https://scholar.google.com/scholar?' + urlencode({
                'hl': 'en', 'q': self.query, 'as_ylo': str(year),
                'as_yhi': str(year)})
            return scrapy.Request(get_url(url), self.parse_item_list,
                                  meta={'position': 0})
        else:
            print("All done")

    def parse(self, response):
        print(response.url)
        yield self.yield_year()

    def parse_item_list(self, response):
        position = response.meta['position']
        # Takes the last four characters of the URL as the year, which
        # relies on as_yhi being the final query parameter.
        year_published = response.url[-4:]
        for res in response.xpath('//*[@data-rp]'):
            link = res.xpath('.//h3/a/@href').extract_first()
            temp = res.xpath('.//h3/a//text()').extract()
            if not temp:
                # No linked title: fall back to the span text (citation entry).
                title = "[C] " + "".join(
                    res.xpath('.//h3/span[@id]//text()').extract())
            else:
                title = "".join(temp)
            # snippet = "".join(
            #     res.xpath('.//*[@class="gs_rs"]//text()').extract())
            # cited = res.xpath(
            #     './/a[starts-with(text(),"Cited")]/text()').extract_first()
            # temp = res.xpath(
            #     './/a[starts-with(text(),"Related")]/@href').extract_first()
            # related = "https://scholar.google.com" + temp if temp else ""
            # num_versions = res.xpath(
            #     './/a[contains(text(),"version")]/text()').extract_first()
            published_data = "".join(
                res.xpath('.//div[@class="gs_a"]//text()').extract())
            position += 1
            item = {'Title': title, 'Author': published_data,
                    'Year': year_published}
            yield item
        # URL of the next page
        next_page = response.xpath(
            '//td[@align="left"]/a/@href').extract_first()
        if position < 30 and next_page is not None:
            url = "https://scholar.google.com" + next_page
            yield scrapy.Request(get_url(url), self.parse_item_list,
                                 meta={'position': position})
        else:
            # Either 30 items were collected or no next link was found.
            yield self.yield_year()
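A side note on `response.url[-4:]`: the slice only returns the year while `as_yhi` happens to be the last thing in the URL. A minimal sketch of reading the parameter by name instead is shown below. The helper name extract_year is hypothetical, and because get_url() is not shown, the proxy URL shape used here (target passed in a `url` query parameter, placeholder API key) is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def extract_year(url):
    """Return the as_yhi value found in the URL, or None if it is absent.

    Also looks inside a `url` query parameter, on the assumption that the
    proxy wrapper passes the target URL that way.
    """
    params = parse_qs(urlsplit(url).query)
    if 'as_yhi' in params:
        return params['as_yhi'][0]
    if 'url' in params:
        # Recurse into the wrapped target URL.
        return extract_year(params['url'][0])
    return None


# A made-up second-page URL in which as_yhi is not the last parameter.
target = 'https://scholar.google.com/scholar?' + urlencode({
    'start': '10', 'q': 'african elephant', 'hl': 'en',
    'as_ylo': '2019', 'as_yhi': '2019', 'as_sdt': '0,5'})
# Assumed wrapper format with a placeholder key.
wrapped = 'http://api.scraperapi.com/?' + urlencode({
    'api_key': 'HYPOTHETICAL_KEY', 'url': target})

print(wrapped[-4:])           # no longer the year
print(extract_year(wrapped))  # the year, read by parameter name
```

An alternative that avoids URL parsing altogether is to put the year into the request's meta dict next to 'position' and read it back in parse_item_list.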
How can I make sure the spider gets the next_page URL without hard-coding the link to the next page into the parse_item_list function?
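For debugging, it may help to separate two situations that the current `else` branch treats the same way: a year that simply has no further pages, and a response that contains no results at all (for example, if the proxy returned something other than a normal results page, which is a possibility and not something confirmed here). The sketch below is a pure decision function under that assumption. The name plan_next_step, the retries counter, and the max_retries limit are hypothetical.

```python
def plan_next_step(position, result_count, next_page, retries,
                   limit=30, max_retries=3):
    """Decide what to do after one page has been parsed.

    position:     total items collected for the current year so far
    result_count: items found on this page
    next_page:    href of the next-page link, or None if not found
    retries:      how often this page has already been re-requested
    Returns 'next_page', 'retry', or 'next_year'.
    """
    if position >= limit:
        return 'next_year'
    if next_page is not None:
        return 'next_page'
    if result_count == 0 and retries < max_retries:
        # Neither results nor a pagination link: worth another attempt.
        return 'retry'
    # Results but no link (or retries exhausted): move on to the next year.
    return 'next_year'


print(plan_next_step(position=10, result_count=10,
                     next_page='/scholar?start=10', retries=0))
print(plan_next_step(position=0, result_count=0, next_page=None, retries=0))
print(plan_next_step(position=7, result_count=7, next_page=None, retries=0))
```

Inside the spider, a 'retry' outcome would mean re-issuing the same request with `dont_filter=True` and an incremented retry counter in meta, so that Scrapy's duplicate filter does not drop it. This does not explain why the link is missing; it only makes the two cases visible in the logs.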
【Discussion】:
Tags: python web-scraping scrapy