【Question title】: Is there a way of stopping next_page from being None?
【Posted】: 2021-03-05 04:30:52
【Question】:

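I'm currently working on a Google Scholar scraper that should iterate over several queries across a range of years and return the top 30 results for each year, written to a formatted CSV file. However, every time I run the program there are some instances where the next_page variable is None after the call to response.xpath, even though the URL of each request is identical apart from the year.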
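To make the "identical apart from the year" point concrete, here is a minimal sketch of how the per-year URL is assembled. It mirrors the urlencode call in the spider's yield_year method; the helper names build_year_url and query_params are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://scholar.google.com/scholar?"
QUERY = ('(Extinct OR Extinction) AND ("Loxodonta africana" OR "african '
         'elephant")')


def build_year_url(year, query=QUERY):
    """Return the Scholar search URL restricted to a single year (hypothetical helper)."""
    params = {"hl": "en", "q": query,
              "as_ylo": str(year), "as_yhi": str(year)}
    return BASE + urlencode(params)


def query_params(url):
    """Parse the query string of a URL into a dict of lists (hypothetical helper)."""
    return parse_qs(urlsplit(url).query)


# Compare the parameters of two requests that differ only in the year.
a = query_params(build_year_url(2019))
b = query_params(build_year_url(2020))
differing = sorted(k for k in a if a[k] != b[k])
print(differing)  # ['as_yhi', 'as_ylo']
```

Only the two year bounds change between requests, so a missing next_page link is unlikely to come from the URL itself and more likely from the page that was returned for it.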

Below is the body of the spider:

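import scrapy
from urllib.parse import urlencode

# get_url() is a helper that is not shown in the question; it presumably
# wraps the target URL for api.scraperapi.com.


class ExampleSpider(scrapy.Spider):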
    name = 'worktime'
    allowed_domains = ['api.scraperapi.com']
    years = [2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011,
             2010, 2009, 2008, 2007, 2006, 2005]
    query = ('(Extinct OR Extinction) AND ("Loxodonta africana" OR "african '
             'elephant")')
    start_urls = ['https://scholar.google.com/scholar?']

    def yield_year(self):
        if self.years:
            year = self.years.pop()
            url = 'https://scholar.google.com/scholar?' + urlencode({
                'hl': 'en', 'q': self.query,
                'as_ylo': str(year), 'as_yhi': str(year)})
            return scrapy.Request(get_url(url), self.parse_item_list,
                                  meta={'position': 0})
        else:
            print("All done")

    def parse(self, response):
        print(response.url)
        yield self.yield_year()

    def parse_item_list(self, response):
        position = response.meta['position']
        year_published = response.url[-4:]
        for res in response.xpath('//*[@data-rp]'):
            link = res.xpath('.//h3/a/@href').extract_first()
            temp = res.xpath('.//h3/a//text()').extract()
            if not temp:
                title = "[C] " + "".join(
                    res.xpath('.//h3/span[@id]//text()').extract())
            else:
                title = "".join(temp)
            # snippet = "".join(
            # res.xpath('.//*[@class="gs_rs"]//text()').extract())
            # cited = res.xpath(
            # './/a[starts-with(text(),"Cited")]/text()').extract_first()
            # temp = res.xpath(
            # './/a[starts-with(text(),"Related")]/@href').extract_first()
            # related = "https://scholar.google.com" + temp if temp else ""
            # num_versions = res.xpath(
            # './/a[contains(text(),"version")]/text()').extract_first()
            published_data = "".join(
                res.xpath('.//div[@class="gs_a"]//text()').extract())
            position += 1
            item = {'Title': title, 'Author': published_data,
                    'Year': year_published}
            yield item

        # URL of the next page
        next_page = response.xpath('//td[@align="left"]/a/@href').extract_first()

        if position < 30 and next_page is not None:
            url = "https://scholar.google.com" + next_page
            yield scrapy.Request(get_url(url), self.parse_item_list, meta={'position': position})

        else:
            yield self.yield_year()

How can I make sure the scraper returns the next_page URL without having to hard-code the link to the next page into the parse_item_list function?

【Comments】:

    Tags: python web-scraping scrapy


    【Solution 1】:

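    Update: the problem is solved. I needed to implement a try-except. The scraper tries to extract the URL of the next page; if no link is extracted, the program raises a TypeError and yields a request for the current URL with dont_filter set to True. I also added a retry_counter, so that if no link is found after 3 attempts, it is most likely because there is no next page, and we move on to the next query.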

                try:
                    if self.page_num < 3 and self.retry_counter < 3:
                        # Find the pagination link whose text starts with
                        # the next page number.
                        next_page = response.xpath(
                            './/div[@id="gs_nml"]/a[starts-with(text(),'
                            + str(self.page_num + 1)
                            + ')]/@href').extract_first()

                        if next_page is not None:
                            self.page_num += 1
                        else:
                            raise TypeError

                except TypeError:
                    print("I got no next page link! Trying again just in case.")
                    self.retry_counter += 1
                    # Re-request the same URL; dont_filter=True bypasses
                    # Scrapy's duplicate-request filter.
                    yield scrapy.Request(response.url, callback=self.parse_item_list,
                                         meta={'position': response.meta['position']},
                                         dont_filter=True)
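    The retry logic described above can also be sketched as a small pure function, which makes the three possible outcomes easier to see and to test outside Scrapy. This is only an illustration under stated assumptions: the names Step and decide are hypothetical, pages are assumed to be numbered from 1, and the resetting of both counters when moving to the next query is an assumption, since the snippet above does not show where page_num and retry_counter are reset.

```python
from typing import NamedTuple, Optional

MAX_PAGES = 3    # mirrors `self.page_num < 3` in the answer
MAX_RETRIES = 3  # mirrors `self.retry_counter < 3` in the answer


class Step(NamedTuple):
    """Outcome of one pagination decision (hypothetical type)."""
    action: str    # "follow", "retry" or "next_query"
    page_num: int  # page number to carry forward
    retries: int   # failed attempts so far on the current page


def decide(next_page: Optional[str], page_num: int, retries: int) -> Step:
    """Decide what to do after trying to extract the next-page link."""
    if page_num >= MAX_PAGES:
        # Enough pages collected for this year: move on.
        # Assumption: counters restart at page 1 with 0 retries.
        return Step("next_query", 1, 0)
    if next_page is not None:
        # Link found: follow it and clear the retry count.
        return Step("follow", page_num + 1, 0)
    retries += 1
    if retries >= MAX_RETRIES:
        # Three misses in a row: most likely there is no next page.
        return Step("next_query", 1, 0)
    # Otherwise re-request the same page (dont_filter=True in Scrapy).
    return Step("retry", page_num, retries)


# The href below is a made-up placeholder, not a real Scholar link.
print(decide("/scholar?start=10", page_num=1, retries=0))
print(decide(None, page_num=1, retries=0))
print(decide(None, page_num=1, retries=2))
```

    In a spider, page_num and retries could travel in the request's meta dict, next to position, instead of living on the spider instance, so that requests for different years do not share the same counters.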
    

    【Comments】:
