Scrapy is not crawling the next page URL

【Question title】: Scrapy is not crawling the next page URL
【Posted】: 2018-09-26 00:49:00
【Question】:

My spider does not crawl page 2, even though the XPath returns the correct next-page link, which is an absolute URL to the next page.

Here is my code:

from scrapy import Spider
from scrapy.http import Request, FormRequest



class MintSpiderSpider(Spider):

    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls =  response.xpath('//div[@class = "post-inner post-hover"]/h2/a/@href').extract()

        for url in urls:
            yield Request(url, callback=self.parse_lyrics)

        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(next_page_url, callback=self.parse)


    def parse_foo(self, response):
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name =  response.xpath('//*[@id="lyric"]/h2/text()').extract()

        yield{
            'name' : name,
            'info': info
        }

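【Comments on the question】:

Possible duplicate of "Scrapy: Following pagination link to scrape data".
Please fix the indentation.
The indentation was actually correct; I accidentally split the code into two parts. You can check it now.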

Tags: python web-scraping scrapy

【Solution 1】:

The problem is that next_page_url is a list, but it needs to be a single URL string. In next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract(), use extract_first() instead of extract().

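Update

You must import scrapy, because you are calling yield scrapy.Request(next_page_url, callback=self.parse). Without that import, the name scrapy is undefined inside parse. Alternatively, since Request is already imported from scrapy.http, you can call Request(next_page_url, callback=self.parse) directly.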

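【Discussion】:

Thank you, sir. So this ` next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract() if next_page_url: yield scrapy.Request (next_page_url, callback=self.parse)` should not be inside the for loop, right?
Correct. If it were inside the loop, you would request next_page_url once for every URL on the page.
Sir, I indented the code the way you described, but I get an indentation error on the next_page_url line, even though everything looks right here and the line is outside the loop.
I edited your post with the correct indentation. Copy and paste it, then double-check the indentation.
Sir, I did exactly as you said, but Scrapy only crawled the first page and then stopped.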