Getting scrapy to follow specific links on a page (answers)

【Question title】: Getting scrapy to follow specific links on a page
【Posted】: 2015-08-10 01:44:08
【Question description】:

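I am trying to scrape song lyrics from The Original Hip Hop Lyrics Archive.

I have succeeded in writing a spider that crawls an artist's lyrics if I launch it on an artist page, for example: http://www.ohhla.com/anonymous/aesoprck/.

But when I launch it on this page, which links to the different artist pages, http://www.ohhla.com/all.html, I get nothing.

This is the rule I am trying to use to follow links to artist pages: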

Rule(LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow= True)

This is the rule I am trying to use to follow links to the other pages that contain more links to artist pages:

Rule(LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow= True)

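I modified the tutorial from Scrapy to get this working, because for some reason it did not work when I started a new project.

Here is a complete working example of my spider: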

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ohhlaSpider(CrawlSpider):
    name = "ohhla"
    download_delay = 0.5
    allowed_domains = ["ohhla.com"]
    start_urls = ["http://www.ohhla.com/anonymous/aesoprck/"]
    rules = (Rule (LinkExtractor(restrict_xpaths=('//h3/a/@href',)), follow= True), # trying to follow links to pages with more links to artist pages
             Rule (LinkExtractor(restrict_xpaths=('//pre/a/@href',)), follow= True), # trying to follow links to artist pages
             Rule (LinkExtractor(deny_extensions=("txt"),restrict_xpaths=('//ul/li',)), follow= True), # succeeding in following links to album pages
             Rule (LinkExtractor(restrict_xpaths=('//ul/li',)), callback="extract_text", follow= False),) # succeeding in extracting lyrics from the songs on album pages

    def extract_text(self, response):
        """ extract text from webpage"""
        string = response.xpath('//pre/text()').extract()[0]
        with open("lyrics.txt", 'wb') as f:
            f.write(string)

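【Question comments】:

Tags: python web-scraping scrapy scrapy-spider

【Solution 1】:

The second part of this answer can be used to scrape specific links from a web page: https://stackoverflow.com/a/40146522/4418897

【Comments】:

【Solution 2】:

restrict_xpaths should not point to the @href attribute. It should point to the element(s) inside which the link extractor will search for links: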

Rule(LinkExtractor(restrict_xpaths='//h3'), follow=True)

Note that you can specify it as a string instead of a tuple.

You can also use allow to follow every link whose URL matches all*.html:

Rule(LinkExtractor(allow=r'all.*?\.html'), follow=True)
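The allow pattern is a regular expression that is searched for anywhere in the absolute URL, so it can be tried out with the re module alone. The sketch below is only an illustration: all_two.html is assumed to be one of the index pages, and the hall/index.html URL is hypothetical, included to show that the unanchored pattern can match more than intended.

```python
import re

# Same pattern as in the rule above; LinkExtractor applies it with search().
allow_pattern = re.compile(r'all.*?\.html')

candidate_urls = [
    "http://www.ohhla.com/all.html",
    "http://www.ohhla.com/all_two.html",         # assumed index page
    "http://www.ohhla.com/anonymous/aesoprck/",  # artist page from the question
    "http://www.ohhla.com/hall/index.html",      # hypothetical URL
]

followed = [url for url in candidate_urls if allow_pattern.search(url)]
for url in followed:
    print(url)
```

The artist page is skipped, but the hypothetical hall/index.html URL is accepted because "all" appears inside "hall". If that matters, a tighter pattern such as r'/all[^/]*\.html$' is one option.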

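You should also make sure your spider is actually visiting that "parent directory" page. Starting the crawl there sounds logical, since it is the index page of the directory: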

start_urls = ["http://www.ohhla.com/all.html"]

【Comments】: