Scrapy crawls pages but doesn't scrape items

【Question title】: Scrapy crawls pages but doesn't scrape items
【Posted】: 2018-09-22 16:33:09
【Question】:

Here is my spider:

from scrapy import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from Diplom.items import QuestionItem


class ConsultSpider(CrawlSpider):
    name = "consultation"
    allowed_domains = ['health.mail.ru']
    start_urls = ['https://health.mail.ru/consultation/1579497']

    rules = {
        Rule(LinkExtractor(allow=('.*\/consultation\/\d+'),), callback="parse_item", follow=True),

     }

    def parse_item(self, response):
        items = []
        root = Selector(response)
        posts = root.xpath('/html/body/div[2]/div[1]/div[5]/div/div[1]/div[1]/div[2]')
        for post in posts:
            item = QuestionItem()
            item['question'] = post.xpath(
            '//div[1]/div/div/div[2]/div[2]').extract()
            item['answer'] = post.xpath('//div[3]/div[2]/div[2]').extract()
            items.append(item)
        return items

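The problem is that the spider does follow the links described in the rules:

INFO: Crawled 8 pages (at 8 pages/min), scraped 0 items (at 0 items/min)

but it doesn't return any items. If I change the base class and write it like this, my code works: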

class ConsultSpider(scrapy.Spider):
....

but then the rules are not applied.

【Question comments】:

Tags: python web-scraping scrapy web-crawler

【Solution 1】:

scrapy.Spider is the simplest spider. It basically visits the URLs defined in start_urls or returned by start_requests().

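Use CrawlSpider when you need "crawling" behavior, that is, extracting links and following them:

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.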

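In other words, scrapy.Spider does not follow rules, while CrawlSpider does. The choice between the two is therefore not the cause of the problem, so please check your XPath selectors.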

【Comments】:

Thanks for the helpful information; your advice helped me as well. The problem was in the XPath.