XPath selection only returns first response result

【Question title】: XPath selection only returns first response result
【Posted】: 2021-06-26 09:38:11
【Question description】:

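I am still new to Scrapy. When I read data from quotes.toscrape with XPath selectors, I do not get the expected output: every item contains the data of the first quote. As soon as I switch to CSS selectors, everything works as intended. The example is very simple, yet I cannot find the error.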

quotes.py

import scrapy
from quotes_loader.items import QuotesLoaderItem as QL

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com//']

    def parse(self, response):
        item = QL()
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            # CSS-Selector
            # item['author_name'] = quote.css('small.author::text').get()
            # item['quote_text'] = quote.css('span.text::text').get()
            # item['author_link'] = quote.css('small.author + a::attr(href)').get()
            # item['tags'] = quote.css('div.tags > a.tag::text').get()

            # XPath selector
            item['author_name'] = quote.xpath('//small[@class="author"]/text()').get()
            item['quote_text'] = quote.xpath('//span[@class="text"]/text()').get()
            item['author_link'] = quote.xpath('//small[@class="author"]/following-sibling::a/@href').get()
            item['tags'] = quote.xpath('//*[@class="tags"]/*[@class="tag"]/text()').get()

            yield item

        # next_page_url = response.css('li.next > a::attr(href)').get()
        next_page_url = response.xpath('//*[class="next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

items.py

import scrapy
from scrapy.loader import ItemLoader


class QuotesLoaderItem(scrapy.Item):
    # define the fields for your item here like:
    author_name = scrapy.Field()
    quote_text = scrapy.Field()
    author_link = scrapy.Field()
    tags = scrapy.Field()

Result

author_name,quote_text,author_link,tags
Albert Einstein,“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,/author/Albert-Einstein,change
Albert Einstein, ...
...
(20 times)

Thank you for your help.

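【Comments】:

Could you provide an example URL to run this against?
@Forensic_07 Hey, I don't follow. The Python script already contains the URL that the data is scraped from (allowed_domains and start_urls).
My mistake, I thought that was a placeholder URL! Apologies.

Tags: python xpath scrapy css-selectors items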

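【Solution 1】:

Inside the loop, xpath() is called on a Selector object rather than on the response object, so every expression has to be relative to that selector: it must start with .// instead of //. An expression that starts with // is absolute and searches the whole document, which is why each item ended up with the data of the first quote. The syntax has to look like this: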

import scrapy
from quotes_loader.items import QuotesLoaderItem as QL

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com//']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            item = QL()  # create a fresh item per quote so yielded items do not share state
            # CSS-Selector
            # item['author_name'] = quote.css('small.author::text').get()
            # item['quote_text'] = quote.css('span.text::text').get()
            # item['author_link'] = quote.css('small.author + a::attr(href)').get()
            # item['tags'] = quote.css('div.tags > a.tag::text').get()
            
            # XPATH-Selector
            item['author_name'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['quote_text'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author_link'] = quote.xpath('.//small[@class="author"]/following-sibling::a/@href').get()
            item['tags'] = quote.xpath('.//*[@class="tags"]/*[@class="tag"]/text()').get()

            yield item

        # next_page_url = response.css('li.next > a::attr(href)').get()
        # The attribute test needs "@"; [class="next"] would look for a child element named "class".
        next_page_url = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page_url is not None:  # the last page has no "next" link
            yield scrapy.Request(response.urljoin(next_page_url))

【Discussion】: