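【Question title】: XPath selection only returns first response result
【Posted】: 2021-06-26 09:38:11
【Question】:

I'm still new to Scrapy. While trying to read data from quotes.toscrape, I don't get the expected content back when using XPath selectors. As soon as I use CSS selectors, everything works as expected. I can't find the error, even though the example is very simple.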

quotes.py

import scrapy
from quotes_loader.items import QuotesLoaderItem as QL

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com//']

    def parse(self, response):
        item = QL()
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            # CSS-Selector
            # item['author_name'] = quote.css('small.author::text').get()
            # item['quote_text'] = quote.css('span.text::text').get()
            # item['author_link'] = quote.css('small.author + a::attr(href)').get()
            # item['tags'] = quote.css('div.tags > a.tag::text').get()

            # XPATH-Selektor
            item['author_name'] = quote.xpath('//small[@class="author"]/text()').get()
            item['quote_text'] = quote.xpath('//span[@class="text"]/text()').get()
            item['author_link'] = quote.xpath('//small[@class="author"]/following-sibling::a/@href').get()
            item['tags'] = quote.xpath('//*[@class="tags"]/*[@class="tag"]/text()').get()

            yield item

        # next_page_url = response.css('li.next > a::attr(href)').get()
        next_page_url = response.xpath('//*[class="next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

items.py

import scrapy
from scrapy.loader import ItemLoader


class QuotesLoaderItem(scrapy.Item):
    # define the fields for your item here like:
    author_name = scrapy.Field()
    quote_text = scrapy.Field()
    author_link = scrapy.Field()
    tags = scrapy.Field()

Result

author_name,quote_text,author_link,tags
Albert Einstein,“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,/author/Albert-Einstein,change
Albert Einstein, ...
...
(20 times)

Thanks for your help.

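【Comments】:

  • Could you provide an example URL to run it against?
  • @Forensic_07 Hey. I don't understand. The Python script has a URL from which the data is scraped. (allowed_domains and start_urls)
  • My mistake, I thought it was a dummy URL! Apologies.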

Tags: python xpath scrapy css-selectors items


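【Solution 1】:

I was calling .xpath() on a Selector object (each quote) rather than on the Response object. An XPath that starts with // is absolute and searches the whole document even when it is called on a sub-selector, so every iteration returned the first match on the page. Prefixing the path with a dot (.//) makes it relative to the current quote, so the syntax has to look like this.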

import scrapy
from quotes_loader.items import QuotesLoaderItem as QL

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com//']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:
            # Create a fresh item per quote instead of reusing one instance
            item = QL()

            # CSS-Selector
            # item['author_name'] = quote.css('small.author::text').get()
            # item['quote_text'] = quote.css('span.text::text').get()
            # item['author_link'] = quote.css('small.author + a::attr(href)').get()
            # item['tags'] = quote.css('div.tags > a.tag::text').get()

            # XPATH-Selector: the leading "." keeps each path relative to this quote
            item['author_name'] = quote.xpath('.//small[@class="author"]/text()').get()
            item['quote_text'] = quote.xpath('.//span[@class="text"]/text()').get()
            item['author_link'] = quote.xpath('.//small[@class="author"]/following-sibling::a/@href').get()
            # .get() returns only the first tag; .getall() would return all of them
            item['tags'] = quote.xpath('.//*[@class="tags"]/*[@class="tag"]/text()').get()

            yield item

        # next_page_url = response.css('li.next > a::attr(href)').get()
        # Attribute tests need "@": [@class="next"], not [class="next"]
        next_page_url = response.xpath('//*[@class="next"]/a/@href').get()
        # The last page has no "next" link, so only follow it when present
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(absolute_next_page_url)
