【Question title】: Using CSS and XPath selectors with Scrapy
【Posted】: 2018-05-12 13:05:19
【Question】:

I am following the official Scrapy tutorial, which scrapes data from http://quotes.toscrape.com. The tutorial shows how to scrape the data with the following spider:

import scrapy


class QuotesSpiderCss(scrapy.Spider):
    name = "quotes_css"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags::text').extract()
            }

When I export this spider's output to a JSON file, it returns the expected content:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
...]

I am trying to write the same spider using XPath instead of CSS:

import scrapy


class QuotesSpiderXpath(scrapy.Spider):
    name = 'quotes_xpath'
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'text': quote.xpath("//span[@class='text']/text()").extract_first(),
                'author': quote.xpath("//small[@class='author']/text()").extract_first(),
                'tags': quote.xpath("//div[@class='tags']/text()").extract()
            }

But this spider returns a list in which every item is the same quote:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
...]

Thanks in advance!

【Comments】:

    Tags: python css xpath scrapy


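    【Solution 1】:

    You always get the same quote because you are not using relative XPaths. An expression that starts with // is evaluated against the whole document, even when it is called on a nested selector such as quote, so every iteration matches the first span, small, and div on the page. See the Scrapy documentation on working with relative XPaths.

    Prefix your XPath expressions with a dot, as in the parse method below: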

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'text': quote.xpath(".//span[@class='text']/text()").extract_first(),
                'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
                'tags': quote.xpath(".//div[@class='tags']/text()").extract()
            }
    

    【Comments】:
