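【Posted】: 2021-06-26 09:38:11
【Question】:
I'm still new to Scrapy. When I try to read data from quotes.toscrape, the XPath selectors don't return what I expect. As soon as I switch to CSS selectors, everything works as expected. Even though the example is very simple, I can't find the error.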
quotes.py
import scrapy
from quotes_loader.items import QuotesLoaderItem as QL


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = [
        'http://quotes.toscrape.com//']

    def parse(self, response):
        item = QL()
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            # CSS selectors
            # item['author_name'] = quote.css('small.author::text').get()
            # item['quote_text'] = quote.css('span.text::text').get()
            # item['author_link'] = quote.css('small.author + a::attr(href)').get()
            # item['tags'] = quote.css('div.tags > a.tag::text').get()
            # XPath selectors
            item['author_name'] = quote.xpath('//small[@class="author"]/text()').get()
            item['quote_text'] = quote.xpath('//span[@class="text"]/text()').get()
            item['author_link'] = quote.xpath('//small[@class="author"]/following-sibling::a/@href').get()
            item['tags'] = quote.xpath('//*[@class="tags"]/*[@class="tag"]/text()').get()
            yield item
        # next_page_url = response.css('li.next > a::attr(href)').get()
        next_page_url = response.xpath('//*[class="next"]/a/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)
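One detail in the pagination line of the spider above may be worth checking: the predicate `[class="next"]` has no `@`, so in XPath it tests for a child element named `class` rather than the `class` attribute, and it would likely match nothing on this page. The sketch below illustrates the difference using the standard library's `xml.etree.ElementTree`, chosen only so that it runs without Scrapy (it supports a subset of XPath). The markup in `PAGE` is a made-up, simplified stand-in for the pager, not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the pager markup on the page.
PAGE = """<html><body>
  <ul class="pager">
    <li class="next"><a href="/page/2/">Next</a></li>
  </ul>
</body></html>"""

root = ET.fromstring(PAGE)

# Without "@": looks for a child element <class> whose text is "next".
without_at = root.find(".//*[class='next']/a")

# With "@": tests the class attribute, which is what the spider intends.
with_at = root.find(".//*[@class='next']/a")

print(without_at)           # None
print(with_at.get("href"))  # /page/2/
```

In the spider, the corresponding expression would be `response.xpath('//*[@class="next"]/a/@href')`. Since `.extract_first()` returns `None` when nothing matches, checking the result before building the next request may also help.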
items.py
import scrapy
from scrapy.loader import ItemLoader


class QuotesLoaderItem(scrapy.Item):
    # define the fields for your item here like:
    author_name = scrapy.Field()
    quote_text = scrapy.Field()
    author_link = scrapy.Field()
    tags = scrapy.Field()
Result
author_name,quote_text,author_link,tags
Albert Einstein,“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”,/author/Albert-Einstein,change
Albert Einstein, ...
...
(20 times)
Thank you for your help.
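The repeated rows in the result are consistent with how XPath scopes a path: an expression beginning with `//` is evaluated from the document root even when it is called on a sub-selector such as `quote`, so every iteration returns the first match on the page, whereas an expression beginning with `.//` stays inside the current node (the Scrapy documentation discusses this under relative XPaths). Below is a minimal sketch of the scoping idea using the standard library's `xml.etree.ElementTree` rather than Scrapy. ElementTree rejects absolute paths on an element, so the document-wide search is emulated here by querying from the root. The markup and author names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-quote page; the authors are placeholders.
PAGE = """<html><body>
  <div class="quote"><span class="text">first</span><small class="author">Author A</small></div>
  <div class="quote"><span class="text">second</span><small class="author">Author B</small></div>
</body></html>"""

root = ET.fromstring(PAGE)
quotes = root.findall(".//div[@class='quote']")

# Document-wide search: the same first match comes back for every quote.
from_root = [root.find(".//small[@class='author']").text for _ in quotes]

# Search relative to each quote node: one author per quote.
per_quote = [q.find(".//small[@class='author']").text for q in quotes]

print(from_root)  # ['Author A', 'Author A']
print(per_quote)  # ['Author A', 'Author B']
```

In the spider, this would correspond to prefixing each path inside the loop with a dot, for example `quote.xpath('.//small[@class="author"]/text()').get()`.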
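【Comments】:
- Could you provide an example URL to run it against?
- @Forensic_07 Hey. I don't understand. The Python script already contains the URL the data is scraped from (allowed_domains and start_urls).
- My mistake, I thought it was a fake URL! Apologies.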
Tags: python xpath scrapy css-selectors items