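[Posted]: 2022-01-17 15:00:24
[Problem description]:
I am crawling with Scrapy using a link extractor. I believe the XPath expression in my LinkExtractor is correct, but the crawl seems to run endlessly and prints what looks like page source instead of the restaurant names and addresses. I suspect something is wrong with my restrict_xpaths expression, but I can't figure out what it is.
Code: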
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="OhCyu"]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@class="fHibz"]/text()').get(),
            'Address': response.xpath('(//a[@class="fhGHT"])[2]').get()
        }
[Discussion]:
Tags: python web-scraping scrapy
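The "source code" in the output is probably the `Address` field: the XPath `(//a[@class="fhGHT"])[2]` selects an element, and calling `.get()` on an element returns its serialized markup rather than its text. Here is a minimal sketch of that distinction. It uses only the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, and the markup, restaurant name and address are made up for illustration; only the class names come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup (assumption: the real page is
# far larger and its structure may differ).
SAMPLE = """
<div>
  <h1 class="fHibz">Example Restaurant</h1>
  <a class="fhGHT" href="#reviews">120 reviews</a>
  <a class="fhGHT" href="#map">1 Example Road, New Delhi</a>
</div>
"""

root = ET.fromstring(SAMPLE)

# All <a class="fhGHT"> elements, in document order.
links = root.findall(".//a[@class='fhGHT']")

# Python lists are 0-based, XPath positions are 1-based, so links[1]
# corresponds to (//a[@class="fhGHT"])[2] in the question.
second = links[1]

# Serializing the element gives markup, which looks like "source code".
as_markup = ET.tostring(second, encoding="unicode").strip()

# Reading the element's text nodes gives the human-readable value.
as_text = "".join(second.itertext()).strip()

print(as_markup)
print(as_text)
```

In the spider itself the analogous change would be to append `/text()` to the address XPath, i.e. `response.xpath('(//a[@class="fhGHT"])[2]/text()').get()`, assuming the address is a direct text child of that link. Separately, `follow=True` makes the rule extract links again from every page it crawls, which may be why the crawl appears never to finish.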