【Question title】: Scrapy is not crawling through the links
【Posted】: 2022-01-17 15:00:24
【Question】:

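I am crawling with Scrapy using a link extractor. As far as I can tell, the XPath expression I pass to the link extractor is correct, but the crawl seems to run indefinitely and prints what looks like page source instead of the restaurant names and addresses. I suspect something is wrong with my restrict XPath expression, but I can't figure out what.

Code: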

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TripadSpider(CrawlSpider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="OhCyu"]//a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@class="fHibz"]/text()').get(),
            'Address': response.xpath('(//a[@class="fhGHT"])[2]').get()
        }

    Tags: python web-scraping scrapy


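    【Solution 1】:

    It is crawling; if you get no results, try changing your user agent (the USER_AGENT setting). The actual problem is that you forgot to add /text() to the address XPath. Without it, .get() returns the whole serialized <a> element rather than its text, which is why you see markup instead of the address.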

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class TripadSpider(CrawlSpider):
        name = 'tripad'
        allowed_domains = ['tripadvisor.in']
        start_urls = ['https://www.tripadvisor.in/Restaurants-g304551-New_Delhi_National_Capital_Territory_of_Delhi.html']
    
        rules = (
            Rule(LinkExtractor(restrict_xpaths='//div[@class="OhCyu"]//a'), callback='parse_item'),
            Rule(LinkExtractor(restrict_xpaths='//a[contains(@class, "next")]')),   # pagination
        )
    
        def parse_item(self, response):
            yield {
                'title': response.xpath('//h1[@class="fHibz"]/text()').get(),
                'Address': response.xpath('(//a[@class="fhGHT"])[2]/text()').get()
            }
    

    Output:

    {'title': 'Mosaic', 'Address': 'Sector 10 Lobby Level Crowne Plaza Twin District Centre, Rohini, New Delhi 110085 India'}
    {'title': 'Spring', 'Address': 'Plot 4, Dwarka City Centre Radisson Blu, Sector 13, New Delhi 110075 India'}
    {'title': 'Dilli 32', 'Address': 'Maharaja Surajmal Road The Leela Ambience Convention Hotel, Near Yamuna Sports Complex, Vivek Vihar, New Delhi 110002 India'}
    {'title': 'Viva - All Day Dining', 'Address': 'Hospitality District Asset Area 12 Gurgoan sector 28, New Delhi 110037 India'}
    ...
    ...
    ...
    

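    【Comments】:

    • Thanks. The problem was that I had not added /text().
    • I applied pagination here, but it only picks up 60 entries. Do you know why?
    • @KrishanGopalSharma Because it only scrapes the first page; you need to add another rule for the next page. See the edit.
    • Yes, I applied this, but did you notice that you only get 60 entries, no more than that?
    • I get every result from every page. Definitely more than 60.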