【Question Title】: Scrapy: how to get information from all tabs on the page?
【Posted】: 2023-03-21 13:00:02
【Question】:

On this page, I need to get information from all tabs (Profile, Reviews, Phone Number, and Directions).

wellness.py

def profile(self, response):
    # Label spans; each value is read from the first span that follows its label.
    services = response.xpath('.//span[contains(text(),"Services")]')
    education = response.xpath('.//span[contains(text(),"Education")]')
    training = response.xpath('.//span[contains(text(),"Training")]')

    yield {
        'First and Last name': response.css('h1::text').get(),
        'About': response.css('.listing-about::text').get(),
        'Services': services.xpath('following-sibling::span[1]/text()').extract(),
        'Primary Specialty': response.css('.normal::text').get(),
        # Join the address fragments into a single space-separated string
        'Address': ' '.join([i.strip() for i in response.css('.office-address span::text').getall()]),
        'Practice': response.css('.years-in-service::text').get(),
        'Education': education.xpath('following-sibling::span[1]/text()').extract(),
        'Training': training.xpath('following-sibling::span[1]/text()').extract(),
        'Consumer Feedback': response.css('.item-rating-container a::text').get(),
    }

【Comments】:

    Tags: html web dom scrapy


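    【Solution 1】:

    Each tab loads a separate page/URL. I think you assumed that because these are tabs, they are all the same page. So you have to collect the data you want from the first page, request the second page and get its data, and then request the third page. You can keep the data from the previous pages by passing the item along in the request's meta attribute. This is how I would do it. Note that the code for the links is correct; you still have to write the selectors for the data points on each page.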

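    import scrapy  # needed at the top of the spider module

    # The methods below belong to the spider class.
    def profile(self, response):
        item = {}
        # '//xpath' is a placeholder: write a real selector for each data point
        item["field1"] = response.xpath('//xpath').get()
        # Get the link to the reviews tab
        review_link = response.css('#reviews_tab a::attr(href)').get()
        # Carry the partially filled item to the next callback through meta
        yield scrapy.Request(response.urljoin(review_link), callback=self.parse_reviews, meta={'item': item})

    def parse_reviews(self, response):
        item = response.meta['item']
        item["field2"] = response.xpath('//xpath').get()  # placeholder selector
        # Get the link to the directions tab
        directions_link = response.css('#directions_tab a::attr(href)').get()
        yield scrapy.Request(response.urljoin(directions_link), callback=self.parse_directions, meta={'item': item})

    def parse_directions(self, response):
        item = response.meta['item']
        item['directions'] = response.xpath('//xpath').get()  # placeholder selector
        # Last tab reached: emit the completed item
        yield item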
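
Because each tab is its own URL, the `href` pulled from a tab link usually has to be turned into an absolute URL before it can be requested. The sketch below uses the standard library's `urllib.parse.urljoin` to show what that resolution looks like; `response.urljoin` in Scrapy behaves much like it, using the response's own URL as the base. The profile URL, the tab hrefs, and the helper name `resolve_tab_urls` are hypothetical and only there for illustration.

```python
from urllib.parse import urljoin

# Hypothetical profile URL and tab hrefs, for illustration only.
PROFILE_URL = "https://example.com/doctors/jane-doe/profile"
TAB_HREFS = {
    "reviews": "/doctors/jane-doe/reviews",  # root-relative href
    "directions": "directions",              # relative to the current path
}


def resolve_tab_urls(page_url, hrefs):
    """Return an absolute URL for each tab href found on page_url."""
    return {name: urljoin(page_url, href) for name, href in hrefs.items()}


tab_urls = resolve_tab_urls(PROFILE_URL, TAB_HREFS)
for name, url in tab_urls.items():
    print(name, url)
```

Both kinds of href end up as full URLs on the same host, which is what each follow-up request needs as its target.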
    
    
