【Question title】: Crawled but not scraped
【Posted】: 2020-11-29 10:36:18
【Question】:

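I am trying to scrape the following site with Scrapy (https://www.leadhome.co.za/search/property-for-sale/western-cape/4?sort=date). The log shows that the page is crawled, but no items are returned. Everything works when I run the same selectors in the Scrapy shell.

Here is my code: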

import scrapy


class LeadHomeSpider(scrapy.Spider):
    name = "lead_home"
    start_urls = [
        'https://www.leadhome.co.za/search/property-for-sale/western-cape/4?sort=date',
    ]

    # parse search page
    def parse(self, response):
        # follow property link
        offering = 'buy' if 'sale' in response.css('h1::text').get() else 'rent'
        for prop in response.css('div.search__PropertyCardWrapper-sc-1j5dndx-0.bsqBpI'):
            link = 'https://www.leadhome.co.za' + prop.css('a::attr(href)').get()
            a = prop.css('p.styles__Label-h53xsw-16.bcSkCI::text').getall()
            #prop_type = attempt_get_property_type(a[0]) if len(a) != 0 else None
            area = a[1] if len(a) > 1 else None

            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'area': area,
                    'offering': offering,
                    #'property_type': prop_type,
                }},
                callback=self.parse_property,
            )

        # follow to next page
        next_page_number = response.xpath(
            '//a[contains(@class, "styles__PageNumber-zln67a-0 jRCKhp")]/following-sibling::a/text()').get()
        if next_page_number is not None:
            new_page_link = 'https://www.leadhome.co.za/search/property-for-sale/western-cape/4?sort=date&page=' + next_page_number
            next_page = response.urljoin(new_page_link)
            yield scrapy.Request(next_page, callback=self.parse)

    # parse property
    def parse_property(self, response):
        item = response.meta.get('item')
        item['parking'] = response.xpath('//p[contains(text(), "Uncovered Parking:")]/following-sibling::p/text()').get()
   
...

Any idea what might be wrong here? Any suggestions are welcome. Thanks in advance!

【Discussion】:

    Tags: python web-scraping scrapy web-crawler


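    【Solution 1】:

    Your CSS expressions rely on random-looking class values (1j5dndx-0.bsqBpI and so on), which is why your code does not work: the selector only matches while the page carries exactly those values. Here is the same code, but using XPath's contains() to match only the stable part of each class: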

    def parse(self, response):
        # follow property link
        offering = 'buy' if 'sale' in response.css('h1::text').get() else 'rent'
        # for prop in response.css('div.search__PropertyCardWrapper-sc-1j5dndx-0.bsqBpI'):
        for prop in response.xpath('//div[contains(@class, "search__PropertyCardWrapper-sc-")]'):
            link = prop.xpath('.//a/@href').get()
            # a = prop.css('p.styles__Label-h53xsw-16.bcSkCI::text').getall()
            prop_type = prop.xpath('(.//p[contains(@class, "styles__Label-")])[1]/text()').get()
            # area = a[1] if len(a) > 1 else None
    
            link = response.urljoin(link)
            yield scrapy.Request(
                url=link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    # 'area': area,
                    'offering': offering,
                    'property_type': prop_type,
                }},
                callback=self.parse_property,
            )
    

    【Discussion】:
