【Question title】: Python - Scrapy splash can't render this page
【Posted】: 2019-01-24 23:14:31
【Question】:

https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=08/16/2018

This is the page I want to scrape. When I open it with SplashRequest, I get a different page with the same source. These are my Splash settings:

ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

My spider code:

import scrapy
from scrapy_splash import SplashRequest

class RealForeclosure(scrapy.Spider):
    name = 'realForeclosure'
    start_urls = [
        'https://www.miamidade.realforeclose.com/index.cfm?zaction=user&zmethod=calendar'
    ]

    def parse(self, response):
        link = 'https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE='
        # Use the dayid of the 11th matching calendar cell as the auction date
        date = response.xpath('//div[@tabindex="0"]/@dayid').extract()[10]
        yield SplashRequest(link + date, callback=self.auction)

    def auction(self, response):
        for i in response.css('.AUCTION_ITEM').extract():
            yield {'item':i}
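
The spider above builds the preview URL by concatenating a string with the date. As a minimal sketch of the same step, assuming the `dayid` attribute holds a date in the `MM/DD/YYYY` form seen in the URL at the top of the question (the page's actual format is not confirmed here), the query string could be assembled with the standard library. `build_preview_url` is a hypothetical helper name, not part of Scrapy or scrapy-splash.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.miamidade.realforeclose.com/index.cfm'


def build_preview_url(auction_date):
    """Build the auction preview URL for a date string (hypothetical helper).

    Assumes auction_date is already in the form the site expects,
    e.g. '08/16/2018'.
    """
    query = urlencode(
        {
            'zaction': 'AUCTION',
            'Zmethod': 'PREVIEW',
            'AUCTIONDATE': auction_date,
        },
        safe='/',  # keep the slashes in the date unescaped, as in the original URL
    )
    return BASE_URL + '?' + query
```

In `parse`, the result of `build_preview_url(date)` would take the place of `link + date`.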

【Comments】:

  • Please post your spider code
  • I added the spider code

Tags: python web-scraping scrapy scrapy-splash


【Solution 1】:

You need some kind of delay to give Splash time to render the result:

# Both script1 and parse belong inside the spider class,
# which is why the script is referenced as self.script1.
script1 = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
        }
    end
"""

def parse(self, response):
    link = 'https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE='
    date = response.xpath('//div[@tabindex="0"]/@dayid').extract()[10]
    yield SplashRequest(
        link + date,
        callback=self.auction,
        endpoint='execute',  # run the Lua script instead of the default render endpoint
        args={
            'html': 1,
            'lua_source': self.script1,
            'wait': 0.5,
        }
    )
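
Note that the script above hard-codes the 0.5 second delay, so the `'wait'` value passed in `args` is not what controls it. As a rough sketch, assuming you would rather tune the delay from Python, the Lua script can read the value through `args.wait`, since entries in the `args` dict are exposed to `main` as fields of its `args` parameter. `LUA_WAIT_SCRIPT`, `build_splash_args`, and `wait_seconds` are hypothetical names used only for illustration.

```python
# Lua script that takes its delay from args.wait instead of a literal.
LUA_WAIT_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {html = splash:html()}
end
"""


def build_splash_args(wait_seconds=0.5):
    """Return an args dict for SplashRequest(..., endpoint='execute', args=...).

    Hypothetical helper; wait_seconds is an assumed parameter name.
    """
    return {
        'lua_source': LUA_WAIT_SCRIPT,
        'wait': wait_seconds,  # read by the script as args.wait
    }
```

If the auction items still do not appear, passing a larger value such as `build_splash_args(2.0)` is a cheap first thing to try before changing the script itself.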
