【Question title】: How to get session_id when using Crawlera lua script in Scrapy Splash?
【Posted】: 2018-11-27 15:13:00
【Question】:

As you know, when we use Scrapy Splash together with Crawlera, we use this Lua script:

function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- Have a look at the file spiders/quotes-js.py to see how to do it.
    -- Find your Crawlera credentials in https://app.scrapinghub.com/
    local user = splash.args.crawlera_user

    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
    })
    assert(splash:wait(3))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end

That Lua script has a session_id variable that I really need, but how can I access it from the response in Scrapy?

I tried response.session_id and response.headers['X-Crawlera-Session'], but neither works.

【Question discussion】:

    Tags: python lua scrapy scrapy-splash crawlera


    【Solution 1】:

    【Discussion】:

      【Solution 2】:
      1. Also return the HAR data from your Lua script (https://splash.readthedocs.io/en/stable/scripting-ref.html#splash-har):
          return {
              html = splash:html(),
              har = splash:har(),
              cookies = splash:get_cookies(),
          }
      
      2. Assuming you are using scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash), make sure your request uses the execute endpoint:

      meta['splash']['endpoint'] = 'execute'

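      If you use scrapy.Request, render.json is the default endpoint, but for scrapy_splash.SplashRequest the default endpoint is render.html. See the two examples at the following link for how to set the endpoint: https://github.com/scrapy-plugins/scrapy-splash#requests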
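
      As a minimal sketch of the endpoint setting described above, the helper below builds the meta dict for a plain scrapy.Request. The function name build_splash_meta and the placeholder values are assumptions made for illustration; the 'crawlera_user' key is the one the Lua script reads via splash.args.crawlera_user.

```python
def build_splash_meta(lua_source, crawlera_user):
    """Build a Scrapy ``meta`` dict that sends the request to Splash's execute endpoint.

    ``build_splash_meta`` is a hypothetical helper, not part of scrapy-splash.
    """
    return {
        'splash': {
            # Only the execute endpoint runs a Lua script and returns its table.
            'endpoint': 'execute',
            'args': {
                # The Lua script to run (the one that returns html, har, cookies).
                'lua_source': lua_source,
                # Exposed to the script as splash.args.crawlera_user.
                'crawlera_user': crawlera_user,
            },
        }
    }


# Placeholder values, for illustration only.
meta = build_splash_meta('function main(splash) return {har = splash:har()} end',
                         'YOUR_CRAWLERA_API_KEY')
```

      The resulting dict would be passed as scrapy.Request(url, callback=self.parse, meta=meta). With scrapy_splash.SplashRequest the same settings go in its endpoint and args arguments instead.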

      3. Only then can you access the X-Crawlera-Session header in the parse method:
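          import json  # at module level

          def parse(self, response):
              # The execute endpoint returns the Lua table as JSON; read the first HAR entry's response headers.
              headers = json.loads(response.text)['har']['log']['entries'][0]['response']['headers']
              session_id = next(x for x in headers if x['name'] == 'X-Crawlera-Session')['value']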
      
      >>> headers = json.loads(response.text)['har']['log']['entries'][0]['response']['headers']
      >>> next(x for x in headers if x['name'] == 'X-Crawlera-Session')
      {u'name': u'X-Crawlera-Session', u'value': u'2124641382'}
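
      The lookup above raises an exception if the first HAR entry has no X-Crawlera-Session header. A more defensive variant, sketched here under the assumption that the response body has the same JSON shape as shown above, scans every entry and returns None when the header is missing. The helper name extract_session_id and the sample data are made up for illustration.

```python
import json

SESSION_HEADER = 'X-Crawlera-Session'


def extract_session_id(body_text, header_name=SESSION_HEADER):
    """Return the first value of ``header_name`` found in the HAR log, or None."""
    data = json.loads(body_text)
    entries = data.get('har', {}).get('log', {}).get('entries', [])
    wanted = header_name.lower()
    for entry in entries:
        for header in entry.get('response', {}).get('headers', []):
            # HTTP header names are case-insensitive, so compare in lower case.
            if header.get('name', '').lower() == wanted:
                return header.get('value')
    return None


# Hand-written sample mimicking the JSON returned by the Lua script.
sample = json.dumps({
    'html': '<html></html>',
    'har': {'log': {'entries': [
        {'response': {'headers': [{'name': 'Content-Type', 'value': 'text/html'}]}},
        {'response': {'headers': [{'name': 'X-Crawlera-Session', 'value': '2124641382'}]}},
    ]}},
})
session_id = extract_session_id(sample)
```

      Inside a spider this would be called as extract_session_id(response.text) in the parse method.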
      

      【Discussion】:
