【Question title】: How can I yield the current response URL in scrapy_splash
【Posted】: 2020-01-23 11:18:42
【Question description】:

If I try to use response.request.url in my parse() method to yield the URL, it returns:

http://192.168.99.100:8050/execute

Returning the URL from the Lua script works, but I don't know how to yield it in the parse() method.

import scrapy
from scrapy_splash import SplashRequest

class ComputersSpider(scrapy.Spider):
    name = 'computers'
    allowed_domains = ['daraz.pk']  # domain names only, no scheme
    start_urls = ['http://daraz.pk']

    script = ''' 
    function main(splash, args)
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(1))
    input = assert(splash:select("#q"))
    input:focus()
    input:send_text("computers")

    button = assert(splash:select(".search-box__button--1oH7"))
    button:mouse_click()
    assert(splash:wait(6))
    splash:set_viewport_full()
    return {
        html = splash:html(),
        link = splash:url(),  -- "I WANT TO YIELD THIS THING IN THE PARSE() METHOD"
    }
end '''

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='execute',
                args={'wait': 3, 'lua_source': self.script},
            )

    def parse(self, response):
        # Returns the Splash endpoint (http://192.168.99.100:8050/execute), not the page URL.
        link = response.request.url
        yield {
            'URL': link,
        }

I also tried response.url, but it returns the start URL.

【Comments】:

    Tags: web-scraping scrapy scrapy-splash


    【Solution 1】:

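    The problem was solved by renaming the link key to url in the table returned by the Lua script. When the result of the execute endpoint contains a url key, scrapy-splash uses its value as response.url: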

    return {
        html = splash:html(),
        url = splash:url(),  -- current page URL after the search; becomes response.url
    }
    

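    Then yield response.url from the parse method, wrapped in a dict, since a Scrapy callback must yield items or requests rather than a bare string:

    yield {'URL': response.url}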
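If you would rather keep the original link key in the Lua table, the value should also be reachable through the decoded JSON result. The following is only a minimal sketch: it assumes the response is a scrapy-splash JSON response, which exposes the script's return table as response.data. The helper name final_url, the stand-in response object, and the sample search URL are hypothetical and exist only so the logic can be run without Splash.

```python
from types import SimpleNamespace


def final_url(response, key="link"):
    """Return the URL reported by the Lua script, or response.url as a fallback."""
    # Assumption: a scrapy-splash JSON response exposes the decoded result
    # of the Lua script as `response.data`. Other responses have no such
    # attribute, hence the getattr default.
    data = getattr(response, "data", None)
    if isinstance(data, dict) and data.get(key):
        return data[key]
    return response.url


def parse(response):
    # Scrapy callbacks must yield items (e.g. dicts) or Requests, not bare strings.
    yield {"URL": final_url(response)}


# Hypothetical stand-in for a Splash response, used only to exercise the helper offline.
fake_response = SimpleNamespace(
    url="http://daraz.pk",
    data={
        "html": "<html></html>",
        "link": "https://www.daraz.pk/catalog/?q=computers",  # made-up example value
    },
)
items = list(parse(fake_response))
```

Inside the spider, parse would be a method (def parse(self, response)) and the stand-in object would not be needed. Renaming the key to url, as in the answer above, is shorter; reading response.data is mainly useful when the Lua script returns several values.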
    

    【Discussion】:
