[Question title]: Why is the same URL scraped twice instead of two different start_urls?
[Posted]: 2022-01-22 13:02:21
[Question]:

I have the following spider:

from scrapy import Spider


class SpiderOpTest(Spider):
    
    name = "test"
    start_urls = [
        "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/#/page/2/",
        "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires-2012/results/#/page/2/",
    ]
    custom_settings = {
        "USER_AGENT": "*",
        "LOG_LEVEL": "WARNING",
        "DOWNLOADER_MIDDLEWARES": {'scraper_scrapy.odds.middlewares.SeleniumMiddleware': 543},
    }
    httperror_allowed_codes = [301]

    def parse(self, response):
        print(f"Parsing tournament page - {response.url}")

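When I run it, the printed output shows that the first URL in start_urls is scraped twice. Why is that?

Since key parts of the page are loaded via JavaScript, it may be useful to include the Selenium middleware I am using: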

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

    def spider_opened(self, spider):
        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Firefox(options=options)

    def spider_closed(self, spider):
        # quit() ends the whole browser session; close() only closes the current window.
        self.driver.quit()

[Comments]:

    Tags: python web-scraping scrapy


    [Solution 1]:

    Your problem appears to be this line: self.driver.current_url. The driver's URL is set to the first URL and never updated. I think you should use request.url on that line instead:

        def process_request(self, request, spider):
            self.driver.get(request.url)
            return HtmlResponse(
                request.url,
                body=self.driver.page_source,
                encoding='utf-8',
                request=request,
            )
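
The effect of that one-argument change can be illustrated without Scrapy or Selenium. The following is only a minimal sketch: StubDriver and StubResponse are hypothetical stand-ins (not real Selenium or Scrapy classes), and the stub deliberately assumes the behaviour described in this answer, namely that current_url keeps reporting the first page. A real driver may behave differently.

```python
from dataclasses import dataclass


@dataclass
class StubResponse:
    """Hypothetical stand-in for HtmlResponse: it only keeps the URL label."""
    url: str


class StubDriver:
    """Hypothetical stand-in for the Selenium driver.

    Assumption: current_url stays at the first page loaded, mirroring the
    behaviour described in the answer. This is not how Selenium is specified.
    """

    def __init__(self):
        self.current_url = None

    def get(self, url):
        # Only the first navigation updates current_url in this stub.
        if self.current_url is None:
            self.current_url = url


def label_with_current_url(driver, request_url):
    """Original middleware logic: the response is labelled by the driver."""
    driver.get(request_url)
    return StubResponse(driver.current_url)


def label_with_request_url(driver, request_url):
    """Suggested fix: the response is labelled by the request itself."""
    driver.get(request_url)
    return StubResponse(request_url)


START_URLS = [
    "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires/results/#/page/2/",
    "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires-2012/results/#/page/2/",
]

before = [label_with_current_url(StubDriver(), u).url for u in START_URLS]

shared = StubDriver()
stale = [label_with_current_url(shared, u).url for u in START_URLS]
fixed = [label_with_request_url(StubDriver(), u).url for u in START_URLS]

print(stale)  # both entries carry the first URL when one driver is shared
print(fixed)  # each entry carries its own request URL
```

Since parse() prints response.url, whatever string is passed as the first argument to the response is what ends up in the output, so labelling by request.url keeps the two start URLs distinguishable even if the driver reports a stale address.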
    

    [Discussion]:

    • Brilliant, thanks a million. It has been driving me crazy! I will award the bounty as soon as I am allowed to...