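【Question title】: Scrapy CrawlerProcess doesn't use proxy
【Posted】: 2021-11-14 08:39:38
【Question description】:

I created a spider that uses Scrapy, Splash, and a proxy.

When I run just one spider from the command line, everything works. However, when I run it through CrawlerProcess, the spider does not use the proxy, which quickly gets it banned.

Spider code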

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from my_fake_useragent import UserAgent
ua = UserAgent()


class AdsSpiderSpider2(scrapy.Spider):
    name = 'ads_spider'
    start_urls = ['https://enqpothya3f4tgj.m.pipedream.net' ]

    scritp = '''function main(splash, args)
            splash:on_request(function(request)
                request:set_proxy{
                host = "pl.smartproxy.com",
                port = xxxx,
                username = xxxx,
                password = xxxx,
                type = "HTTP"
                }
            end
            )
            assert(splash:go(args.url))
            assert(splash:wait(0.5))
            
            return {
                html = splash:html(),
                png = splash:png(),
                har = splash:har(),
            }
            end
    '''

    def start_requests(self):
        for url in self.start_urls:
            print(url)
            # Send the request to Splash's "execute" endpoint so the Lua
            # script above (which sets the proxy per request) is run.
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                args={
                    'wait': 1,
                    'lua_source': self.scritp,
                    'js_source': 'document.body',
                    'proxy': 'http://[user:password]@pl.smartproxy.com:[xxxx]',
                },
                headers={
                    'User-Agent': ua.random(),
                    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                    'accept-language': 'pl,en;q=0.9,en-GB;q=0.8,en-US;q=0.7',
                },
            )

    def parse(self, response):
        print("x")

Terminal

scrapy crawl ads_spider

CrawlerProcess

However, when I run it through CrawlerProcess, the spider does not use the proxy, which quickly gets it banned.

from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(AdsSpiderSpider2)
process.start()

settings.py

BOT_NAME = 'xxxx'

SPIDER_MODULES = ['xxxxx']
NEWSPIDER_MODULE = 'xxxxx'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

Why does running the spider through CrawlerProcess stop it from using the proxy?

【Question discussion】:

    Tags: proxy scrapy scrapy-splash


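    【Solution 1】:

    You need to pass the settings object explicitly to the CrawlerProcess constructor. A bare CrawlerProcess() starts with Scrapy's default settings, so the project's settings.py (including the scrapy_splash middlewares) is not applied. That is:

    1. Add this import to the spider file: from scrapy.utils.project import get_project_settings
    2. Change the line process = CrawlerProcess() to process = CrawlerProcess(settings=get_project_settings())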

    【Discussion】:
