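【Posted】: 2021-11-14 08:39:38
【Question】:
I built a spider that uses Scrapy, Splash, and a proxy.
When I run just the one spider with scrapy crawl, everything works. But when I run it through CrawlerProcess, the spider does not use the proxy, which gets it banned quickly.
Spider code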
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess
from my_fake_useragent import UserAgent

ua = UserAgent()


class AdsSpiderSpider2(scrapy.Spider):
    name = 'ads_spider'
    start_urls = ['https://enqpothya3f4tgj.m.pipedream.net']

    scritp = '''function main(splash, args)
        splash:on_request(function(request)
            request:set_proxy{
                host = "pl.smartproxy.com",
                port = xxxx,
                username = xxxx,
                password = xxxx,
                type = "HTTP"
            }
        end
        )
        assert(splash:go(args.url))
        assert(splash:wait(0.5))
        return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
        }
    end
    '''

    def start_requests(self):
        for url in self.start_urls:
            print(url)
            yield SplashRequest(url, self.parse,
                endpoint='execute',
                args={
                    'wait': 1,
                    'lua_source': self.scritp,
                    'js_source': 'document.body',
                    'proxy' : 'http://[user:password]@pl.smartproxy.com:[xxxx]'
                },
                headers = {
                    'User-Agent' : ua.random(),
                    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                    'accept-language': 'pl,en;q=0.9,en-GB;q=0.8,en-US;q=0.7',
                }
            )

    def parse(self, response):
        print("x")
Terminal
scrapy crawl ads_spider
CrawlerProcess
Run this way, however, the spider does not use the proxy and is banned quickly.
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(AdsSpiderSpider2)
process.start()
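One thing worth checking in the snippet above: a CrawlerProcess created with no arguments starts from Scrapy's default settings and does not read the project's settings.py, so the Splash middlewares and SPLASH_URL may never be loaded. In that case the request would likely go straight to the target site and the Lua script that sets the proxy would never run. The following is a minimal sketch, not a confirmed fix, that loads the project settings with get_project_settings() and refuses to start if Splash is not configured. The names splash_enabled and run_spider are hypothetical helpers introduced for illustration, and the sketch assumes the script is run from inside the Scrapy project (where scrapy.cfg can be found).

```python
def splash_enabled(settings):
    """Return True if the settings (a dict or a Scrapy Settings object)
    define SPLASH_URL and register scrapy_splash.SplashMiddleware."""
    middlewares = settings.get("DOWNLOADER_MIDDLEWARES") or {}
    has_url = bool(settings.get("SPLASH_URL"))
    return has_url and "scrapy_splash.SplashMiddleware" in middlewares


def run_spider(spider_cls):
    """Hypothetical helper: run one spider with the project's settings.py applied."""
    # Imported lazily so splash_enabled() can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Assumption: the working directory is inside the project, so that
    # get_project_settings() can locate settings.py.
    settings = get_project_settings()
    if not splash_enabled(settings):
        raise RuntimeError("Splash settings were not loaded; check the working directory")

    process = CrawlerProcess(settings)
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl has finished
```

Calling run_spider(AdsSpiderSpider2) in place of the four lines above would then either run the spider with the Splash settings applied or fail immediately with a clear error, rather than silently crawling without the proxy.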
settings.py
BOT_NAME = 'xxxx'
SPIDER_MODULES = ['xxxxx']
NEWSPIDER_MODULE = 'xxxxx'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Why does running the spider through CrawlerProcess stop it from using the proxy?
【Discussion】:
Tags: proxy scrapy scrapy-splash