How to pause a spider in Scrapy when blocked: answers

【Question title】: How to pause spider in Scrapy when blocked
【Posted】: 2018-11-26 14:17:53
【Question description】:

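I'm working on a Scrapy project that runs on an intranet server. I have to set a proxy just to reach the outside network, so I can't use the proxy approach (rotating IPs) to avoid being banned.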

middlewares.py:

import os

class SetProxy(object):
    def process_request(self, request, spider):
        # Route every request through the proxy named in the HTTP_PROXY environment variable
        request.meta['proxy'] = os.getenv('HTTP_PROXY')

My target site returns a 200 status even when it has blocked me, so the only way to tell whether I'm blocked is to inspect the response content.

mySpider.py:

def parse(self, response):

    block_msg1 = "FOR SECURITY REASONS, THIS PAGE CAN NOT BE ACCESSED!"
    block_msg2 = "Overrun"

    # response.text is the decoded body; str(response.body) would be the repr of a bytes object
    blocked = block_msg1 in response.text or block_msg2 in response.text

    # not banned
    if not blocked:
        # ...... (item extraction elided in the original post)
        yield item

    # banned
    else:

        # I want to pause Scrapy (stop sending requests but not stop pipelines' work) for a while here but I don't know how

        yield scrapy.Request(url=response.url, headers=sub_headers, callback=self.parse_sub)

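When I detect in parse_sub() that I've been banned, how can I make Scrapy stop sending requests for a while without stopping the pipelines, and then resume after n minutes?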

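【Question comments】:

Hmm... I think there is a self.crawler.pause()/unpause() available... but I don't remember whether you need to run the crawl with a specific setting to keep the job history...
A simple time.sleep would work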
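
The pause/unpause idea from the comments could be sketched roughly as follows. This is an illustration under assumptions, not a confirmed answer from the thread: as far as I know the methods live on the engine (`self.crawler.engine.pause()` and `self.crawler.engine.unpause()`) rather than on the crawler itself, so check this against your Scrapy version. `CrawlPauser` is a hypothetical helper name. It takes any object with `pause()`/`unpause()` and any scheduler with the call shape of Twisted's `reactor.callLater(delay, func)`, which keeps the sketch runnable without Scrapy installed. Note that `time.sleep` inside a callback blocks the Twisted reactor, so it would likely stall the item pipelines as well, which is what the question wants to avoid.

```python
class CrawlPauser:
    """Hypothetical helper: pause an engine-like object and resume it later."""

    def __init__(self, engine, call_later):
        # engine: assumed to expose pause() and unpause(),
        #         e.g. self.crawler.engine inside a spider
        # call_later: assumed to have the shape call_later(seconds, func),
        #             e.g. twisted.internet.reactor.callLater
        self.engine = engine
        self.call_later = call_later
        self.paused = False

    def pause_for(self, seconds):
        """Pause now and schedule a resume; ignore calls while already paused."""
        if self.paused:
            return False
        self.paused = True
        self.engine.pause()
        self.call_later(seconds, self._resume)
        return True

    def _resume(self):
        self.engine.unpause()
        self.paused = False
```

Inside the spider's "banned" branch this might be used as `CrawlPauser(self.crawler.engine, reactor.callLater).pause_for(n * 60)` with `from twisted.internet import reactor`, keeping one pauser instance per spider so that several blocked responses arriving together do not schedule several resumes. The retry request for the same URL would probably also need `dont_filter=True`, since the default duplicate filter would otherwise drop it.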

Tags: python scrapy web-crawler scrapy-spider

【Solution 1】:

You have a few options to address your problem:

Change the download delay and/or the number of concurrent requests in the settings
Use Scrapy's AutoThrottle extension
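
For illustration, the two options above might look like this in the project's settings.py. The setting names are Scrapy's own; the values are placeholder assumptions, not figures recommended in the answer, and would need tuning against the target site. Keep in mind that AutoThrottle adapts the delay to response latency, so on its own it is unlikely to notice a block page that is served with a 200 status.

```python
# settings.py (illustrative values only)

# Option 1: slow the crawl down statically
DOWNLOAD_DELAY = 5                  # seconds between consecutive requests to the same site
CONCURRENT_REQUESTS = 1             # overall cap on simultaneous requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # cap per target domain

# Option 2: let the AutoThrottle extension adjust the delay dynamically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60             # upper bound on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests to aim for
```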

【Comments】: