Unable to get rid of some error raised by process_exception

【Question title】: Unable to get rid of some error raised by process_exception
【Posted】: 2021-01-07 10:59:32
【Question】:

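I'm trying not to show/get some errors thrown by scrapy within process_exception in RetryMiddleware. These are the errors the script encounters once the max retry limit is exceeded. I use proxies within the middleware. The weird thing is that the exception the script throws is already in the EXCEPTIONS_TO_RETRY list. It is completely okay if the script sometimes exhausts the max number of retries without any success. However, I just don't want to see that error even when it occurs, meaning I want to suppress or bypass it.

The error looks like this: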

Traceback (most recent call last):
  File "middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

This is what process_exception looks like in the RetryMiddleware:

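from scrapy.core.downloader.handlers.http11 import TunnelError
from twisted.internet import defer
from twisted.internet.error import (
    ConnectError, ConnectionDone, ConnectionLost, ConnectionRefusedError,
    DNSLookupError, TCPTimedOutError, TimeoutError,
)
from twisted.web.client import ResponseFailed


class RetryMiddleware(object):
    cus_retry = 3  # maximum number of custom retries per request
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
        ConnectionRefusedError, ConnectionDone, ConnectError,
        ConnectionLost, TCPTimedOutError, TunnelError, ResponseFailed)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('cus_retry', 0) + 1
        if retries <= self.cus_retry:
            r = request.copy()
            r.meta['cus_retry'] = retries
            r.meta['proxy'] = f'https://{ip:port}'  # placeholder proxy address, left as posted
            r.dont_filter = True
            return r
        else:
            # Falls through and returns None once the retry limit is exceeded
            print("done retrying")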

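How can I get rid of the errors listed in EXCEPTIONS_TO_RETRY?

PS: The script hits this error whenever the max retry limit is reached, no matter which site I choose.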

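【Question comments】:

Does the timeout not happen if you disable that middleware?
No, it doesn't happen when I disable the custom middleware, @Gallaecio.
What happens if you use UserAgent() directly in your spider and set the User-Agent from there? Does that work, or does it time out?

Tags: python python-3.x web-scraping scrapy middleware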

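【Solution 1】:

The problem may not be on your side; the third-party site itself may be at fault. There may be a connection error on their server, or the site may be secured so that no one can access it.

The error itself says that the remote party is either down or not responding properly, so first check whether the third-party site works when you request it. If you can, try contacting them.

The error is not on your end but on the other party's end, as the message says.

This question is similar to Scrapy - Set TCP Connect Timeout

【Comments】: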

【Solution 2】:

When the max retries are reached, a method like parse_error() should handle any error, provided it exists in your spider:

def start_requests(self):
    for start_url in self.start_urls:
        yield scrapy.Request(start_url,errback=self.parse_error,callback=self.parse,dont_filter=True)

def parse_error(self, failure):
    # print(repr(failure))
    pass
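If silently passing on every failure feels too broad, one option is to stay quiet only for the failure types you expect and still report everything else. The helper below is a minimal sketch of that idea; the name log_failure_quietly and its parameters are hypothetical and not part of Scrapy, and it assumes the failure object behaves like a Twisted Failure (a check() method and a value attribute).

```python
import logging

logger = logging.getLogger(__name__)


def log_failure_quietly(failure, expected_types):
    """Log expected failures at DEBUG and anything else at ERROR.

    `failure` is assumed to behave like twisted.python.failure.Failure:
    `check(*types)` returns the matching type or None, and `value` holds
    the underlying exception. Returns True if the failure was expected.
    """
    if failure.check(*expected_types):
        # Expected once retries are exhausted: keep it out of normal output
        logger.debug("Gave up after retries: %r", failure.value)
        return True
    # Anything else is still worth seeing
    logger.error("Unexpected download failure: %r", failure.value)
    return False
```

Inside the spider, parse_error could then be reduced to a call such as log_failure_quietly(failure, (TCPTimedOutError, TimeoutError)), with both classes imported from twisted.internet.error. Messages logged at DEBUG disappear from the output once LOG_LEVEL is set to INFO or higher.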

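However, I'd like to suggest a completely different approach here. If you go the following route, you don't need any custom middleware at all. Everything, including the retry logic, already lives in the spider.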

import scrapy


class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "some url",
    ]

    proxies = []  # list of proxies here
    max_retries = 5
    retry_urls = {}  # failed URL -> number of failures seen so far

    def parse_error(self, failure):
        # Called as the errback whenever a request fails
        proxy = f'https://{ip:port}'  # placeholder proxy address, left as posted
        retry_url = failure.request.url
        if retry_url not in self.retry_urls:
            self.retry_urls[retry_url] = 1
        else:
            self.retry_urls[retry_url] += 1

        if self.retry_urls[retry_url] <= self.max_retries:
            # Re-issue the request, keeping the same callback and errback
            yield scrapy.Request(retry_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)
        else:
            print("gave up retrying")

    def start_requests(self):
        for start_url in self.start_urls:
            proxy = f'https://{ip:port}'  # placeholder proxy address, left as posted
            yield scrapy.Request(start_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)

    def parse(self, response):
        for item in response.css().getall():  # selector left blank in the original post
            print(item)
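The per-URL counting done in parse_error can also be pulled out into a small helper, which makes the limit easy to check in isolation. This is only a sketch: the RetryBudget class and its allow() method are hypothetical names, not something Scrapy provides. It keeps the counter on the instance rather than in a class-level dict, which would otherwise be shared by every instance of the spider class.

```python
from collections import Counter


class RetryBudget:
    """Track failed attempts per URL (hypothetical helper, not a Scrapy API)."""

    def __init__(self, max_retries):
        self.max_retries = max_retries
        self._attempts = Counter()  # url -> failures recorded so far

    def allow(self, url):
        """Record one more failure for `url`; return True while a retry is still allowed."""
        self._attempts[url] += 1
        return self._attempts[url] <= self.max_retries
```

With an instance created in the spider, for example budget = RetryBudget(5), parse_error would yield a new request only while budget.allow(failure.request.url) is true.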

Don't forget to add the following lines to the spider class to get the result described above:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    }
}
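As a possible alternative, and assuming the only goal is to switch off Scrapy's built-in retries, the RETRY_ENABLED setting can be turned off for the spider instead. This is a sketch and not something taken from the answer:

```python
# Sketch: disable the built-in retry middleware through its own switch
# instead of removing it from DOWNLOADER_MIDDLEWARES.
custom_settings = {
    'RETRY_ENABLED': False,
}
```

Either form leaves the retry decisions to the errback in the spider.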

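By the way, I'm using Scrapy 2.3.0.

【Comments】:

【Solution 3】:

Try fixing the code of the scraper itself. Sometimes a badly written parse function causes the kind of error you describe. Once I fixed my code, the error went away.

【Comments】: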