Timing out when I use a proxy (answers)

【Question title】: Timing out when I use a proxy
【Posted】: 2016-05-27 00:01:07
【Question】:

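I'm trying to add a proxy to my web crawler. Without the proxy, my code connects to the website with no problem, but as soon as I add a proxy it can no longer connect! Nobody seems to have posted about this problem under python-requests, so I hope you can help me out!

Background: I'm on a Mac, using Anaconda's Python 3.4 in a virtual environment.

This is my code, which works without a proxy: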

import requests
from bs4 import BeautifulSoup

proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1

    titles_list = []
    url_list = []
    url_keys = []

    # Walk the paginated "cited by" listing for this article
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/'+str(pmid)+'/citedby/?page='+str(start)

        req = requests.get(url)  # this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")

        # Each result title sits in a <div class="title">
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)

            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey)   # url = base + key
                url = "http://www.ncbi.nlm.nih.gov"+str(urlkey)
                url_list.append(url)

        start += 1
    return titles_list, url_list

According to other posts I've been reading, I should be able to replace this:

req = requests.get(url)

with this:

req = requests.get(url, proxies=proxyDict, timeout=2)

But this doesn't work! :( If I run it with this line of code, the terminal gives me a timeout error:

socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError:       (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')

The terminal then prints several more of these. The tracebacks differ, but the error is the same:

 During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
 File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))

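Why does adding a proxy to my code suddenly make it time out? I tried it on several random URLs and the same thing happened, so this looks like a problem with the proxy rather than with my code. However, I have to use a proxy now, so I need to get to the root of this and fix it. I have also tried several different IP addresses for the proxy, taken from the VPN I use, so I know those IP addresses are valid.

Thank you very much for your help!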

【Comments on the question】:

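Have you tried extending the timeout beyond 2 seconds? Traffic going through a proxy may take longer than that to come back, which would give you the error you're seeing.
@KerryM-R If I change it to 20 seconds, it still times out. How long were you thinking?
That should be long enough. Have you confirmed that the proxy (10.10.1.10) is responding correctly?
@KerryM-R I actually don't know how to do that. I took that proxy from a post on Stack Overflow, but none of the other proxy IP addresses I've used have worked either. In this case the "proxy" is just an IP address, isn't it?
Ah, in that case 10.10.1.10 is probably just the example from the requests docs here: docs.python-requests.org/en/master/user/advanced/#proxies. Since it is only acting as an HTTP proxy, you should be able to test the connection to it from your browser: use the same details and see whether it is operational.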
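
The check suggested in the comments (is the proxy responding at all?) can also be done from Python before involving requests. The following is only a minimal sketch under stated assumptions: the helper name `proxy_accepts_connections` is hypothetical and not part of requests, and a successful TCP connection only shows that something is listening on that host and port, not that it behaves as a usable HTTP proxy.

```python
import socket
from urllib.parse import urlsplit


def proxy_accepts_connections(proxy_url, timeout=5.0):
    """Return True if a TCP connection to the proxy's host and port succeeds.

    Hypothetical helper for illustration; it does not send any HTTP request.
    """
    parts = urlsplit(proxy_url)
    if parts.hostname is None or parts.port is None:
        raise ValueError("expected a URL such as http://host:port")
    try:
        # Only the TCP handshake is attempted here.
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

Given the `ConnectTimeoutError` in the traceback, a call such as `proxy_accepts_connections('http://10.10.1.10:3128', timeout=2)` would presumably return `False` on the asker's machine, which points at the proxy address rather than at the crawler code.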

Tags: python proxy python-requests

【Solution 1】:

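It looks like you need to use an http or https proxy that is actually able to respond to requests.

The 10.10.1.10:3128 in your code appears to come from the example in the requests documentation.

Taking a proxy from the list at http://proxylist.hidemyass.com/search-1291967 (which may not be the best source), your proxyDict should look like this: {'http' : 'http://209.242.141.60:8080'}

Testing it on the command line seems to work fine: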

>>> proxies = {'http' : 'http://209.242.141.60:8080'}
>>> requests.get('http://google.com', proxies=proxies)
<Response [200]>

【Comments】:

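That worked! Although when I put that proxy into my web browser, it takes me to some login page. Is that what I should generally expect when verifying that a proxy works? Thank you so much!
The proxy should be working as long as it doesn't redirect your traffic somewhere else (which may be undesirable), or as long as your IP shows up as different when you google "what's my ip". In the long run, it's best to set up your own proxy or use a paid one.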
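
Because proxies taken from a public list can stop responding at any time, it may help to try several candidates and keep the first one that answers. This is a rough sketch rather than a definitive implementation: `first_working_proxy` and its `fetch` parameter are hypothetical names introduced here, the `(5, 15)` connect/read timeout pair is an arbitrary choice, and mapping the `https` key as well goes beyond the question's `proxyDict`. The `fetch` argument defaults to `requests.get` and exists only so the loop can be exercised without a network.

```python
def first_working_proxy(candidates, test_url, fetch=None, timeout=(5, 15)):
    """Return the first proxy URL in `candidates` that yields HTTP 200, else None.

    Hypothetical helper for illustration. `fetch` must accept the same
    `proxies` and `timeout` keyword arguments as requests.get.
    """
    if fetch is None:
        import requests  # imported lazily so a stand-in can be passed instead
        fetch = requests.get

    for proxy_url in candidates:
        # Route both plain and TLS traffic through the candidate (an assumption;
        # the question only maps 'http').
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            response = fetch(test_url, proxies=proxies, timeout=timeout)
        except OSError:
            # requests.exceptions.RequestException derives from IOError (OSError),
            # so connect timeouts and proxy errors end up here.
            continue
        if response.status_code == 200:
            return proxy_url
    return None
```

With the addresses from this thread, `first_working_proxy(['http://10.10.1.10:3128', 'http://209.242.141.60:8080'], 'http://google.com')` would skip the documentation example once its connection times out and move on to the next candidate. Whether the second address still responds today is not something this thread can confirm.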