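【Posted】: 2016-05-27 00:01:07
【Question】:
I'm trying to add a proxy to my web crawler. Without the proxy my code connects to the website with no problem, but as soon as I add a proxy it can no longer connect! It looks like nobody has posted about this problem with python-requests, so I hope you can help me out!
Background: I'm on a Mac, using Anaconda's Python 3.4 in a virtual environment.
Here is my code, which works without the proxy: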
import requests
from bs4 import BeautifulSoup

proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1
    titles_list = []
    url_list = []
    url_keys = []
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/'+str(pmid)+'/citedby/?page='+str(start)
        req = requests.get(url) #this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)
            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey) #url = base + key
                url = "http://www.ncbi.nlm.nih.gov"+str(urlkey)
                url_list.append(url)
        start += 1
    return titles_list, url_list  # authors_list is never defined in this snippet
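According to the other posts I've been reading, I should be able to replace this:
req = requests.get(url)
with this:
req = requests.get(url, proxies=proxyDict, timeout=2)
But this doesn't work! :( If I run it with that line, the terminal gives me a timeout error: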
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')
The terminal then prints several more of these, with different tracebacks but the same error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))
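Why does adding a proxy to my code suddenly make it time out? I tried it on several random URLs and the same thing happened, so this seems to be a problem with the proxy rather than with my code. However, I have to use a proxy now, so I need to get to the root of this and fix it. I have also tried several different IP addresses for the proxy, taken from a VPN I use, so I know those IP addresses are valid.
Thank you very much for your help!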
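One reading of the traceback above: the timeout is reported for 10.10.1.10 (the proxy), not for www.ncbi.nlm.nih.gov, so the request appears to fail before it ever reaches the target site. Below is a minimal sketch, assuming `requests` is installed, of how the call could be wrapped so that proxy-side failures are reported separately from other errors. The helper name `fetch_via_proxy` and its return shape are hypothetical and not part of the original code.

```python
import requests


def fetch_via_proxy(url, proxies, timeout=5):
    """Return (response, None) on success, or (None, reason) on failure.

    Hypothetical helper: it only labels the failure, it does not retry.
    """
    try:
        response = requests.get(url, proxies=proxies, timeout=timeout)
    except requests.exceptions.ConnectTimeout as exc:
        # No TCP connection could be opened in time. With a proxy configured,
        # the connection being opened is the one to the proxy itself.
        return None, "connect timeout: {}".format(exc)
    except requests.exceptions.ProxyError as exc:
        # The proxy refused the connection or could not be used.
        return None, "proxy error: {}".format(exc)
    except requests.exceptions.RequestException as exc:
        # Any other requests-level failure (read timeout, bad URL, ...).
        return None, "request failed: {}".format(exc)
    return response, None
```

Calling `fetch_via_proxy(url, proxyDict, timeout=2)` inside the `while` loop would then give a short reason string instead of a long traceback when the proxy cannot be reached.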
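【Comments】:
-
Have you tried extending the timeout beyond 2 seconds? Traffic through a proxy may take longer to come back, which would give you the error you're seeing.
-
@KerryM-R If I change it to 20 seconds, it still times out. How long were you thinking?
-
That should be long enough. Have you confirmed that the proxy (10.10.1.10) is responding correctly?
-
@KerryM-R I actually don't know how to do that? In fact, I took that proxy from a post on Stack Overflow, but none of the other proxy IP addresses I've used worked either. In this case a "proxy" is just an IP address, isn't it?
-
Ah, in that case 10.10.1.10 is probably just the example from the requests docs here: docs.python-requests.org/en/master/user/advanced/#proxies. Since it just acts as an HTTP proxy, you should be able to test the connection to it with a browser, using the same details, and see whether it is operational.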
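As an alternative to the browser test suggested in the comments, a rough check from Python is to see whether anything accepts a TCP connection at the proxy's host and port. The sketch below uses only the standard library; the function name `proxy_is_reachable` is hypothetical. A successful connection only shows that something is listening there, not that it is a working HTTP proxy.

```python
import socket
from urllib.parse import urlsplit


def proxy_is_reachable(proxy_url, timeout=3.0):
    """Return True if a TCP connection to the proxy's host:port succeeds."""
    parts = urlsplit(proxy_url)
    host, port = parts.hostname, parts.port
    if host is None or port is None:
        raise ValueError("proxy URL needs a host and a port: {!r}".format(proxy_url))
    try:
        # create_connection raises OSError (including socket.timeout)
        # when the host is unreachable, refuses, or does not answer in time.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the proxy in the question this would be called as `proxy_is_reachable('http://10.10.1.10:3128')`. A result of `False` would be consistent with the `ConnectTimeoutError` in the traceback.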
Tags: python proxy python-requests