【Posted】: 2016-01-06 03:08:51
【Question】:
I'm trying to download the HTML of a page (in this case http://www.guangxindai.com), but I get a 403 error. Here is my code:
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()
But I get an error response:
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    f = opener.open("http://www.guangxindai.com")
  File "C:\Python33\lib\urllib\request.py", line 475, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 587, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 513, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
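I've tried different request headers but still can't get a proper response. I can view the page in a browser, which seems strange to me. My guess is that the site uses some method to block web spiders. Does anyone know what is going on? How can I fetch the page's HTML correctly?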
【Comments】:
-
From the information provided, all we can infer is what the RFC says:
403 Forbidden The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead. (see here)
-
However, Wikipedia (here) has a list of "sub-codes"; I'm not sure whether urllib lets you inspect them.
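The RFC text quoted above says the server SHOULD describe the reason for the refusal in the response entity, so one thing worth trying is to catch the exception and look at what the server actually sent. The following is a minimal sketch, not a fix for this particular site: `describe_http_error` and `fetch_or_explain` are hypothetical helper names introduced here, and the 10-second timeout and 2000-character body limit are arbitrary assumptions. As far as I know, urllib exposes only the numeric status (`err.code`) and the reason phrase, so any finer-grained detail would have to come from the headers or the body.

```python
import urllib.error
import urllib.request


def describe_http_error(err, max_body_chars=2000):
    """Collect what the server sent along with an HTTP error (hypothetical helper)."""
    # HTTPError is also a response object, so its body can be read like a normal reply.
    raw = err.read()
    charset = err.headers.get_content_charset() if err.headers else None
    body = raw.decode(charset or "utf-8", errors="replace")
    return {
        "code": err.code,
        "reason": err.reason,
        "headers": dict(err.headers.items()) if err.headers else {},
        "body": body[:max_body_chars],  # truncated; the limit is an arbitrary choice
    }


def fetch_or_explain(url, user_agent="Mozilla/5.0"):
    """Return the page bytes, or print the server's error details and return None."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        info = describe_http_error(err)
        print(info["code"], info["reason"])
        for name, value in info["headers"].items():
            print("{}: {}".format(name, value))
        print(info["body"])
        return None
```

Calling `fetch_or_explain("http://www.guangxindai.com")` would print the status, the response headers, and the start of the error page. Headers such as `Server` or `Set-Cookie`, or a script in the body, may hint at what kind of check is rejecting the request, though the server is free to say nothing useful.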