Python urllib.request.urlopen() returning error 403: answers

【Question title】: Python urllib.request.urlopen() returning error 403
【Posted】: 2016-01-06 03:08:51
【Question description】:

I am trying to download the HTML of a page (http://www.guangxindai.com in this case), but I get error 403. Here is my code:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
f = opener.open("http://www.guangxindai.com")
f.read()

But I get an error response:

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    f = opener.open("http://www.guangxindai.com")
  File "C:\Python33\lib\urllib\request.py", line 475, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 587, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 513, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 595, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

I have tried different request headers, but I still cannot get a correct response. I can view the page in a browser, which seems strange to me. My guess is that the site uses some method to block web spiders. Does anyone know what is going on? How can I fetch the page's HTML correctly?

【Question comments】:

Based on the information provided, all we can infer is what the RFC says: "403 Forbidden: The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead."
However, Wikipedia has a list of 403 "sub-codes"; I am not sure whether urllib lets you check those sub-codes.
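
Since the RFC text says the server SHOULD describe the reason for the refusal in the entity, it can help to read the body that comes with the 403 instead of only looking at the traceback. The following is a minimal sketch, not something from the original thread: `describe_http_error` is a hypothetical helper name, and the 10-second timeout and UTF-8 decoding are assumptions. It relies on the fact that `urllib.error.HTTPError` can be read like a normal response.

```python
import urllib.error
import urllib.request


def describe_http_error(url, headers=None, timeout=10):
    """Return (status, reason, body_text), even when the server refuses.

    describe_http_error is a hypothetical helper used for illustration.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status, reason, raw = response.status, response.reason, response.read()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code, reason phrase, headers and body
        # of the error response, so the refusal page can still be inspected.
        status, reason, raw = err.code, err.reason, err.read()
    # Assumption: decode as UTF-8 and replace anything that does not fit.
    return status, reason, raw.decode("utf-8", errors="replace")
```

Calling `describe_http_error("http://www.guangxindai.com")` and printing the first few hundred characters of the returned body should show whatever explanation, if any, the server chose to send with the 403.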

Tags: python request urlopen

【Solution 1】:

I ran into the same problem as you, and I found the answer in another post.

The answer given there by Stefano Sanfilippo is very simple and worked for me:

from urllib.request import Request, urlopen

url_request = Request("http://www.guangxindai.com", 
                      headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(url_request).read()
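
The same idea can be wrapped in a small function that also closes the connection and decodes the bytes to text. This is only a sketch under stated assumptions: `fetch_html` is a hypothetical name, the timeout value is arbitrary, and falling back to UTF-8 when the server declares no charset is a guess that may not suit every site.

```python
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Download a page with an explicit User-Agent and return decoded text.

    fetch_html is a hypothetical helper, not part of urllib.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in Content-Type; assume UTF-8 otherwise.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A refused request still raises `urllib.error.HTTPError`, so callers that want to handle a 403 gracefully need their own `try`/`except` around `fetch_html(...)`.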

【Discussion】:

【Solution 2】:

If your goal is just to read the page's HTML, you can use the following code. It worked for me on Python 2.7:

import urllib
f = urllib.urlopen("http://www.guangxindai.com")
f.read()

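【Discussion】:

Even if this code makes the example run, I think @zhangzhai wants an explanation of why he gets the 403.
When I run it on 2.7.10 I do get a page back, but it is just the 403 error page. It contains this line: <span class="r-tip01"><script>document.write(error_403);</script></span>
Thanks for the replies. I suspect the answer is not that simple. I have tried different request headers, but I still cannot get a correct response. I can view the page in a browser, which seems strange to me.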
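
When a bare User-Agent is not enough, one thing that can be tried is an opener that keeps cookies between requests and sends a few more browser-style headers. Nothing in this thread confirms that this gets past the check on this particular site; the sketch below only shows the mechanics. `build_browser_like_opener` and the `BROWSER_HEADERS` values are hypothetical and chosen for illustration.

```python
import http.cookiejar
import urllib.request

# Hypothetical header set for illustration; real browsers send more than this.
BROWSER_HEADERS = [
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8"),
]


def build_browser_like_opener(headers=BROWSER_HEADERS):
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replaces urllib's default "Python-urllib/x.y" User-agent header.
    opener.addheaders = list(headers)
    return opener, cookie_jar
```

With `opener, jar = build_browser_like_opener()`, a call such as `opener.open("http://www.guangxindai.com")` sends these headers, and any cookies the server sets are replayed on later requests made through the same opener. If the page is only unlocked by JavaScript running in the browser, as the `document.write(error_403)` line may hint, headers and cookies alone would not be expected to help.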