Python: Urllib2 returns 404

【Question title】: Python: Urllib2 returns 404
【Posted】: 2014-03-24 04:41:31
【Question】:

I'm trying to read some content from a URL with Python, but every attempt returns a 404.

Here is my test code, along with the URL in question:

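import urllib2

url = 'http://supercoach.heraldsun.com.au'

# Send a browser-like User-agent header with the request.
headers = {"User-agent": "Mozilla/5.0"}
req = urllib2.Request(url, None, headers)
try:
    handle = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # HTTPError carries the HTTP status code; a plain IOError may not.
    print e.code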

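The site works fine in a browser, and this script used to run without any problems, but a recent update to the site caused it to fail.

I have tried adding a User-agent header, as suggested in answers to similar questions.

Any ideas why this isn't working?

Thanks, JP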

【Comments】:

@Ruben, so you get a 303? Was that determined by running the code above? I definitely get a 404, but maybe this is system-specific.
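
One way to settle the 303 versus 404 question is to stop the client from following redirects and look at the first response directly. The following is a minimal sketch, assuming Python 3 (where urllib2 became urllib.request). The names NoRedirect, first_status, DemoHandler and demo_first_status are hypothetical, and the local server only stands in for a site that answers 303 and then 404; it is not the real site's behaviour.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so the first status code stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def first_status(url):
    """Return (status, Location header) of the first response only."""
    opener = urllib.request.build_opener(NoRedirect())
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener.open(request, timeout=10) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: 303 on "/", 404 everywhere else."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(303)
            self.send_header("Location", "/missing")
        else:
            self.send_response(404)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo_first_status():
    """Start the stand-in server and probe both paths."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base = "http://127.0.0.1:%d" % server.server_port
        return first_status(base + "/"), first_status(base + "/missing")
    finally:
        server.shutdown()
        server.server_close()
```

Calling first_status on the real URL would show whether the 404 is the site's first answer or only the end of a redirect chain.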

Tags: python urllib2

【Solution 1】:

Use requests, which provides a friendly wrapper around Python's HTTP libraries; it also handles redirection for you.

With requests, your code is simply:

import requests
r = requests.get('http://supercoach.heraldsun.com.au')

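【Discussion】:

Thanks Burhan, but I can't install packages where I am at the moment.
You can just download it and save it in the same directory as your other files; you only need to install packages if you want them available system-wide (or if they provide command-line utilities/helpers).
urllib2 can follow redirects. Your code produces TooManyRedirects: Exceeded 30 redirects.
It can (of course); I was only saying that requests doesn't need as much code to do it. requests uses urllib and others in the background.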

【Solution 2】:

Try enabling cookies and increasing the number of allowed redirects:

import urllib2
from cookielib import CookieJar

class RedirectHandler(urllib2.HTTPRedirectHandler):
    max_repeats = 100
    max_redirections = 1000

    def http_error_302(self, req, fp, code, msg, headers):
        print code
        print headers
        return urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)
    http_error_300 = http_error_302
    http_error_301 = http_error_302
    http_error_303 = http_error_302
    http_error_307 = http_error_302

cookiejar = CookieJar()
urlopen = urllib2.build_opener(RedirectHandler(),
                               urllib2.HTTPCookieProcessor(cookiejar)).open
request = urllib2.Request('http://supercoach.heraldsun.com.au',
                          headers={"User-agent": "Mozilla/5.0"})
response = urlopen(request)
print '*' * 60
print response.info()
print response.read()
response.close()
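
The following sketch illustrates why a cookie jar can matter here. It assumes Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar. The local CookieGate server is hypothetical: it redirects to itself until the client sends back the cookie it set, which is one plausible cause of a redirect loop and not a confirmed description of the real site. The names fetch and demo_cookie_gate are likewise made up for illustration.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGate(BaseHTTPRequestHandler):
    """Hypothetical server: redirects to itself until its cookie comes back."""

    def do_GET(self):
        if "seen=1" in self.headers.get("Cookie", ""):
            body = b"welcome"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(302)
            self.send_header("Set-Cookie", "seen=1; Path=/")
            self.send_header("Location", "/")
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, use_cookies):
    """Fetch url, optionally with a cookie jar; return (status, body or None)."""
    handlers = []
    if use_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener.open(request, timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Without cookies urllib gives up on the redirect loop and raises.
        return err.code, None


def demo_cookie_gate():
    """Fetch the stand-in server without and then with a cookie jar."""
    server = HTTPServer(("127.0.0.1", 0), CookieGate)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        return fetch(url, use_cookies=False), fetch(url, use_cookies=True)
    finally:
        server.shutdown()
        server.server_close()
```

In this sketch the fetch without a cookie jar ends in an HTTPError once urllib detects the loop, while the fetch with HTTPCookieProcessor reaches the page after a single redirect. When cookies are the cause, raising max_repeats and max_redirections alone would not be enough.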

【Discussion】: