【Question title】: urllib2 returns 404 for a website which displays fine in browsers
【Posted】: 2012-08-31 09:42:38
【Question】:

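I cannot open one particular URL with urllib2. The same approach works for other websites, such as "http://www.google.com", but not for this site (which displays fine in a browser).

My simple code: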

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.experts.scival.com/einstein/"
response=urllib2.urlopen(url)
html=response.read()
soup=BeautifulSoup(html)
print soup

Can anyone help me get this working?

This is the error I get:

Traceback (most recent call last):
  File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module>
    response=urllib2.urlopen(url);
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
    result = self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

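Thanks

【Comments】:

  • Stop putting semicolons at the end of lines. This is Python.
  • My mistake, I was thinking of GET parameters, but I don't think that is your problem.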

Tags: python html url urllib2


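【Solution 1】:

I just tried it and received a 404 code together with page content.

Presumably the site performs user-agent detection and, accidentally or deliberately, does not serve its content to Python's urllib.

To clarify: with urllib, urlopen returns a response object that carries the 404 code and the HTML content. With urllib2.urlopen, a urllib2.HTTPError exception is raised instead.

I suggest you try setting your user agent to something that looks like a browser. There is a question about that here: Changing user agent on urllib2.urlopen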
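As a minimal sketch of that suggestion, the request can be built with a browser-like User-Agent header before it is opened. The `BROWSER_UA` string and the helper names `make_browser_request` and `fetch` are illustrative assumptions, not something from the thread, and there is no guarantee that this particular site accepts this value. The import fallback is there only so the same code also runs on Python 3, where the urllib2 API lives in `urllib.request`.

```python
try:
    import urllib2 as urlrequest          # Python 2, as used in this thread
except ImportError:
    import urllib.request as urlrequest   # Python 3 home of the same API

# Hypothetical browser-like string; substitute any realistic value.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")


def make_browser_request(url, user_agent=BROWSER_UA):
    """Build a Request that announces a browser-like User-Agent."""
    return urlrequest.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Open url with the browser-like header (performs network I/O)."""
    response = urlrequest.urlopen(make_browser_request(url))
    try:
        return response.read()
    finally:
        response.close()


# Building the request does not touch the network, so it is safe to inspect.
req = make_browser_request("http://www.experts.scival.com/einstein/")
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Calling `fetch(url)` would then send the request. If the server still answers 404, the user agent was probably not the deciding factor.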

【Comments】:

  • That was my guess too; you beat me to it.

【Solution 2】:

You can use try/except to catch the error:

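import urllib2

url = "http://www.experts.scival.com/einstein/"
req = urllib2.Request(url)
try:
    u = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # HTTPError carries the status code and reason of the failed request
    print e.code
    print e.msg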
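Building on the try/except idea, the caught `HTTPError` is itself a file-like response, so the error page the server sent can still be read from it. The sketch below wraps that in a helper; `open_allowing_errors` and `fake_opener` are hypothetical names introduced for illustration, and the fake opener exists only so the example runs without network access. The import fallback lets the same code run on Python 3.

```python
import io

try:
    import urllib2 as urlrequest          # Python 2, as used in this thread
except ImportError:
    import urllib.request as urlrequest   # Python 3 home of the same API


def open_allowing_errors(req, opener=urlrequest.urlopen):
    """Return (status, body) even when the server answers with an error status."""
    try:
        response = opener(req)
    except urlrequest.HTTPError as e:
        # HTTPError doubles as a response object: it holds the error page body.
        return e.code, e.read()
    try:
        return response.getcode(), response.read()
    finally:
        response.close()


def fake_opener(req):
    """Stand-in for urlopen that always fails with a 404 (no network needed)."""
    raise urlrequest.HTTPError("http://example.invalid/", 404, "Not Found",
                               {}, io.BytesIO(b"<html>missing</html>"))


status, body = open_allowing_errors("http://example.invalid/", opener=fake_opener)
print(status)
```

With the default `opener`, the helper performs a real request. Inspecting the returned body of a 404 can show whether the server sent a genuine "not found" page or a page aimed at non-browser clients.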

【Comments】:

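【Solution 3】:

Hmm... are you sure that URL is valid? Try "http://www.google.com". I have similar code and urllib has no problems with it. Alternatively, you can use a try/except statement to see the details of the error. Of course, MattH's answer is probably very close to the truth :)

【Comments】: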
