【Question title】: Python + Mechanize not working with Delicious
【Posted】: 2011-05-27 10:56:53
【Question】:

I'm using Mechanize and Beautiful Soup to scrape some data from Delicious:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://www.delicious.com/varunsrin"
page = mech.open(url)
html = page.read()

soup = BeautifulSoup(html)
print soup.prettify()

This works for most sites I throw it at, but it fails on Delicious with the following output:

Traceback (most recent call last):
  File "C:\Users\Varun\Desktop\Python-3.py", line 7, in <module>
    page = mech.open(url)
  File "C:\Python26\lib\site-packages\mechanize\_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "C:\Python26\lib\site-packages\mechanize\_mechanize.py", line 255, in _mech_open
    raise response
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
C:\Program Files (x86)\ActiveState Komodo IDE 6\lib\support\dbgp\pythonlib\dbgp\client.py:1360: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
    child = getattr(self.value, childStr)
C:\Program Files (x86)\ActiveState Komodo IDE 6\lib\support\dbgp\pythonlib\dbgp\client.py:456: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
    return apply(func, args)

【Comments】:

  • Really, look at the traceback: "request disallowed by robots.txt". So search for "mechanize robots.txt", and you will find that the Mechanize home page tells you about robots.txt and `set_handle_robots`.
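
To make the 403 in the traceback less mysterious, here is a minimal sketch of the kind of robots.txt check a crawler performs before fetching a page. It uses Python 3's standard-library `urllib.robotparser` rather than mechanize itself, so mechanize's internal handling may differ in detail. The `SAMPLE_ROBOTS_TXT` contents and the `is_allowed` helper are assumptions made up for illustration; they are not the file Delicious actually served.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, invented for this example.
# It is NOT the real file served by Delicious.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed = is_allowed(
    SAMPLE_ROBOTS_TXT,
    "Python-urllib",
    "http://www.delicious.com/varunsrin",
)
print(allowed)
```

A client that honors robots.txt refuses the request when a check like this fails, which is what the "request disallowed by robots.txt" message reports. Calling `set_handle_robots(False)` on the mechanize `Browser` turns that check off.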

Tags: python web-crawler mechanize scraper


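【Solution 1】:

I got some tips on emulating a browser with python+mechanize from here. Setting addheaders and calling set_handle_robots(False) appears to be the minimum required. With the code below, I get output: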

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

br = Browser()    
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/varunsrin"
page = br.open(url)
html = page.read()

soup = BeautifulSoup(html)
print soup.prettify()
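
For readers without mechanize installed, the header half of this fix can be sketched with Python 3's standard library. This is a swapped-in technique, not the answer's method: `urllib.request` never consults robots.txt, so there is no counterpart to `set_handle_robots` here. The `build_request` helper is a hypothetical name, and the request is only constructed, not sent.

```python
from urllib.request import Request

# Same browser-like User-Agent string used in the answer above.
USER_AGENT = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) "
    "Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1"
)


def build_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a GET request with a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.delicious.com/varunsrin")
# Request normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

To actually fetch the page, the request object would be passed to `urllib.request.urlopen(req)`. Whether the site still responds at that URL is not something this sketch assumes.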

【Comments】:
