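【Posted】: 2012-02-04 12:59:45
【Question】:
I'm trying to write a simple scraper in Python, using BeautifulSoup and Mechanize, for an academic project. I want to collect the prices of some products from Amazon so that I can test various theories about their pricing model. The problem is that, seemingly at random, BeautifulSoup does not get the whole HTML page from Mechanize. I log to a text file every time the error occurs, and each time the Mechanize page is complete while the BeautifulSoup page contains only about half of it. Here is my code: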
    import traceback

    import mechanize
    # BeautifulSoup 3 import assumed; with bs4 use: from bs4 import BeautifulSoup
    from BeautifulSoup import BeautifulSoup

    # extract_price() and log() are defined elsewhere in the script (not shown).


    def process_product_url(product_url):
        """Scrapes and returns all the data in the given product url"""
        # Download product_page given product_url
        product_page_mech, product_page_bs = get_product_page_mech_bs(product_url)
        # Extract Price
        price = extract_price(product_page_bs)
        return price


    def get_product_page_mech_bs(url):
        """Takes a product page url in str and returns the mech page and bs page"""
        while True:
            mech_page = get_mech_page(url)
            bs_page = BeautifulSoup(unicode(mech_page.response().read(), 'latin-1'))
            if not test_product_page(bs_page):
                # Log both versions of the page so they can be compared later
                log(unicode(bs_page))
                log(unicode(mech_page.response().read(), 'latin-1'))
                continue
            return mech_page, bs_page


    def test_product_page(product_page_bs):
        """Takes a BS product page and tests to see if proper"""
        # Was rank_page_bs, which is not defined in this function
        return product_page_bs.findAll('span', attrs={'id': 'actualPriceValue'}) != []


    def get_mech_page(url):
        """Given a URL, returns Mechanize page object"""
        while True:
            try:
                br = initialize_browser()
                br.open(url)
                return br
            except Exception, e:
                print e
                # print_exc() prints the traceback itself and returns None
                traceback.print_exc()
                continue


    def initialize_browser():
        """Returns a fully setup mechanize browser instance"""
        br = mechanize.Browser()
        br.addheaders = [("User-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0.1) Gecko/20100101 Firefox/9.0.1")]
        return br
I have uploaded the BeautifulSoup output and the Mechanize output for this page: http://www.amazon.com/Fujifilm-X-Pro-Digital-Camera-Body/dp/B006UV6YMQ/ref=sr_1_2?s=electronics&ie=UTF8&qid=1328359488&sr=1-2 (I can't paste more than two links)
Edit: clarified and expanded.
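One side note on the code above: both `while True` loops retry forever, so a page that never passes `test_product_page` would hang the script. A minimal sketch of a bounded retry follows. The names `fetch_with_retries`, `fetch`, `is_valid`, `max_attempts` and `delay_seconds` are hypothetical and not part of the original script, and the sample pages are made-up stand-ins for downloaded HTML.

```python
import time


def fetch_with_retries(fetch, is_valid, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until is_valid(result) is true or the attempts run out.

    Returns (result, attempts_used); result is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_valid(result):
            return result, attempt
        # Pause before the next attempt (no pause with the default of 0.0)
        time.sleep(delay_seconds)
    return None, max_attempts


# Stand-in for the download: two truncated pages, then a complete one.
pages = iter([
    '<html>half',
    '<html>half',
    '<html><span id="actualPriceValue">PRICE</span></html>',
])

page, attempts = fetch_with_retries(
    lambda: next(pages),
    lambda html: 'actualPriceValue' in html,
)
print(attempts)  # prints 3
```

In the real script, `fetch` would wrap the download-and-parse step and `is_valid` would be `test_product_page`; a `None` result then signals that the URL should be logged and skipped rather than retried again.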
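【Comments】:
- Maybe if we had the implementation of `get_mech_page` and a URL that demonstrates the problem, someone would give it a try.
- Alternatively, could you print `mech_page` and `bs_page`, and compare them with the raw HTML returned by `urllib2.urlopen(url).read()`?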
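The comparison suggested in the last comment could be sketched roughly as below. This is only an illustration: `compare_html` and `looks_truncated` are hypothetical helpers, and the two strings are made-up stand-ins for the text read from the Mechanize response and for `unicode(bs_page)`. It works on plain strings, so it does not depend on any particular parser.

```python
def compare_html(raw_html, parsed_html, marker='actualPriceValue'):
    """Summarise how the parser's output differs from the raw download."""
    return {
        'raw_length': len(raw_html),
        'parsed_length': len(parsed_html),
        'marker_in_raw': marker in raw_html,
        'marker_in_parsed': marker in parsed_html,
    }


def looks_truncated(report):
    """True when the marker survived the download but not the parse."""
    return report['marker_in_raw'] and not report['marker_in_parsed']


# Made-up stand-ins for the raw page and the parser's serialised output.
raw = ('<html><body><p>intro</p>'
       '<span id="actualPriceValue">PRICE</span></body></html>')
parsed = '<html><body><p>intro</p></body></html>'

report = compare_html(raw, parsed)
print(report)
print(looks_truncated(report))  # prints True
```

If the marker is present in the raw HTML but missing from the parsed output, the download is fine and the content is being lost in the parsing step; if it is missing from both, the problem is upstream of BeautifulSoup.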
Tags: python beautifulsoup mechanize