Fix character encoding of a webpage using Python Mechanize

【Question title】: Fix Character encoding of webpage using python Mechanize
【Posted】: 2015-05-29 07:02:11
【Question】:

I am trying to submit a form on this page using Mechanize.

import mechanize

br = mechanize.Browser()
br.open("http://mspc.bii.a-star.edu.sg/tankp/run_depth.html")
# select the first form on the page
br.select_form(nr=0)
# fill in the form input
br['pdb_id'] = '1atp'
req = br.submit()

However, this produces the following error:

mechanize._form.ParseError: expected name token at '<! INPUT PDB FILE>\n\t'

I think this is caused by some faulty character encoding (ref). I would like to know how to fix it.

Tags: python mechanize

【Solution 1】:

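Your problem is not the character encoding but some broken HTML comment tags, which make the page invalid in a way that mechanize's default parser cannot read. The error message points at one of them: "<! INPUT PDB FILE>" is not a well-formed comment (a valid one would be written "<!-- INPUT PDB FILE -->"). However, you can use the included BeautifulSoup parser instead, which works in my case (Python 2.7.9, mechanize 0.2.5):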

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import mechanize

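# RobustFactory parses pages with the bundled BeautifulSoup parser,
# which tolerates malformed markup such as the broken comment tags
br = mechanize.Browser(factory=mechanize.RobustFactory())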
br.open('http://mspc.bii.a-star.edu.sg/tankp/run_depth.html')
br.select_form(nr=0)
br['pdb_id'] = '1atp'
response = br.submit()
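
If switching the parser is not an option, another possible approach is to strip the malformed tags from the response before mechanize parses the forms. The following is only a sketch, not the method used above: it assumes the only problem on the page is pseudo-comments of the form "<! ...>" as seen in the error message, and the helper names strip_broken_declarations and open_cleaned are made up for illustration. The mechanize calls it relies on (Browser.response(), get_data(), set_data() and Browser.set_response()) are the ones mechanize documents for modifying a response.

```python
import re

# Matches "<!" followed by whitespace, e.g. "<! INPUT PDB FILE>".
# Valid comments ("<!-- ... -->") and "<!DOCTYPE ...>" are left alone
# because the character after "<!" is not whitespace.
BROKEN_DECLARATION = r"<!\s[^>]*>"


def strip_broken_declarations(html):
    """Remove malformed '<! ...>' pseudo-comments from text or bytes."""
    if isinstance(html, bytes):
        pattern = BROKEN_DECLARATION.encode("ascii")
    else:
        pattern = BROKEN_DECLARATION
    # type(html)() is an empty value of the same type as the input
    return re.sub(pattern, type(html)(), html)


def open_cleaned(browser, url):
    """Hypothetical helper: open url, then hand the cleaned HTML back.

    `browser` is expected to be a mechanize.Browser instance.
    """
    browser.open(url)
    response = browser.response()  # a copy of the current response
    response.set_data(strip_broken_declarations(response.get_data()))
    browser.set_response(response)  # forms are now parsed from cleaned data
    return browser
```

With this in place, one would call open_cleaned(mechanize.Browser(), url) and then continue with select_form and submit as before. The regular expression is a heuristic, so it is worth checking that it does not remove anything meaningful on the target page.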
