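[Posted]: 2012-01-27 14:59:07
[Question]:
I'm trying to parse arbitrary documents downloaded from the web, and yes, I have no control over their content.
Since Beautiful Soup "won't choke if you give it bad markup", I wonder why it gives me this trouble when part of a document is malformed, and whether there is a way to make it resume at the next readable part of the document regardless of the error.
The error occurs on line 3 of this snippet: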
from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)
The full CLI output is:
Traceback (most recent call last):
  File "./grablinks", line 101, in <module>
    sys.exit(main())
  File "./grablinks", line 88, in main
    links = grab_links(options)
  File "./grablinks", line 36, in grab_links
    doc = doc_parser(reader)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
[Comments]:
-
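What kind of input are you feeding to BeautifulSoup? Judging by the error message, you may be parsing some non-ASCII data (for example, text containing non-Latin characters)?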
-
The data I'm parsing comes from the wild web, and some of it is certainly non-ASCII.
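The last frame of the traceback, `getattr(self, 'end_' + tag)`, suggests the failure comes from a closing tag whose name contains non-ASCII characters: under Python 2, a unicode attribute name is implicitly encoded as ASCII before the lookup. The sketch below only imitates that step with an explicit `encode("ascii")` so it also runs on Python 3. The tag name and the helper `to_ascii_attribute_name` are hypothetical, since the actual tag in the failing document is not shown.

```python
# Hypothetical tag name; the real tag from the failing document is unknown.
tag = u"caf\u00e9"

# sgmllib builds the handler name this way before calling getattr().
method_name = u"end_" + tag


def to_ascii_attribute_name(name):
    """Imitate the implicit ASCII encoding Python 2 applies to a
    unicode attribute name (assumption: this is the step that fails)."""
    return name.encode("ascii")


try:
    to_ascii_attribute_name(method_name)
    error = None
except UnicodeEncodeError as exc:
    # Same exception type as in the traceback above.
    error = exc

print(repr(error))
```

If this reading is right, the problem is less about malformed markup and more about tag names that cannot be represented in ASCII.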
Tags: python unicode beautifulsoup