Trouble with parsing HTML with unicodes through Beautiful Soup

【Question title】: Trouble with parsing HTML with unicodes through Beautiful Soup
【Posted】: 2011-12-07 19:59:10
【Question】:

Beautiful Soup doesn't seem to work (for me) if the HTML contains non-ASCII characters, i.e. Unicode code points above 127. What is the appropriate encoding to decode with?

    import BeautifulSoup

    raw = open('index.html').read()
    BeautifulSoup.BeautifulSoup(raw)

Error:

    ...stacktrace...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8094: ordinal not in range(128)

【Comments】:

Tags: python regex html-parsing beautifulsoup

【Answer 1】:

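The problem isn't with parsing the file. Using the link you provided in your comment to Marco, soup = BeautifulSoup(urllib.urlopen(your_link)) works perfectly fine.

You only run into trouble when you try to output the parsed data to the console, because it has now been converted to Unicode, and Python tries to output it as ASCII unless you tell it otherwise. So using print soup in your console, rather than just soup, will work.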
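To see that the failure comes from the ASCII encoding step and not from the parser, here is a minimal sketch that leaves Beautiful Soup out entirely. The string `text` is a made-up stand-in for text extracted from a parsed page, not the asker's actual data, and the snippet is written so it should run on Python 3.

```python
# Hypothetical stand-in for a Unicode string taken from parsed HTML.
text = u"caf\xe9"

# Encoding to ASCII fails for any character above code point 127.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Choosing an encoding that covers the character succeeds.
utf8_bytes = text.encode("utf-8")

# Alternatively, keep ASCII but escape what it cannot represent.
escaped = text.encode("ascii", "backslashreplace")

print(ascii_error)
print(utf8_bytes)
print(escaped)
```

The message stored in `ascii_error` ends with the same "ordinal not in range(128)" text as the traceback in the question, which suggests the output step is where things go wrong.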

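【Comments】:

How would you solve this if you can't use a print statement? (More information here: stackoverflow.com/questions/7769745/…)
You don't need to, that's the point. The problem only occurs when outputting to the console.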
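If the text does have to go somewhere other than the console, one possible approach is to pick the encoding explicitly on the output stream. The sketch below is only an illustration under assumptions: the in-memory `buffer` stands in for a file or socket, and `text` is a hypothetical string, neither of which comes from the thread.

```python
import io

# Hypothetical Unicode text, e.g. something pulled out of parsed HTML.
text = u"r\xe9sum\xe9"

# An in-memory byte buffer standing in for a file or other byte sink.
buffer = io.BytesIO()

# Wrap the byte sink so that text written to it is encoded as UTF-8.
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(text)
writer.flush()

written = buffer.getvalue()
print(written)
```

Because the encoding is stated up front, no implicit ASCII conversion gets a chance to happen on the way out.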