BeautifulSoup doesn't give me Unicode

【Question title】: BeautifulSoup doesn't give me Unicode
【Posted】: 2011-03-12 16:21:37
【Question】:

I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode out of it. Here is a code snippet:

import urllib2
from libs.BeautifulSoup import BeautifulSoup

# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'

data = urllib2.urlopen(url).read()
print 'Encoding of fetched HTML : %s' % type(data)

soup = BeautifulSoup(data)
print 'Encoding of souped up HTML : %s' % soup.originalEncoding

table = soup.table
print type(table.renderContents())

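The raw data returned from the page is a string. BS reports the original encoding as ISO-8859-1. I thought BS automatically converted everything to Unicode, so why is it that when I do this: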

table = soup.table
print type(table.renderContents())

...it gives me a string object instead of Unicode?

How do I get Unicode objects from BS?

I'm really, really lost on this. Any help? Thanks in advance.

【Comments】:

Tags: python unicode character-encoding beautifulsoup

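【Answer 1】:

As you may have noticed, renderContents returns (by default) a byte string encoded in UTF-8. If you really want a Unicode string representing the entire document, you can either call unicode(soup) or decode the output of renderContents/prettify yourself, for example with unicode(soup.prettify(), "utf-8").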
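The decoding step can be illustrated without Beautiful Soup at all. The following is only a sketch, written for Python 3, where `bytes` plays the role of Python 2's `str` and `str` plays the role of Python 2's `unicode`. The sample text and the variable names are made up for illustration; `rendered` merely stands in for the UTF-8 byte string that renderContents returns by default.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
# All names and sample data here are hypothetical, not taken from the question.

page_text = "caf\u00e9 r\u00e9sum\u00e9"

# What the server sent: the same text as ISO-8859-1 bytes.
fetched = page_text.encode("iso-8859-1")

# Stand-in for the default renderContents() output: UTF-8 bytes,
# regardless of the encoding the page originally used.
rendered = page_text.encode("utf-8")

# Decoding with the encoding the bytes were written in recovers the text.
decoded = rendered.decode("utf-8")

# Decoding with the page's original encoding instead produces mojibake,
# because the rendered bytes are UTF-8, not ISO-8859-1.
wrong = rendered.decode("iso-8859-1")

print(type(rendered), type(decoded))
print(decoded == page_text, wrong == page_text)
```

The point is that the rendered output has to be decoded as UTF-8, not with the value reported by originalEncoding.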

Related

How to render contents of a tag in unicode in BeautifulSoup?

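【Comments】:

【Answer 2】:

originalEncoding is exactly that: the encoding of the source. The fact that BS stores everything internally as unicode does not change that value. When you traverse the tree, all text nodes are unicode, all tags are unicode, and so on, unless you convert them some other way (for example with print, str, prettify, or renderContents).

Try doing something like: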

soup = BeautifulSoup(data)
print type(soup.contents[0])

Unfortunately, everything else you have tried so far happens to hit the very few methods in BS that convert to byte strings.
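For readers on a newer setup, the same distinction can be seen with the successor library. This is a sketch only, and it assumes Python 3 with the bs4 package (Beautiful Soup 4) installed, which is not the `libs.BeautifulSoup` version 3 module used in the question. The sample markup is invented. In bs4, text nodes are `str` subclasses, `decode_contents()` returns text, and `encode_contents()` returns bytes (UTF-8 by default).

```python
# Assumption: Beautiful Soup 4 (bs4) on Python 3, not the BS3 module from the question.
from bs4 import BeautifulSoup

# Invented sample page, delivered as ISO-8859-1 bytes like the one in the question.
markup = "<html><body><table><tr><td>caf\u00e9</td></tr></table></body></html>"
raw = markup.encode("iso-8859-1")

# Tell the parser the source encoding explicitly instead of relying on detection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")
table = soup.table

# Walking the tree yields text (NavigableString is a str subclass).
cell_text = table.td.string

# Rendering as text versus rendering as encoded bytes.
as_text = table.decode_contents()
as_bytes = table.encode_contents()

print(type(cell_text), type(as_text), type(as_bytes))
```

Here `as_bytes.decode("utf-8")` gives back the same text as `as_text`, which mirrors the BS3 advice above: traverse the tree for Unicode, and decode anything that comes out of a rendering method as bytes.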

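【Comments】:

It gives me <class 'libs.BeautifulSoup.BeautifulSoup.Declaration'> for type(soup.contents[0]) and <type 'instance'> for type(soup.contents[2])
I looked at the BS source and found that to get a Unicode string you have to call renderContents(None). That returns Unicode. I don't know why the documentation says otherwise.
@mridang: Yes, I should have given you a document to try this on. Yours is well-formed, so the first few elements in contents will be metadata rather than the real content objects. Either try the examples in the documentation, or actually traverse the tree and get tag names and text, without using the methods that the documentation calls out as specifically not returning unicode (such as renderContents).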