Reading an XML file which contains unicode in python 2.7: answers

【Question title】: Reading an XML file which contains unicode in python 2.7
【Posted】: 2014-07-21 15:47:14
【Question】:

I am trying to use ElementTree with Python 2.7.6 to parse an XML file that comes from a server and is encoded in unicode, and to save the data it contains locally.

import unicodedata
import xml.etree.ElementTree as ET

def normalize(string):
    if isinstance(string, unicode): 
        normalized_string  = unicodedata.normalize('NFKD', string).encode('ascii','ignore')
    elif isinstance(string, str):
        normalized_string  = string
    else:
        print "no string"
        normalized_string  = string

    normalized_string  = ''.join(e for e in normalized_string if e.isalnum())
    return normalized_string

tree = ET.parse('test.xml')
root = tree.getroot()

for element in root:
    value = element.find('value').text
    filename = normalize(element.find('name').text.encode('utf-8')) + '.txt'
    target = open(filename, 'a')
    target.write(value + '\n')
    target.close()

The file I am parsing has a structure similar to the following, which I saved locally as test.xml:

<data> 
<product><name>Something with a space</name><value>10</value> </product>
<product><name>Jakub Šlemr</name><value>12</value></product>
<product><name>Something with: a colon</name><value>11</value></product>
</data>

The code above has several problems that I would like to solve:

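The Unicode character Š is not handled well by this code. Edit: this has been solved; part of the cause was a wrong file encoding.
I want to avoid special characters such as spaces and colons in the filenames. What is the best way to preprocess these? I built a normalize function based on the answers to "Remove all special characters, punctuation and spaces from string" and "Convert a Unicode string to a string in Python (containing extra symbols)". Is this a viable approach?
Assuming every element has an entry called value, is element.find('value').text the best way to access the values stored in the XML document?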
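For the second point, one possible shape for the normalization is sketched below. The helper name safe_filename and the fallback name 'unnamed' are hypothetical and not part of the original code. The sketch assumes the name is passed in as a unicode object (not a UTF-8 encoded byte string), because unicodedata.normalize only accepts unicode in Python 2.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata


def safe_filename(name, default=u'unnamed'):
    """Reduce a unicode name to ASCII letters and digits (hypothetical helper)."""
    # NFKD splits an accented character into a base letter plus a
    # combining mark, so dropping non-ASCII code points keeps the letter.
    decomposed = unicodedata.normalize('NFKD', name)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Remove everything that is not an ASCII letter or digit
    # (spaces, colons, punctuation).
    cleaned = re.sub(r'[^A-Za-z0-9]', '', ascii_only)
    # Fall back to a placeholder if nothing is left.
    return cleaned or default
```

Note that different names can collapse to the same filename with this approach (for example, names that differ only in punctuation), so appending to those files would mix their values.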

【Question comments】:

Tags: xml python-2.7 unicode elementtree

【Solution 1】:

The values in element.find('value').text are unicode objects. When you append them to ASCII string objects such as '.txt', they are concatenated with the required conversion.

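Unicode objects have to be serialized (encoded) before they can be printed or stored. If you do not do this explicitly, Python does it implicitly using the default encoding setting. The default encoding is ASCII, which supports only a very limited character set, so any input data containing non-ASCII characters results in a UnicodeEncodeError.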
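A minimal sketch of that failure, using the name from the sample file. The explicit encode('ascii') call stands in for the implicit conversion Python 2 performs when a unicode object is written to a byte-oriented file; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
text = u'Jakub \u0160lemr'  # contains the non-ASCII character U+0160

# Encoding with the ASCII codec fails for any non-ASCII character.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with UTF-8 succeeds and yields a byte string.
utf8_bytes = text.encode('utf-8')
```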

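I suggest encoding your unicode objects to strings explicitly with the encode() method, using a codec that suits your solution. For example, to encode the text element as a UTF-8 encoded string, call: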

element.find('value').text.encode('utf-8')
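As an alternative to calling encode() on every value, the file can be opened with io.open and an explicit encoding, so the encoding happens at write time. This is only a sketch: the helper name append_line is hypothetical, and UTF-8 is assumed as the target encoding.

```python
# -*- coding: utf-8 -*-
import io


def append_line(path, text):
    """Append one line of text to path, encoded as UTF-8 (hypothetical helper)."""
    # io.open returns a text-mode file object that encodes on write;
    # it is available in Python 2.6+ and is the built-in open in Python 3.
    with io.open(path, 'a', encoding='utf-8') as target:
        # Concatenating with a unicode literal keeps the result unicode,
        # which the text-mode file requires in Python 2.
        target.write(text + u'\n')
```

With this helper the loop body would call append_line(filename, value) instead of opening and closing the file by hand.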

Also, check that the encoding attribute in the XML is set correctly. A wrong encoding is a likely cause of parse errors.
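The effect of a mismatched declaration can be sketched as follows. The sample document is made up to resemble test.xml, and ISO-8859-2 is chosen only as an example of bytes that do not match a UTF-8 declaration.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

xml_text = (u'<?xml version="1.0" encoding="UTF-8"?>'
            u'<data><product><name>Jakub \u0160lemr</name>'
            u'<value>12</value></product></data>')

# Bytes that match the declared encoding parse cleanly.
root = ET.fromstring(xml_text.encode('utf-8'))
name = root.find('product/name').text

# Bytes in a different encoding than the one declared are rejected,
# because the byte for U+0160 is not valid UTF-8 here.
try:
    ET.fromstring(xml_text.encode('iso-8859-2'))
    mismatch_detected = False
except ET.ParseError:
    mismatch_detected = True
```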

【Comments】:

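It is hardly worth mentioning, but 'name' is the problematic tag, not 'value'. Since I want to normalize the names so that they contain no colons and the like, I introduced a function that normalizes the name to alphanumeric ASCII. Do I still need to encode explicitly in that case? Also: yes, the file encoding was indeed off.
This answer is wrong. When Python 2.7 concatenates a Unicode object and a string, it up-converts the string to Unicode; it does not down-convert the Unicode to ASCII.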
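The behaviour described in the last comment can be made explicit with a short sketch. Python 2 performs the decode('ascii') step implicitly when a unicode object and a byte string are concatenated; it is written out here so the same code also runs on Python 3, where the implicit step no longer exists. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
name = u'Jakub \u0160lemr'
suffix = b'.txt'  # a plain Python 2 str literal is a byte string like this

# The byte string is promoted to unicode; the unicode name is left alone.
filename = name + suffix.decode('ascii')

# The promotion fails if the byte string itself holds non-ASCII bytes,
# for example a name that was already encoded to UTF-8.
encoded_name = name.encode('utf-8')
try:
    encoded_name.decode('ascii')
    promotion_failed = False
except UnicodeDecodeError:
    promotion_failed = True
```

This is why mixing an already encoded name with unicode values tends to fail with UnicodeDecodeError rather than UnicodeEncodeError.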