【Question title】: ElementTree does not write data in UTF-8
【Posted】: 2015-01-27 07:46:51
【Question】:

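I am pulling data from a database and trying to build an XML file from it. The data is UTF-8 encoded and can contain characters such as á, š, or č. Here is the code: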

import xml.etree.cElementTree as ET

tree = ET.parse(metadata_file)
# ..some commands that alter the XML..
tree.write(metadata_file, encoding="UTF-8")

When the data is written, the script fails:

Traceback (most recent call last):
  File "get-data.py", line 306, in <module>
    main()
  File "get-data.py", line 303, in main
    tree.write(metadata_file, encoding="UTF-8")
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 32: ordinal not in range(128)

The only way I found to prevent this is to decode the data that goes into the XML file:

text = text.decode('utf-8')

But then the resulting file contains, for example, &#269; instead of č. Any idea how to write the data to the file and have it saved as UTF-8?

Edit:

Here is an example of what the script does:

]$ echo "<data></data>" > test.xml
]$ cat test.xml
<data></data>
]$ python
Python 2.7.5 (default, Nov  3 2014, 14:33:39)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.cElementTree as ET
>>> tree = ET.parse('./test.xml')
>>> root = tree.getroot()
>>> new = ET.Element("elem")
>>> new.text = "á, š, or č"
>>> root.append(new)
>>> tree.write('./text.xml', encoding="UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

【Comments on the question】:

Tags: python xml utf-8 elementtree


【Solution 1】:

Ah, I finally figured it out. This is the correct way to do it:

]$ echo "<data></data>" > test.xml
]$ python
Python 2.7.5 (default, Nov  3 2014, 14:26:24)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.cElementTree as ET
>>>
>>> tree = ET.parse('./test.xml')
>>> root = tree.getroot()
>>> new = ET.Element("elem")
>>> new.text = "á, š, or č".decode('utf-8')
>>> root.append(new)
>>> tree.write('./textout.xml', encoding="UTF-8")
>>>
>>> exit()
]$ cat textout.xml
<?xml version='1.0' encoding='UTF-8'?>
<data><elem>á, š, or č</elem></data>

In my original attempt I was encoding to UTF-8 in write(), but I had not first decoded the byte string with .decode('utf-8').
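
The traceback in the question ends in a UnicodeDecodeError even though the failing call is .encode(). In Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, and that hidden step is what fails on byte 0xc3. The sketch below is a Python 3 illustration of the same idea rather than the original script: it reproduces the error message with an explicit ASCII decode, then decodes the bytes as UTF-8 once before giving them to ElementTree and encodes once on output. The names raw_value, text_value, and out_path are made up for this example, and raw_value only stands in for a value fetched from the database.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Assumption: this stands in for a UTF-8 byte string fetched from the database.
raw_value = "á, š, or č".encode("utf-8")

# Python 2 ran an ASCII decode implicitly when .encode() was called on a byte
# string; doing the same decode explicitly reproduces the error message.
try:
    raw_value.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Decode once, with the real encoding, before handing the value to ElementTree.
text_value = raw_value.decode("utf-8")

root = ET.Element("data")
elem = ET.SubElement(root, "elem")
elem.text = text_value

# Encode once, on output. The output path is a throwaway temporary file.
out_path = os.path.join(tempfile.mkdtemp(), "textout.xml")
ET.ElementTree(root).write(out_path, encoding="UTF-8", xml_declaration=True)

with open(out_path, "rb") as fh:
    written = fh.read()

print(ascii_error)
print(written.decode("utf-8"))
```

Passing xml_declaration=True makes the declaration explicit, so the output does not depend on how a given Python version decides whether to emit one.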

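【Comments】:

    【Solution 2】:

    The question does not make clear what kind of object metadata_file is.

    With an ordinary file object there is no error, and the output is as expected: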

    >>> import xml.etree.cElementTree as ET
    >>> stream = open('test.xml', 'wb+')
    >>> stream.write(u"""\
    ... <root>characters such as á, š, or č.</root>
    ... """.encode('utf-8'))
    >>> stream.seek(0)
    >>> tree = ET.parse(stream)
    >>> stream.close()
    >>> ET.tostring(tree.getroot())
    '<root>characters such as &#225;, &#353;, or &#269;.</root>'
    >>> stream = open('test.xml', 'w')
    >>> tree.write(stream, encoding='utf-8', xml_declaration=True)
    >>> stream.close()
    >>> open('test.xml').read()
    "<?xml version='1.0' encoding='utf-8'?>\n<root>characters such as \xc3\xa1, \xc5\xa1, or \xc4\x8d.</root>"
    
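
The transcript above shows two serialisations of the same tree: ET.tostring() with no encoding argument emits numeric character references such as &#269;, while writing with encoding='utf-8' emits the characters as UTF-8 bytes. A small Python 3 sketch of that difference follows; it is an illustration, not part of the original answer, and the variable names as_ascii, as_utf8, and same_text are made up here. It also checks that both forms parse back to the same text, so the character references are a different spelling of the same data, not corruption.

```python
import xml.etree.ElementTree as ET

root = ET.Element("root")
root.text = "characters such as á, š, or č."

# Default serialisation targets US-ASCII, so non-ASCII characters are emitted
# as numeric character references.
as_ascii = ET.tostring(root)

# Asking for UTF-8 emits the characters themselves as UTF-8 bytes.
as_utf8 = ET.tostring(root, encoding="utf-8")

# Both byte strings parse back to the same text.
same_text = ET.fromstring(as_ascii).text == ET.fromstring(as_utf8).text == root.text

print(as_ascii)
print(as_utf8)
print(same_text)
```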

    【Comments】:

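    • It is a regular file: metadata_file="./metadata.xml". I don't open the file the way you do; I just parse the tree as shown in the question, modify it, and then write it out.
    • @mart1n. In that case, you have not shown the actual code that produces the error.
    • OK, I replicated what my script does here: pastebin.com/v8sHDDgv. It still throws the traceback despite the UTF-8 encoding.
    • @mart1n: The pastebin code should be in the question; don't "hide" it in comments on an answer.
    • @mzjn Fair enough, I have added it to the original question.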