【Question title】: Write ElementTree directly to zip with utf-8 encoding
【Posted】: 2020-06-30 12:01:03
【Question】:

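I want to modify a large number of XML files. They are stored in ZIP files. The source XMLs are utf-8 encoded (at least according to the guess of the file tool on Linux) and have a correct XML declaration: <?xml version='1.0' encoding='UTF-8'?>.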

The target ZIP and the XMLs it contains should also have a correct XML declaration. However, the (at least to me) most obvious approach, using ElementTree.tostring, fails.

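Here is a self-contained example that should work out of the box. A short walkthrough:

  • imports
  • preparation (creating src.zip; in my real application these ZIPs are given)
  • the actual work of the program (modifying the XMLs), starting at # read XMLs from zip

Please focus on the lower part, especially # APPROACH 1, APPROACH 2, and APPROACH 3.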

import os
import tempfile
import zipfile
from xml.etree.ElementTree import Element, ElementTree, parse, tostring

src_1 = os.path.join(tempfile.gettempdir(), "one.xml")
src_2 = os.path.join(tempfile.gettempdir(), "two.xml")
src_zip = os.path.join(tempfile.gettempdir(), "src.zip")
trgt_appr1_zip = os.path.join(tempfile.gettempdir(), "trgt_appr1.zip")
trgt_appr2_zip = os.path.join(tempfile.gettempdir(), "trgt_appr2.zip")
trgt_appr3_zip = os.path.join(tempfile.gettempdir(), "trgt_appr3.zip")

# file on hard disk that must be used due to ElementTree insufficiencies
tmp_xml_name = os.path.join(tempfile.gettempdir(), "curr_xml.tmp")

# prepare src.zip
tree1 = ElementTree(Element('hello', {'beer': 'good'}))
tree1.write(os.path.join(tempfile.gettempdir(), "one.xml"), encoding="UTF-8", xml_declaration=True)
tree2 = ElementTree(Element('scnd', {'äkey': 'a value'}))
tree2.write(os.path.join(tempfile.gettempdir(), "two.xml"), encoding="UTF-8", xml_declaration=True)

with zipfile.ZipFile(src_zip, 'a') as src:
    with open(src_1, 'r', encoding="utf-8") as one:
        string_representation = one.read()
    # write to zip
    src.writestr(zinfo_or_arcname="one.xml", data=string_representation.encode("utf-8"))
    with open(src_2, 'r', encoding="utf-8") as two:
        string_representation = two.read()
    # write to zip
    src.writestr(zinfo_or_arcname="two.xml", data=string_representation.encode("utf-8"))
os.remove(src_1)
os.remove(src_2)

# read XMLs from zip
with zipfile.ZipFile(src_zip, 'r') as zfile:

    updated_trees = []

    for xml_name in zfile.namelist():

        curr_file = zfile.open(xml_name, 'r')
        tree = parse(curr_file)
        # modify tree
        updated_tree = tree
        updated_tree.getroot().append(Element('new', {'newkey': 'new value'}))
        updated_trees.append((xml_name, updated_tree))

    for xml_name, updated_tree in updated_trees:

        # write to target file
        with zipfile.ZipFile(trgt_appr1_zip, 'a') as trgt1_zip, zipfile.ZipFile(trgt_appr2_zip, 'a') as trgt2_zip, zipfile.ZipFile(trgt_appr3_zip, 'a') as trgt3_zip:

            #
            # APPROACH 1 [DESIRED, BUT DOES NOT WORK]: write tree to zip-file
            # encoding in XML declaration missing
            #
            # create byte representation of elementtree
            byte_representation = tostring(element=updated_tree.getroot(), encoding='UTF-8', method='xml')
            # write XML directly to zip
            trgt1_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)

            #
            # APPROACH 2 [WORKS IN THEORY, BUT DOES NOT WORK]: write tree to zip-file
            # encoding in XML declaration is faulty (is 'utf8', should be 'utf-8' or 'UTF-8')
            #
            # create byte representation of elementtree
            byte_representation = tostring(element=updated_tree.getroot(), encoding='utf8', method='xml')
            # write XML directly to zip
            trgt2_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)

            #
            # APPROACH 3 [WORKS, BUT LACKS PERFORMANCE]: write to file, then read from file, then write to zip
            #
            # write to file
            updated_tree.write(tmp_xml_name, encoding="UTF-8", method="xml", xml_declaration=True)
            # read from file
            with open(tmp_xml_name, 'r', encoding="utf-8") as tmp:
                string_representation = tmp.read()
            # write to zip
            trgt3_zip.writestr(zinfo_or_arcname=xml_name, data=string_representation.encode("utf-8"))

    os.remove(tmp_xml_name)

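APPROACH 3 works, but it is far more resource-intensive than the other two.

APPROACH 2 is the only way I found to serialize the ElementTree object with an actual XML declaration, but the result turns out to be invalid (utf8 instead of UTF-8/utf-8).

APPROACH 1 would be the most desirable, but it fails when the file is read later in the pipeline, because the XML declaration is missing.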
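
The difference between approaches 1 and 2 can be reproduced in isolation. The following is a minimal sketch, not part of the original program; the variable names no_decl and with_decl are only illustrative. It shows that tostring omits the declaration when the encoding is spelled 'UTF-8', and otherwise emits one that repeats the encoding name exactly as it was passed in.

```python
from xml.etree.ElementTree import Element, tostring

root = Element('hello', {'beer': 'good'})

# 'UTF-8' (in any letter case) is treated as a default encoding:
# no XML declaration is written unless one is requested explicitly
no_decl = tostring(root, encoding='UTF-8', method='xml')

# any other spelling, such as 'utf8', triggers a declaration that
# copies the encoding name verbatim
with_decl = tostring(root, encoding='utf8', method='xml')

print(no_decl)
print(with_decl)
```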

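Question: How can I avoid first writing the whole XML to disk, then reading it back, writing it to the zip, and deleting it once the zip is finished? What am I missing?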

【Comments】:

  • You could probably use an io.BytesIO object. Similar to approach 3, but without any disk access.
  • Yes, that turned out to be the solution, thanks!

Tags: python python-3.x utf-8 python-3.6 elementtree


【Solution 1】:

You can use an io.BytesIO object. This lets you use ElementTree.write while avoiding exporting the tree to disk:

import zipfile
from io import BytesIO
from xml.etree.ElementTree import ElementTree, Element

tree = ElementTree(Element('hello', {'beer': 'good'}))
bio = BytesIO()
tree.write(bio, encoding='UTF-8', xml_declaration=True)
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    z.writestr('test.xml', bio.getvalue())
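
To check that the declaration really ends up in the archive, the BytesIO idea can be wrapped in a small helper and the member read back. This is only a sketch: the helper name tree_to_bytes and the in-memory archive are assumptions made for illustration, not part of the code above.

```python
import zipfile
from io import BytesIO
from xml.etree.ElementTree import ElementTree, Element


def tree_to_bytes(tree):
    """Hypothetical helper: serialize a tree to UTF-8 bytes with a declaration."""
    buffer = BytesIO()
    tree.write(buffer, encoding='UTF-8', xml_declaration=True)
    return buffer.getvalue()


tree = ElementTree(Element('scnd', {'äkey': 'a value'}))

# build the archive in memory so the sketch does not touch the disk
archive = BytesIO()
with zipfile.ZipFile(archive, 'w') as z:
    z.writestr('two.xml', tree_to_bytes(tree))

# read the member back and inspect its first line
with zipfile.ZipFile(archive, 'r') as z:
    first_line = z.read('two.xml').splitlines()[0]
print(first_line)
```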

If you are using Python 3.6 or later, there is an even shorter solution: you can get a writable file object from the ZipFile object and pass it to ElementTree.write:

import zipfile
from xml.etree.ElementTree import ElementTree, Element

tree = ElementTree(Element('hello', {'beer': 'good'}))
with zipfile.ZipFile('/tmp/test.zip', 'w') as z:
    with z.open('test.xml', 'w') as f:
        tree.write(f, encoding='UTF-8', xml_declaration=True)

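This has the additional advantage that you don't have to hold multiple copies of the tree in memory, which can be a relevant concern for large trees.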
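
Applied to the loop from the question, the same idea might look like the sketch below. It assumes Python 3.6 or later and builds both archives in memory; src_buffer and trgt_buffer are stand-ins for the real src.zip and target ZIP paths, which could be passed to zipfile.ZipFile instead.

```python
import zipfile
from io import BytesIO
from xml.etree.ElementTree import Element, ElementTree, parse

# stand-in for the given src.zip, built in memory for this sketch
src_buffer = BytesIO()
with zipfile.ZipFile(src_buffer, 'w') as src:
    for name, root in [('one.xml', Element('hello', {'beer': 'good'})),
                       ('two.xml', Element('scnd', {'äkey': 'a value'}))]:
        with src.open(name, 'w') as member:
            ElementTree(root).write(member, encoding='UTF-8', xml_declaration=True)

trgt_buffer = BytesIO()
with zipfile.ZipFile(src_buffer, 'r') as src, \
        zipfile.ZipFile(trgt_buffer, 'w') as trgt:
    for xml_name in src.namelist():
        # parse the member straight from the source archive
        with src.open(xml_name, 'r') as member:
            tree = parse(member)
        # modify tree
        tree.getroot().append(Element('new', {'newkey': 'new value'}))
        # stream the result into the target archive, no temporary file needed
        with trgt.open(xml_name, 'w') as out:
            tree.write(out, encoding='UTF-8', xml_declaration=True)
```

Each tree is written as soon as it has been modified, so there is no need to collect the updated trees in a list first.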

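【Comments】:

  • The second part is exactly what I was looking for. Since I can only upvote this answer once, I also upvoted some of your other answers. Thanks!
  • @Andreas I'm glad to hear the answer helped. I also appreciate the upvotes (I know the urge to say thank you when someone "makes my day"). There is nothing wrong with that if you think the other answers are useful too, but you should know that the site has bots that detect and revert serial voting (whether up or down). Either way, I'm glad these few lines of code were helpful.
【Solution 2】: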

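The only thing really missing in approach 1 is the XML declaration header. ElementTree.write(...) accepts xml_declaration for this; unfortunately, in your Python version that parameter is not yet available in ElementTree.tostring.

As of Python 3.8, ElementTree.tostring does have an xml_declaration parameter, see: https://docs.python.org/3.8/library/xml.etree.elementtree.html

Even though that implementation is not available to you on Python 3.6, you can easily copy the 3.8 implementation into your own Python file: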

import io
from xml.etree.ElementTree import ElementTree

def tostring(element, encoding=None, method=None, *,
             xml_declaration=None, default_namespace=None,
             short_empty_elements=True):
    """Generate string representation of XML element.
    All subelements are included.  If encoding is "unicode", a string
    is returned. Otherwise a bytestring is returned.
    *element* is an Element instance, *encoding* is an optional output
    encoding defaulting to US-ASCII, *method* is an optional output which can
    be one of "xml" (default), "html", "text" or "c14n", *default_namespace*
    sets the default XML namespace (for "xmlns").
    Returns an (optionally) encoded string containing the XML data.
    """
    stream = io.StringIO() if encoding == 'unicode' else io.BytesIO()
    ElementTree(element).write(stream, encoding,
                               xml_declaration=xml_declaration,
                               default_namespace=default_namespace,
                               method=method,
                               short_empty_elements=short_empty_elements)
    return stream.getvalue()

(See https://github.com/python/cpython/blob/v3.8.0/Lib/xml/etree/ElementTree.py#L1116)

In that case you can simply use approach 1:

# create byte representation of elementtree
byte_representation = tostring(element=updated_tree.getroot(), encoding='UTF-8', method='xml', xml_declaration=True)
# write XML directly to zip
trgt1_zip.writestr(zinfo_or_arcname=xml_name, data=byte_representation)
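
If the same script has to run on both older and newer interpreters, one possible variation is to try the built-in tostring first and fall back to ElementTree.write only when the keyword is rejected. This is a sketch under that assumption; the helper name to_bytes_with_declaration is hypothetical.

```python
from io import BytesIO
from xml.etree.ElementTree import Element, ElementTree, tostring


def to_bytes_with_declaration(element):
    """Hypothetical helper: UTF-8 bytes with an XML declaration."""
    try:
        # Python 3.8+: tostring accepts xml_declaration directly
        return tostring(element, encoding='UTF-8', method='xml',
                        xml_declaration=True)
    except TypeError:
        # older versions reject the keyword; go through ElementTree.write
        stream = BytesIO()
        ElementTree(element).write(stream, encoding='UTF-8',
                                   xml_declaration=True, method='xml')
        return stream.getvalue()


byte_representation = to_bytes_with_declaration(Element('hello', {'beer': 'good'}))
print(byte_representation)
```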

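【Comments】:

  • I see. I tested it and found that it does not work.
  • Odd. I took your example literally, added this code, and it worked. Are you sure you aren't accidentally still using ElementTree's own tostring? lenz uses the same principle in his reply, only his suggestion came later...
  • Oh, I thought you were suggesting monkey-patching the installed ElementTree implementation, but you're right, it is the same idea as using a BytesIO/StringIO object.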