转义字符串以在 XML 中使用答案

【问题标题】：Escaping strings for use in XML转义字符串以在 XML 中使用
【发布时间】：2010-12-05 12:37:45
【问题描述】：

我正在使用 Python 的 xml.dom.minidom 创建 XML 文档。（逻辑结构 -> XML 字符串，而不是相反。）

如何让它转义我提供的字符串，这样它们就不会弄乱 XML？

【问题讨论】：

任何 XML DOM 序列化程序都会在字符数据发出时适当地转义......这就是 DOM 操作的目的，以防止您不得不弄脏标记。

【解决方案1】：

这样的？

>>> from xml.sax.saxutils import escape
>>> escape("< & >")   
'&lt; &amp; &gt;'

【讨论】：

正是我想要的。我的大部分 XML 处理都是使用 lxml 完成的，我想知道导入（但）另一个 XML 模块是否会被污染？ lxml中是否有等价物？（似乎找不到。）
这不处理引号的转义。
>>> from xml.sax.saxutils import quoteattr >>> quoteattr('value contains "a double-quote \' and an apostrophe') '"value contains "双引号 \' 和撇号"'
这将导致现有的转义字符格式错误。例如，&& 变成 &&
回复：“这将导致现有的转义字符格式错误” - 这是错误的。现有的转义不会变成畸形，而是双重转义。这是预期且正确的行为：如果您的输入同时包含转义和未转义的此类字符，那么它要么是无效输入，要么您希望逐字显示转义的输入，如文本“在 HTML 中，& 使用 & 编码”，最后的“&amp”应该以这种形式显示给用户。这里需要双重转义。

【解决方案2】：

你的意思是你做了这样的事情：

from xml.dom.minidom import Text, Element

t = Text()
e = Element('p')

t.data = '<bar><a/><baz spam="eggs"> & blabla &entity;</>'
e.appendChild(t)

然后你会得到很好的转义 XML 字符串：

>>> e.toxml()
'<p>&lt;bar&gt;&lt;a/&gt;&lt;baz spam=&quot;eggs&quot;&gt; &amp; blabla &amp;entity;&lt;/&gt;</p>'

【讨论】：

您将如何处理文件？例如 from xml.dom.minidom import parse, parseString dom1 = parse('Test-bla.ddf') (example from docs.python.org/3/library/xml.dom.minidom.html)

【解决方案3】：

如果你不想导入另一个项目并且你已经有cgi，你可以使用这个：

>>> import cgi
>>> cgi.escape("< & >")
'&lt; &amp; &gt;'

但是请注意，这段代码的易读性会受到影响 - 您可能应该将它放在一个函数中以更好地描述您的意图：（并在您使用它时为其编写单元测试；）

def xml_escape(s):
    return cgi.escape(s) # escapes "<", ">" and "&"

【讨论】：

另外值得注意的是，这个 API 现在已经被弃用了
您可以使用 html.escape("") 代替这个已弃用的函数

【解决方案4】：

xml.sax.saxutils 不会转义引号字符 (")

所以这是另一个：

def escape( str ):
    str = str.replace("&", "&amp;")
    str = str.replace("<", "&lt;")
    str = str.replace(">", "&gt;")
    str = str.replace("\"", "&quot;")
    return str

如果你查一下，那么 xml.sax.saxutils 只会替换字符串

【讨论】：

可能还想转义单引号字符，即。 '
最好避免使用关键字str作为变量名。
你忘了str = str.replace("'", "&apos;")。
另外，str = str.replace("\"", "&quot;") 的替代品是 str = str.replace('"', "&quot;")，我认为它更具可读性，因为反斜杠 (\) 在我看来不合适。
如果你不从这里复制粘贴，你应该注意到第一个替换是和号（“&”）。如果它不是第一个语句，您将替换其他语句的＆符号...

【解决方案5】：

xml.sax.saxutils.escape 默认只转义&、< 和>，但它确实提供了一个entities 参数来额外转义其他字符串：

from xml.sax.saxutils import escape

def xmlescape(data):
    return escape(data, entities={
        "'": "&apos;",
        "\"": "&quot;"
    })

xml.sax.saxutils.escape 内部使用str.replace()，因此您也可以跳过导入并编写自己的函数，如 MichealMoser 的答案所示。

【讨论】：

【解决方案6】：

xml_special_chars = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
}

xml_special_chars_re = re.compile("({})".format("|".join(xml_special_chars)))

def escape_xml_special_chars(unescaped):
    return xml_special_chars_re.sub(lambda match: xml_special_chars[match.group(0)], unescaped)

所有的魔法都发生在re.sub()：参数repl 不仅接受字符串，还接受函数。

【讨论】：

【解决方案7】：

Andrey Vlasovskikh 接受的答案是对 OP 的最完整答案。但是这个主题出现在python escape xml 的最频繁搜索中，我想提供所讨论的三种解决方案的时间比较在本文中，同时提供了我们选择部署的第四个选项，因为它提供了增强的性能。

这四个都依赖于原生 Python 数据处理或 Python 标准库。解决方案按性能从最慢到最快的顺序提供。

选项 1 - 正则表达式

此解决方案使用 python 正则表达式库。它产生最慢的性能：

import re
table = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
}
pat = re.compile("({})".format("|".join(table)))

def xmlesc(txt):
    return pat.sub(lambda match: table[match.group(0)], txt)

>>> %timeit xmlesc('<&>"\'')
1.48 µs ± 1.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

仅供参考：µs 是微秒的符号，即百万分之一秒。另一个实现的完成时间以纳秒 (ns) 为单位，即十亿分之一秒。

选项 2 -- xml.sax.saxutils

此解决方案使用 python xml.sax.saxutils 库。

from xml.sax.saxutils import escape
def xmlesc(txt):
    return escape(txt, entities={"'": "&apos;", '"': "&quot;"})

>>> %timeit xmlesc('<&>"\'')
832 ns ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

选项 3 - str.replace

此解决方案使用字符串replace() 方法。在底层，它实现了与 python 的xml.sax.saxutils 类似的逻辑。 saxutils 代码有一个 for 循环，它会消耗一些性能，使这个版本稍微快一些。

def xmlesc(txt):
    txt = txt.replace("&", "&amp;")
    txt = txt.replace("<", "&lt;")
    txt = txt.replace(">", "&gt;")
    txt = txt.replace('"', "&quot;")
    txt = txt.replace("'", "&apos;")
    return txt

>>> %timeit xmlesc('<&>"\'')
503 ns ± 0.725 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

选项 4 - str.translate

这是最快的实现。它使用字符串translate() 方法。

table = str.maketrans({
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
})
def xmlesc(txt):
    return txt.translate(table)

>>> %timeit xmlesc('<&>"\'')
352 ns ± 0.177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

【讨论】：