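What's the easiest way to escape HTML in Python? Answers

【Question title】: What's the easiest way to escape HTML in Python?
【Posted】: 2010-11-06 21:31:03
【Question description】:

cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?

【Question discussion】:

Tags: python html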

【Solution 1】:

cgi.escape is fine. It escapes:

< to &lt;
> to &gt;
& to &amp;

That is enough for all HTML.

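EDIT: If you have non-ASCII characters that you also want to escape, for inclusion in another document that uses a different encoding, then, as Craig says, just use: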

data.encode('ascii', 'xmlcharrefreplace')

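Don't forget to decode data to unicode first, using whatever encoding it was encoded in.

However, in my experience that kind of encoding is unnecessary if you work with unicode all the time from the start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).

Example: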

>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
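
The example above is Python 2. On Python 3, where cgi.escape is no longer available (it was removed in Python 3.8), roughly the same result can be had with html.escape from the standard library. A minimal sketch; note that encode() returns a bytes object here, not a str:

```python
import html

text = '<a>bá</a>'

# Escape the markup characters first, then turn every non-ASCII
# character into a numeric character reference.
escaped = html.escape(text).encode('ascii', 'xmlcharrefreplace')

print(escaped)  # b'&lt;a&gt;b&#225;&lt;/a&gt;'
```

If the page is served as UTF-8 anyway, the encode step with xmlcharrefreplace is optional, as the answer notes.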

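Also worth noting (thanks Greg) is the extra quote parameter that cgi.escape takes. With it set to True, cgi.escape also escapes double quote characters (") so you can use the resulting value in an XML/HTML attribute.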

EDIT: Note that cgi.escape has been deprecated in Python 3.2 in favor of html.escape, which does the same except that quote defaults to True.
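
To make the difference in the quote default concrete, here is a small sketch using html.escape (Python 3.2+). The sample string is made up for illustration:

```python
import html

value = 'say "hi" & \'bye\''

# quote=True is the default: both double and single quotes are escaped.
with_quotes = html.escape(value)

# quote=False escapes only &, < and >, like cgi.escape did by default.
without_quotes = html.escape(value, quote=False)

print(with_quotes)     # say &quot;hi&quot; &amp; &#x27;bye&#x27;
print(without_quotes)  # say "hi" &amp; 'bye'
```

Only the first form is suitable for text that ends up inside a quoted attribute value.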

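【Discussion】:

When using the text in HTML attribute values, you should also consider passing the boolean quote argument to cgi.escape so that quotes are escaped.
Just to be sure: if I run all untrusted data through the cgi.escape function, is that enough to protect against all (known) XSS attacks?
@Tomas Sedovic: It depends on where you put the text after running cgi.escape on it. If it is placed in the root HTML context then yes, you are completely safe.
What about input like {{Measures 12 Ω"H x 17 5/8"W x 8 7/8"D. Imported.}}? That is not ascii, so encode() will throw an exception at you.
@Andrew Kolesnikov: Have you tried it? cgi.escape(yourunicodeobj).encode('ascii', 'xmlcharrefreplace') == '{{Measures 12 &#937;"H x 17 5/8"W x 8 7/8"D. Imported.}}' -- as you can see, the expression returns an ascii bytestring, with all non-ascii unicode characters encoded as XML character references.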

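【Solution 2】:

cgi.escape should be good for escaping HTML in the limited sense of escaping the HTML tags and character entities.

But you might also have to consider encoding issues: if the HTML you want to quote contains non-ASCII characters in a particular encoding, then you also have to take care to represent those characters sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should make sure that the correct encoding conversion is done between the "source" HTML and the page it is embedded in, to avoid corrupting the non-ASCII characters.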
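
As a rough illustration of the encoding conversion described above, the following Python 3 sketch decodes the source bytes, escapes the text, and encodes once for the target page. The helper name quote_foreign_html and the choice of latin-1 as source and UTF-8 as target are assumptions made for the example, not something the answer prescribes:

```python
import html

def quote_foreign_html(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: escape HTML that arrives as bytes in a known encoding."""
    text = raw.decode(source_encoding)  # bytes -> str, using the source encoding
    escaped = html.escape(text)         # escape &, <, >, " and '
    return escaped.encode('utf-8')      # encode once, for a UTF-8 target page

latin1_bytes = '<b>café</b>'.encode('latin-1')
result = quote_foreign_html(latin1_bytes, 'latin-1')

print(result.decode('utf-8'))  # &lt;b&gt;café&lt;/b&gt;
```

Decoding with the wrong source encoding would either raise UnicodeDecodeError or silently produce the wrong characters, so the source encoding has to be known rather than guessed.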

【Discussion】:

【Solution 3】:

In Python 3.2 a new html module was introduced, which is used for escaping reserved characters from HTML markup.

It provides the function escape():

>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;'

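【Discussion】:

What about quote=True?
@SalmanAbbas Are you afraid that quotes aren't escaped? Note that html.escape() does escape quotes by default (in contrast, cgi.escape() doesn't, and only escapes double quotes if told to). Thus, I have to explicitly set an optional parameter in order to inject something into an attribute with html.escape(), i.e. to make it unsafe for attributes: t = '" onclick="alert()'; t = html.escape(t, quote=False); s = f'<a href="about.html" class="{t}">foo</a>'
@maxschlepzig I think Salman is saying that escape() is not enough to make attributes safe. In other words, this is not safe: <a href=" {{ html.escape(untrusted_text) }} ">
@pianoJames, I see. I consider checking link values to be domain-specific semantic validation, not something lexical like escaping. Besides inline JavaScript, you really don't want to create links from untrusted user input without further URL-specific validation (e.g. because of spammers). A simple way to protect against inline JavaScript in attributes like href is to set a Content Security Policy that disallows it.
@pianoJames It is safe, because html.escape does escape single quotes and double quotes.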
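
The disagreement above comes down to this: escaping keeps the value inside the attribute, but it does not stop a javascript: URL from being a valid href. A minimal sketch of the kind of URL-specific validation mentioned in the comments follows. The helper safe_link and the allow-list of schemes are assumptions for illustration, and this is not a complete XSS defense:

```python
import html
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {'http', 'https'}  # assumption: only absolute web links are wanted

def safe_link(untrusted_url: str, label: str) -> str:
    """Hypothetical helper: build an <a> tag only for http(s) URLs."""
    url = untrusted_url.strip()
    scheme = urlsplit(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        # Rejects javascript:, data:, and also relative URLs (empty scheme).
        raise ValueError('URL scheme not allowed: %r' % scheme)
    # Escaping is still needed so the value cannot break out of the attribute.
    return '<a href="%s">%s</a>' % (html.escape(url), html.escape(label))

link = safe_link('https://example.com/?a=1&b=2', 'R&D')
print(link)  # <a href="https://example.com/?a=1&amp;b=2">R&amp;D</a>
```

Calling safe_link('javascript:alert(1)', 'x') raises ValueError here, whereas html.escape alone would pass that string through unchanged.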

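【Solution 4】:

If you wish to escape HTML in a URL:

This is probably NOT what the OP wanted (the question doesn't clearly indicate in which context the escaping is meant to be used), but Python's native library urllib has a method to escape characters that need to be included safely in a URL.

The following is an example: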

#!/usr/bin/python
from urllib import quote

x = '+<>^&'
print quote(x) # prints '%2B%3C%3E%5E%26'

Find docs here

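【Discussion】:

This is the wrong kind of escaping; we're looking for HTML escapes, as opposed to URL encoding.
Nonetheless, it's what I was actually looking for ;-)
In Python 3, this has been moved to urllib.parse.quote. docs.python.org/3/library/urllib.parse.html#url-quoting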
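
For reference, the Python 3 form of the answer's example looks like this (a direct port, using the same sample string):

```python
from urllib.parse import quote

x = '+<>^&'

# Percent-encode everything except letters, digits, "_.-~" and "/".
encoded = quote(x)

print(encoded)  # %2B%3C%3E%5E%26
```

As the first comment points out, this is URL encoding, which solves a different problem from HTML escaping.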

【Solution 5】:

`cgi.escape` extended

This version improves on cgi.escape. It also preserves whitespace and newlines, and it returns a unicode string.

import cgi  # Python 2 only; cgi.escape was removed in Python 3.8

def escape_html(text):
    """Escape strings for display in HTML."""
    return cgi.escape(text, quote=True).\
           replace(u'\n', u'<br />').\
           replace(u'\t', u'&emsp;').\
           replace(u'  ', u' &nbsp;')

For example:

>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'
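
The function above depends on cgi.escape, which is not available on current Python 3 releases. A possible Python 3 port, assuming html.escape is an acceptable substitute (it additionally escapes single quotes as &#x27;):

```python
import html

def escape_html(text: str) -> str:
    """Escape a string for display in HTML, keeping line breaks and tabs visible."""
    return (html.escape(text, quote=True)
            .replace('\n', '<br />')    # newline -> explicit line break
            .replace('\t', '&emsp;')    # tab -> em space
            .replace('  ', ' &nbsp;'))  # double space -> space + non-breaking space

rendered = escape_html('<foo>\nfoo\t"bar"')
print(rendered)  # &lt;foo&gt;<br />foo&emsp;&quot;bar&quot;
```

The replacements run after escaping, so the inserted <br /> tags are not themselves escaped.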

【Discussion】:

【Solution 6】:

For legacy code in Python 2.7, you can do it via BeautifulSoup4:

>>> from bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'

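【Discussion】:

【Solution 7】:

Not the easiest way, but still straightforward. The main difference from cgi.escape is that it still works properly if your text already contains &amp;. As you can see from the comments in it:

cgi.escape version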

def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.'''
    s = s.replace("&", "&amp;") # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

Regex version

import re

# One of & < > " ' that is not already the start of an escaped entity.
QUOTE_PATTERN = r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)"""

def escape(word):
    """
    Replaces special characters <>&"' to HTML-safe sequences.
    With attention to already escaped characters.
    """
    replace_with = {
        '<': '&lt;',
        '>': '&gt;',
        '&': '&amp;',
        '"': '&quot;', # should be escaped in attributes
        "'": '&#39;'   # should be escaped in attributes
    }
    quote_pattern = re.compile(QUOTE_PATTERN)
    return re.sub(quote_pattern, lambda x: replace_with[x.group(0)], word)
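
To show what "attention to already escaped characters" buys you, here is a compact variant of the same lookahead idea compared with html.escape. The names BARE_AMP and escape_once are made up for this sketch, and it targets Python 3:

```python
import html
import re

# Hypothetical pattern: an ampersand that does not already start one of
# the five entities produced by the escaping step below.
BARE_AMP = re.compile(r'&(?!(?:amp|lt|gt|quot|#39);)')

def escape_once(text: str) -> str:
    """Escape text, leaving already-escaped entities untouched."""
    text = BARE_AMP.sub('&amp;', text)
    return (text.replace('<', '&lt;').replace('>', '&gt;')
                .replace('"', '&quot;').replace("'", '&#39;'))

sample = 'R&amp;D < R&D'
print(html.escape(sample))  # R&amp;amp;D &lt; R&amp;D
print(escape_once(sample))  # R&amp;D &lt; R&amp;D
```

The trade-off is that this approach cannot tell escaped input from a user who literally typed "&amp;" and wants those five characters displayed, so it is only appropriate when the input may legitimately be pre-escaped.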

【Discussion】:

【Solution 8】:

There is also the excellent markupsafe package.

>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')

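The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:

the return value (Markup) is an instance of a class derived from unicode (i.e. isinstance(escape('str'), unicode) == True)
it properly handles unicode input
it works in Python (2.6, 2.7, 3.3, and pypy)
it respects custom methods of objects (i.e. objects with a __html__ property) and template overloads (__html_format__).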

【Discussion】:

【Solution 9】:

No libraries, pure Python, safely escapes text into HTML text:

text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
        ).replace('\'','&#39;').replace('"','&#34;').encode('ascii', 'xmlcharrefreplace')

【Discussion】:

Your ordering is wrong: the &lt; will be escaped to &amp;lt;
@jason s Thanks for the fix!
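
The ordering problem raised in the comment is easy to reproduce. In this small sketch (both function names are hypothetical), replacing "<" before "&" escapes the freshly inserted ampersand a second time:

```python
def escape_wrong_order(text: str) -> str:
    # "<" is replaced first, so the "&" it introduces gets escaped again.
    return text.replace('<', '&lt;').replace('&', '&amp;')

def escape_right_order(text: str) -> str:
    # "&" is replaced first; later replacements only add finished entities.
    return text.replace('&', '&amp;').replace('<', '&lt;')

print(escape_wrong_order('a < b'))  # a &amp;lt; b
print(escape_right_order('a < b'))  # a &lt; b
```

This is why the answer's one-liner, like cgi.escape itself, handles "&" before any other character.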
