【Question title】: What is the recommended way to escape HTML symbols in plain Java?
【Posted】: 2010-11-18 21:55:46
【Question】:

Is there a recommended way to escape the <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is.)

String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("<", "&lt;").replace("&", "&amp;"); // ...

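【Comments】:

Be aware that if you are outputting into an unquoted HTML attribute, other characters (such as space, tab, backspace, etc.) can allow attackers to introduce JavaScript attributes without any of the listed characters. See the OWASP XSS Prevention Cheat Sheet for more.
By the way, in this code you should escape the ampersand before the less-than sign, otherwise the & of the freshly inserted &lt; is itself replaced and you end up with &amp;lt;: source.replace("&", "&amp;").replace("<", "&lt;");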
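
To make the ordering point concrete, here is a minimal sketch of the manual approach with the ampersand handled first. The class and method names are made up for illustration, and it only targets element content and double-quoted attribute values, not the other contexts discussed further down.

```java
final class ManualHtmlEscape {
    /** Hypothetical helper: escapes &, <, > and " by plain string replacement. */
    static String escape(String source) {
        return source
                .replace("&", "&amp;")   // must run first, or later entities get re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String source = "The less than sign (<) and ampersand (&) must be escaped";
        System.out.println(escape(source));
    }
}
```

With the order reversed, as in the question, a single < would end up as &amp;lt; in the output.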

Tags: java html escaping

【Answer 1】:

StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

For version 3:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

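【Comments】:

While StringEscapeUtils is nice, it will not escape whitespace properly for attributes if you wish to avoid HTML/XML whitespace normalization. See my answer for more detail.
The above example is broken. Use the escapeHtml4() method now.
Guava fans, see okranz's answer below.
If the web page uses UTF-8 encoding, then all we need is Guava's htmlEscaper, which escapes only the following five ASCII characters: '"&<>. Apache's escapeHtml() also replaces non-ASCII characters, including accented ones, which seems unnecessary for UTF-8 web pages?
This is now deprecated in commons-lang3. It has moved to commons.apache.org/proper/commons-text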

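【Answer 2】:

An alternative to Apache Commons: use Spring's HtmlUtils.htmlEscape(String input) method.

【Comments】:

Thanks. I used it (instead of StringEscapeUtils.escapeHtml() from apache-commons 2.6) because it leaves Russian characters as they are.
Good to know. TBH I give Apache stuff a wide berth these days.
I have used it too; it leaves Chinese characters as they are as well.
And it also encodes the apostrophe, so it is actually useful, unlike Apache's StringEscapeUtils.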

【Answer 3】:

A nice short method:

public static String escapeHTML(String s) {
    StringBuilder out = new StringBuilder(Math.max(16, s.length()));
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
            out.append("&#");
            out.append((int) c);
            out.append(';');
        } else {
            out.append(c);
        }
    }
    return out.toString();
}

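Based on https://stackoverflow.com/a/8838023/1199155 (the ampersand is missing there). Of the characters checked in the if clause, ", <, > and & are the only ones below 128 listed in http://www.w3.org/TR/html4/sgml/entities.html

【Comments】:

Nice. It does not use the "HTML versions" of the encodings (for example, "á" comes out as "&#225;" rather than "&aacute;"), but since the numeric ones work even in IE7, I guess I don't have to worry. Thanks.
Why encode all those characters when the OP asked to escape the 4 relevant ones? You are wasting CPU and memory.
You forgot the apostrophe. So people can inject unquoted attributes everywhere this code is used to escape attribute values.
This does not work when the string contains surrogate pairs, e.g. emojis.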

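【Answer 4】:

There is a newer version of the Apache Commons Lang library, and it uses a different package name (org.apache.commons.lang3). StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So to escape an HTML 4.0 string: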

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

【Comments】:

Unfortunately, nothing exists for HTML 5, and the Apache docs do not specify whether escapeHtml4 is appropriate for HTML 5.

【Answer 5】:

For those who use Google Guava:

import com.google.common.html.HtmlEscapers;
[...]
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);

【Comments】:

【Answer 6】:

On Android (API 16 or greater) you can:

Html.escapeHtml(textToEscape);

or, for lower API levels:

TextUtils.htmlEncode(textToEscape);

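【Comments】:

Also see my question on the difference between the two. (@Muz)

【Answer 7】:

Be careful with this. There are a number of different "contexts" within an HTML document: inside an element, quoted attribute value, unquoted attribute value, URL attribute, JavaScript, CSS, etc. You need to use a different encoding method for each of these to prevent cross-site scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library: https://github.com/ESAPI/esapi-java-legacy.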
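
As a rough illustration of why one encoder is not enough, the sketch below builds a link in which the same user input passes through two contexts: a URL query parameter (percent-encoding) nested inside a double-quoted attribute (HTML encoding). It uses only the JDK, not ESAPI; the class, the helper names and the /search path are invented for the example, and URLEncoder.encode(String, Charset) assumes Java 10 or later. Unquoted attributes, JavaScript and CSS are not covered.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class LinkContextSketch {
    /** Hypothetical helper for element content and double-quoted attribute values only. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds a link whose query value and label both come from untrusted input. */
    static String searchLink(String query, String label) {
        // Context 1: URL query parameter, so percent-encode the value
        String href = "/search?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
        // Context 2: the finished URL sits in a double-quoted attribute, so HTML-encode it;
        // the label is element content and gets HTML-encoded only
        return "<a href=\"" + escapeHtml(href) + "\">" + escapeHtml(label) + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(searchLink("a&b=c", "a&b"));
    }
}
```

HTML-escaping the query value alone would leave & and = meaningful to the URL parser, which is the kind of context mix-up the cheat sheet warns about.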

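【Comments】:

Thanks for pointing out that the context in which you wish to encode the output is very important. "Encode" is also a much more suitable verb than "escape": escape implies some sort of special hack, as opposed to "how do I encode this string for an XHTML attribute / SQL query parameter / PostScript print string / CSV output field?"
"Encode" and "escape" are both widely used to describe this. "Escape" is generally used when the process is to add an "escape character" before a syntactically relevant character, such as escaping a quote character with a backslash: \". "Encode" is more typically used when you translate a character into a different form, such as URL-encoding the quote character as %22, or HTML entity encoding it as &#x22; or &quot;.
owasp-esapi-java.googlecode.com/svn/trunk_doc/latest/index.html: this link is now broken.
To save you some googling, look for the Encoder class: static.javadoc.io/org.owasp.esapi/esapi/2.0.1/org/owasp/esapi/…

【Answer 8】:

For some purposes, HtmlUtils: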

import org.springframework.web.util.HtmlUtils;
[...]
HtmlUtils.htmlEscapeDecimal("&"); //gives &#38;
HtmlUtils.htmlEscape("&"); //gives &amp;

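【Comments】:

From the Spring HtmlUtils comments:
"For a comprehensive set of String escaping utilities, consider Apache Commons Lang and its StringEscapeUtils class. We are not using that class here to avoid a runtime dependency on Commons Lang just for HTML escaping. Furthermore, Spring's HTML escaping is more flexible and 100% HTML 4.0 compliant." If you already use Apache Commons in your project, then you should use StringEscapeUtils from Apache.

【Answer 9】:

org.apache.commons.lang3.StringEscapeUtils is now deprecated. You must now use org.apache.commons.text.StringEscapeUtils, via this dependency: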

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>${commons.text.version}</version>
    </dependency>

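【Comments】:

【Answer 10】:

While @dfa's answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice and I have used it in the past, it should not be used for escaping HTML (or XML) attributes, otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).

I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop-in (copy 'n' paste) class (of which I stole some from JDOM) that differentiates the escaping of attributes and element content.

While this may not have mattered as much in the past (proper attribute escaping), it is increasingly of greater interest given the use of HTML5's data- attributes.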
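
For readers who just want the idea, here is a minimal sketch of attribute-specific escaping. It is not the JATL/JDOM class mentioned above, and the names are made up. It assumes the value goes into a double-quoted attribute, and it relies on the XML rule that a literal tab, newline or carriage return in an attribute value is normalized to a space while the same character written as a numeric character reference is kept.

```java
final class AttributeEscaper {
    /** Hypothetical helper: escapes a value for a double-quoted attribute, keeping whitespace intact. */
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '"':  out.append("&quot;"); break;
                // Written as character references so a parser does not fold them into spaces
                case '\t': out.append("&#9;");   break;
                case '\n': out.append("&#10;");  break;
                case '\r': out.append("&#13;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAttribute("line one\nline \"two\"\t& more"));
    }
}
```

Element content would use a separate method that leaves tabs and newlines alone, which is the differentiation the answer describes.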

【Comments】:

【Answer 11】:

Java 8+ solution:

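// requires: import java.util.stream.Collectors;
public static String escapeHTML(String str) {
    // Encode non-ASCII chars and the five HTML-significant ones as numeric references
    return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
       "&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}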

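String#chars returns an IntStream of the char values from the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters) as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.

To better handle Unicode characters, String#codePoints can be used instead.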

public static String escapeHTML(String str) {
    return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
            "&#" + c + ";" : new String(Character.toChars(c)))
       .collect(Collectors.joining());
}
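
To see what the code point variant changes, here is a small comparison. The class and method names are made up, and both methods are trimmed to the non-ASCII branch only. The input is a single emoji, U+1F600, which Java stores as a surrogate pair.

```java
import java.util.stream.Collectors;

final class SurrogateDemo {
    /** Per-char variant: each half of a surrogate pair is encoded separately. */
    static String viaChars(String s) {
        return s.chars()
                .mapToObj(c -> c > 127 ? "&#" + c + ";" : String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    /** Per-code-point variant: a surrogate pair becomes one numeric reference. */
    static String viaCodePoints(String s) {
        return s.codePoints()
                .mapToObj(c -> c > 127 ? "&#" + c + ";" : new String(Character.toChars(c)))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        System.out.println(viaChars(smiley));      // prints &#55357;&#56832;
        System.out.println(viaCodePoints(smiley)); // prints &#128512;
    }
}
```

References to lone surrogates such as &#55357; are not valid character references in HTML, so only the second form displays the emoji.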

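【Comments】:

【Answer 12】:

Most libraries offer escaping of everything they can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.

Also, as Jeff Williams noted, there is no single "escape HTML" option; there are several contexts.

Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I have written my own version: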

private static final long TEXT_ESCAPE =
        1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
        DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;

// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /*  [0,   5,   10,  15, 19) */
        5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]

private static void appendEscaped(
        Appendable builder, CharSequence content, long escapes) {
    try {
        int startIdx = 0, len = content.length();
        for (int i = 0; i < len; i++) {
            char c = content.charAt(i);
            long one;
            if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
            // -^^^^^^^^^^^^^^^   -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            // |                  | take only dangerous characters
            // | Java uses only the 6 least significant bits of the shift
            // | distance for longs, e.g. << 0b110111111 is the same as
            // | << 0b111111. Filter out bigger characters

                int index = Long.bitCount(ESCAPES & (one - 1));
                builder.append(content, startIdx, i /* exclusive */).append(
                        REPLACEMENTS,
                        REPL_SLICES >>> (5 * index) & 31,
                        REPL_SLICES >>> (5 * (index + 1)) & 31
                );
                startIdx = i + 1;
            }
        }
        builder.append(content, startIdx, len);
    } catch (IOException e) {
        // typically, our Appendable is StringBuilder which does not throw;
        // also, there's no way to declare 'if A#append() throws E,
        // then appendEscaped() throws E, too'
        throw new UncheckedIOException(e);
    }
}

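Consider copy-pasting from the Gist without line length limit.

UPD: As another answer suggests, escaping > is not necessary; also, " within attr='…' is allowed as well. I have updated the code accordingly.

You can check it out yourself: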

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>

<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>

</body>
</html>
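
If the index computation in appendEscaped looks cryptic, the standalone sketch below isolates the trick: the set of escapable characters is a 64-bit mask, and the position of a character's replacement is the number of mask bits below that character's bit. For readability it uses a plain String[] instead of the packed REPLACEMENTS/REPL_SLICES pair; the class and method names are made up.

```java
final class BitRankDemo {
    // One bit per escapable character; all four have code points below 64
    static final long ESCAPES = 1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<';

    // Replacements ordered by ascending character code: " (34), & (38), ' (39), < (60)
    static final String[] REPLACEMENTS = {"&#34;", "&amp;", "&#39;", "&lt;"};

    /** Returns the replacement for c, or null if c needs no escaping. */
    static String replacementFor(char c) {
        if (c >= 64 || (ESCAPES & (1L << c)) == 0) {
            return null; // out of mask range, or bit not set
        }
        // Rank of c among the escapable characters = set bits strictly below c's bit
        int rank = Long.bitCount(ESCAPES & ((1L << c) - 1));
        return REPLACEMENTS[rank];
    }

    public static void main(String[] args) {
        for (char c : "\"&'<a>".toCharArray()) {
            System.out.println(c + " -> " + replacementFor(c));
        }
    }
}
```

The explicit c >= 64 check plays the role of the (c & 63) == c test in the original, since shifting a long by 64 or more wraps around.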

【Comments】: