How to convert a string with HTML encoding to Unicode in Java

【Question title】: How to convert string with html encoding to Unicode in java
【Posted】: 2019-06-18 13:44:22
【Question】:

I have a problem with HTML encoding. I have an HTML-encoded string like this:

&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.

I want to convert this string to Unicode. The output (the actual value) should be

Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

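I tried the solution suggested here, but it only helps when every character reference in the string starts with &#. According to this page, the references of the form &xxxx; are named HTML entities, and my input string is a mix of named HTML entities and decimal HTML entities.

Can anyone give me a suggestion? It would be best if it can be solved without any additional Java library.

Thanks in advance!

[UPDATE] I solved my problem by using the Apache library: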

String encodeString = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
String unEncodeString = StringEscapeUtils.unescapeHtml4(encodeString);
System.out.println("OUTPUT : " + unEncodeString);

=====> OUTPUT : Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.

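【Question comments】:

Possible duplicate of Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?
Thanks @AnubianNoob, I solved my problem with your suggestion, but I would also like to solve it using only Java's standard library. With the suggestion in stackoverflow.com/questions/20799512/… I can only convert references with the "&#" prefix. Can you help? Thanks a lot!

Tags: java unicode encoding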

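【Solution 1】:

Use Apache Commons StringEscapeUtils.unescapeHtml(string) for this (in Commons Lang 3 and Commons Text the method is named unescapeHtml4).

Reference: Java: How to unescape HTML character entities in Java?

【Comments】:

【Solution 2】: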

maven:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>    

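import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * https://stackoverflow.com/a/6766497/8356718
 * Encodes every code point of the text as a decimal character reference.
 */
public static String toDecimal(String text) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
        int codePoint = text.codePointAt(i);
        sb.append("&#").append(codePoint).append(';');
        // Advance by two chars for a surrogate pair, otherwise by one
        i += Character.charCount(codePoint);
    }
    return sb.toString();
}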

public static Document getNoPrettyDoc(String html) {
    Document doc = Jsoup.parse(html);
    doc.outputSettings().prettyPrint(false);
    return doc;
}

public static String toDecimalHtml(String html) {
    Document doc = getNoPrettyDoc(html);
    toDecimalHtml(doc);
    return doc.body().html().trim().replace("&amp;", "&");
}

private static void toDecimalHtml(Node node) {
    for (int i = 0; i < node.childNodes().size(); ) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#text")) {
            TextNode text = (TextNode) child;
            String str = text.getWholeText();
            text.text(toDecimal(str));
            if (child.childNodes().size() <= 0) {
                i++;
            }
        } else {
            if (child.childNodes().size() > 0) {
                toDecimalHtml(child);
            }
            i++;
        }
    }
}

You may need to strip \n, \r and \t first.
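
Note that toDecimal above goes in the encoding direction (text to decimal character references), which is the reverse of what the question asks. As a rough illustration of that per-code-point idea without jsoup, here is a minimal sketch using only the standard library. The class name DecimalEntityEncoder is made up for this example, and it assumes Java 8 or later for String.codePoints().

```java
final class DecimalEntityEncoder {
    // Encodes every code point as a decimal character reference such as &#7897;
    static String toDecimal(String text) {
        StringBuilder sb = new StringBuilder();
        // codePoints() already merges surrogate pairs into single code points
        text.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three characters are U+00D0, U+1ED9 and the letter t
        System.out.println(toDecimal("\u00D0\u1ED9t"));
    }
}
```

For the first three characters of the question's sentence this gives &#208;&#7897;&#116;, so every character is written numerically, including plain ASCII letters.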

【Comments】:

【Solution 3】:

You may want to try this approach for encoding and decoding.

For encoding

URLEncoder.encode("<#> Test", "UTF-8").replace("+", "%20");

For decoding

URLDecoder.decode("%3C%23%3E%20Test", "UTF-8");
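
One caveat worth checking before relying on this: URLEncoder and URLDecoder implement percent-encoding for URLs, not HTML entities. The sketch below, with a made-up class name UrlCodecDemo and assuming Java 10 or later for the Charset overloads, round-trips the example string and shows that an entity such as &ecirc; passes through URLDecoder unchanged.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class UrlCodecDemo {
    // Percent-encodes a string; URLEncoder writes spaces as '+', so swap them for %20
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Reverses percent-encoding; it knows nothing about &name; or &#NNNN; references
    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("<#> Test"));
        System.out.println(decode("%3C%23%3E%20Test"));
        System.out.println(decode("nhi&ecirc;n"));
    }
}
```

So this approach does not decode the string from the question.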

【Comments】:

【Solution 4】:

In Java, a Unicode escape in a string literal is written as \u followed by four hexadecimal digits.

For example:

System.out.println("\u0042");
System.out.println("\u00AF\\_(\u30C4)_/\u00AF");

Prints:

B
¯\_(ツ)_/¯

What you want is:

System.out.println("\u00D0\u1ED9t nhi\u00EAn, \u1EDF g\u1ED1c T\u00E2y B\u1EAFc v\u0103ng v\u1EB3ng c\u00F3 ti\u1EBFng v\u00F3 ng\u1EF1a d\u1ED3n d\u1EADp.\n");

Prints:

Ðột nhiên, ở gốc Tây Bắc văng vẳng có tiếng vó ngựa dồn dập.
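
The decimal numbers in the question's &#NNNN; references are the same code points that the escapes above spell in hexadecimal. A small sketch of that relationship, using a hypothetical helper toJavaEscape (not part of any library):

```java
final class UnicodeEscapeDemo {
    // Renders a code point as Java escape text: backslash, 'u', four hex digits
    // per UTF-16 code unit (two units for code points above 0xFFFF)
    static String toJavaEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // Turns a decimal code point, as found in &#7897;, into the actual character
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(toJavaEscape(7897));
        System.out.println(fromCodePoint(7897).equals("\u1ED9"));
    }
}
```

Here 7897 in decimal is 1ED9 in hexadecimal, which is why &#7897; in the input corresponds to \u1ED9 in the literal.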

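Edit: Apache Commons is the best approach:

StringEscapeUtils.unescapeHtml4();

【Comments】:

Thanks for your answer, but what I mean is how to convert the string “Ðột” into the string “Đột”. I have an existing input and I want to get the output above. Could you please help a bit more?
Is there any way to do it without the Apache library? I would like to fix it without an additional library...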
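
For the library-free route asked about in the comments, one possible approach is a regular expression plus a lookup table for named entities. The following is only a minimal sketch under stated assumptions: the names HtmlEntityDecoder, decode and NAMED are invented for this example, and the table holds just the handful of named entities needed for the question's input, whereas the full HTML list is far larger and would have to be filled in for general use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlEntityDecoder {
    // Matches hex (&#x1ED9;), decimal (&#7897;) and named (&ecirc;) references
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    // Assumption: deliberately tiny table, extend it with the entities you need
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("ETH", "\u00D0");
        NAMED.put("ecirc", "\u00EA");
        NAMED.put("acirc", "\u00E2");
        NAMED.put("oacute", "\u00F3");
    }

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());        // text before the reference
            out.append(resolve(m.group(1), m.group())); // decoded reference
            last = m.end();
        }
        out.append(input, last, input.length());       // trailing text
        return out.toString();
    }

    // Returns the decoded text, or the original reference if it cannot be resolved
    private static String resolve(String body, String original) {
        if (body.charAt(0) != '#') {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int cp = hex ? Integer.parseInt(body.substring(2), 16)
                         : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : original;
        } catch (NumberFormatException e) {
            return original; // number does not fit in an int
        }
    }

    public static void main(String[] args) {
        String encoded = "&ETH;&#7897;t nhi&ecirc;n, &#7903; g&#7889;c T&acirc;y B&#7855;c "
                + "v&#259;ng v&#7859;ng c&oacute; ti&#7871;ng v&oacute; ng&#7921;a d&#7891;n d&#7853;p.";
        System.out.println(decode(encoded));
    }
}
```

Unknown names and out-of-range numbers are left as they were rather than dropped, which makes gaps in the table easy to spot in the output. Whether the decoded sentence displays correctly also depends on the console using a Unicode-capable encoding.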