How to convert characters to HTML entities using plain JavaScript: answers

【Question title】: How to convert characters to HTML entities using plain JavaScript
【Posted】: 2010-11-24 03:42:14
【Question description】:

I have the following:

var text = "Übergroße Äpfel mit Würmern";

I am looking for a JavaScript function that transforms the text so that every special letter is represented by its HTML entity sequence, like this:

var newText = magicFunction(text);
...
newText = "&Uuml;bergro&szlig;e &Auml;pfel mit W&uuml;rmern";

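The function should escape not only the letters in this example but also all of these.

How would you achieve this? Is there an existing function? (Plain, because a solution without a framework is preferred.)

By the way: yes, I have seen this question, but it does not address my need.

【Question comments】:

I need it for another component that requires this format.

Tags: javascript escaping html-entities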

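【Solution 1】:

The he library is the only 100% reliable solution that I know of!

he is written by Mathias Bynens, one of the world's most renowned JavaScript gurus, and has the following features:

Support for all standardized named character references
Support for Unicode
Works well with ambiguous ampersands

Example of use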

he.encode('foo © bar ≠ baz 𝌆 qux');
// Output: 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

he.decode('foo &copy; bar &ne; baz &#x1D306; qux');
// Output: 'foo © bar ≠ baz 𝌆 qux'

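【Comments】:

The question asks for a native solution without a library.

【Solution 2】:

All the other solutions suggested here, as well as most other JavaScript libraries that do HTML entity encoding/decoding, make several mistakes:

They do not implement the full list of named character references that browsers support. For example, htmlDecode('&PrecedesSlantEqual;') should return '≼' (i.e. '\u227C').
They do not support encoding astral symbols correctly. For example, htmlEncode('𝌆') should return something like &#x1D306; or &#119558;. If an implementation instead returns two separate entities (e.g. &#xD834;&#xDF06; or &#55348;&#57094;), it is broken.
They do not support decoding astral symbols correctly. htmlDecode('&#x1D306;') should return '𝌆', not '팆' (i.e. '\uD306').
They do not implement the character reference overrides table listed in the HTML Standard. For example, htmlDecode('&#x80;') should return '€' (i.e. '\u20AC').
They do not decode in a single pass. For example, htmlDecode('&#x26;amp;') should return '&amp;', not '&'.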
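
As a rough illustration of the astral-symbol, override-table and single-pass points above, here is a minimal sketch of a decoder for numeric character references only. The names OVERRIDES and decodeNumericRefs are hypothetical, the table is a three-entry subset of the overrides in the HTML Standard, named references are not handled at all, and an engine with String.fromCodePoint (ES2015 or later) is assumed. It is not how he is implemented.

```javascript
// Hypothetical subset of the HTML Standard's override table (illustration only).
var OVERRIDES = { 0x80: 0x20AC, 0x82: 0x201A, 0x99: 0x2122 };

// Hypothetical helper: decodes &#NNN; and &#xHHH; references in one pass.
function decodeNumericRefs(str) {
  return str.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, function (match, hex, dec) {
    var cp = hex ? parseInt(hex, 16) : parseInt(dec, 10);
    // Apply the override first, e.g. 0x80 becomes U+20AC.
    if (Object.prototype.hasOwnProperty.call(OVERRIDES, cp)) {
      cp = OVERRIDES[cp];
    }
    // NUL, surrogates and out-of-range values become U+FFFD.
    if (cp === 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return '\uFFFD';
    }
    // fromCodePoint builds the surrogate pair for code points above U+FFFF.
    return String.fromCodePoint(cp);
  });
}

console.log(decodeNumericRefs('&#x1D306;') === '\uD834\uDF06'); // astral symbol
console.log(decodeNumericRefs('&#x80;'));                       // euro sign
console.log(decodeNumericRefs('&#x26;amp;'));                   // &amp;
```

Because replace does not rescan its own output, '&#x26;amp;' comes out as '&amp;' rather than '&'.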

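For a robust solution that avoids all of these problems, use a library I wrote for this purpose called he. From its README:

he (for "HTML entities") is a robust HTML entity encoder/decoder written in JavaScript. It supports all standardized named character references as per HTML, handles ambiguous ampersands and other edge cases just like a browser would, has an extensive test suite, and, contrary to many other JavaScript solutions, he handles astral Unicode symbols just fine. An online demo is available.

【Comments】:

Items (1), (3), (4) and (5) are about decoding, not encoding, and miss the point of the question. The he library is great anyway; it just isn't really needed for this question. See my solution for a short standalone JavaScript implementation (encoding only).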

【Solution 3】:

I recommend the JS library entities. It is very simple to use. See the example from the docs:

const entities = require("entities");
//encoding
entities.escape("&#38;"); // "&#x26;#38;"
entities.encodeXML("&#38;"); // "&amp;#38;"
entities.encodeHTML("&#38;"); // "&amp;&num;38&semi;"
//decoding
entities.decodeXML("asdf &amp; &#xFF; &#xFC; &apos;"); // "asdf & ÿ ü '"
entities.decodeHTML("asdf &amp; &yuml; &uuml; &apos;"); // "asdf & ÿ ü '"

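【Comments】:

【Solution 4】:

Demo on JSFiddle

Here is a small standalone method that:

tries to consolidate the answers on this page, without using a library
works in legacy browsers
supports surrogate pairs (like emoji)
applies character overrides (what's that? not sure)

I don't know much about Unicode, but it seems to work well.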

// escape a string for display in html
// see also: 
// polyfill for String.prototype.codePointAt
//   https://raw.githubusercontent.com/mathiasbynens/String.prototype.codePointAt/master/codepointat.js
// how to convert characters to html entities
//     http://stackoverflow.com/a/1354491/347508
// html overrides from 
//   https://html.spec.whatwg.org/multipage/syntax.html#table-charref-overrides / http://stackoverflow.com/questions/1354064/how-to-convert-characters-to-html-entities-using-plain-javascript/23831239#comment36668052_1354098

var _escape_overrides = { 0x00:'\uFFFD',0x80:'\u20AC',0x82:'\u201A',0x83:'\u0192',0x84:'\u201E',0x85:'\u2026',0x86:'\u2020',0x87:'\u2021',0x88:'\u02C6',0x89:'\u2030',0x8A:'\u0160',0x8B:'\u2039',0x8C:'\u0152',0x8E:'\u017D',0x91:'\u2018',0x92:'\u2019',0x93:'\u201C',0x94:'\u201D',0x95:'\u2022',0x96:'\u2013',0x97:'\u2014',0x98:'\u02DC',0x99:'\u2122',0x9A:'\u0161',0x9B:'\u203A',0x9C:'\u0153',0x9E:'\u017E',0x9F:'\u0178' }; 

function escapeHtml(str){
    return str.replace(/([\u0000-\uD799]|[\uD800-\uDBFF][\uDC00-\uFFFF])/g, function(c) {
        var c1 = c.charCodeAt(0);
        // ascii character, use override or escape
        if( c1 <= 0xFF ) return (c1=_escape_overrides[c1])?c1:escape(c).replace(/%(..)/g,"&#x$1;");
        // utf8/16 character
        else if( c.length == 1 ) return "&#" + c1 + ";"; 
        // surrogate pair
        else if( c.length == 2 && c1 >= 0xD800 && c1 <= 0xDBFF ) return "&#" + ((c1-0xD800)*0x400 + c.charCodeAt(1) - 0xDC00 + 0x10000) + ";"
        // no clue .. 
        else return "";
    });
}

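【Comments】:

This seems to fail with str = ⚠️. Use KooiInc's encodeHTML below instead (stackoverflow.com/a/1354489/1432181).

【Solution 5】:

I solved my problem by using encodeURIComponent() instead of escape().

This may fix things for you if your problem occurs when sending your string in a URL.

Try it with this phrase ("hi & % ‘")

escape() returns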

"hi%20%26%20%25%20%u2018"

Note that %u2018 is not very URL-friendly and may break the rest of the query string.

encodeURI() returns

"hi%20&%20%25%20%E2%80%98"

Note that the ampersand is still there.

encodeURIComponent() returns

"hi%20%26%20%25%20%E2%80%98"

Finally, all of our characters are encoded properly.

【Comments】:

【Solution 6】:

You can use:

function encodeHTML(str){
 var aStr = str.split(''),
     i = aStr.length,
     aRet = [];

   while (i--) {
    var iC = aStr[i].charCodeAt();
    if (iC < 65 || iC > 127 || (iC>90 && iC<97)) {
      aRet.push('&#'+iC+';');
    } else {
      aRet.push(aStr[i]);
    }
  }
 return aRet.reverse().join('');
}

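This function HTML-encodes everything that is not a-z/A-Z (as written, the test also leaves {, |, }, ~ and DEL, character codes 123 to 127, unencoded).

[Edit] A rather old answer. Let's add a simpler String extension to encode all extended characters: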

String.prototype.encodeHTML = function () {
  return this.replace(/[\u0080-\u024F]/g, 
          function (v) {return '&#'+v.charCodeAt()+';';}
         );
}
// usage
console.log('Übergroße Äpfel mit Würmern'.encodeHTML());
//=> '&#220;bergro&#223;e &#196;pfel mit W&#252;rmern'
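
The extension above only covers U+0080 through U+024F and reads one UTF-16 code unit at a time. As a minimal sketch of a wider variant, assuming an ES2015+ engine (for the "u" regex flag and codePointAt), the function below encodes every non-ASCII code point as one decimal reference. The name encodeNonAscii is hypothetical; the function does not escape &, < or >, and a lone surrogate in the input would still yield an invalid reference.

```javascript
// Hypothetical helper, not part of the answer above.
function encodeNonAscii(str) {
  // With the "u" flag the character class matches whole code points,
  // so a surrogate pair reaches the callback as a single match.
  return str.replace(/[\u0080-\u{10FFFF}]/gu, function (ch) {
    return '&#' + ch.codePointAt(0) + ';';
  });
}

console.log(encodeNonAscii('\u00DCbergro\u00DFe \u00C4pfel mit W\u00FCrmern'));
//=> '&#220;bergro&#223;e &#196;pfel mit W&#252;rmern'
console.log(encodeNonAscii('\uD834\uDF06'));
//=> '&#119558;'
```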

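【Comments】:

This should be a lot faster than using text.replace(). I like the use of while(--i) instead of a for() loop. I assume the theory is that, for large texts/loops, the fast condition test offsets the Array.reverse().join('') outside the loop. Otherwise you'd use string concatenation?
Thanks. A decrementing loop is indeed faster than an incrementing one; it is an optimization step I read about long ago and use most of the time (it is also less code). I'm not sure whether the "reverse" part affects the speed gain. For shorter strings, using aRet[i] = [value] instead of aRet.push may be faster (as nicely explained by "olliej" in stackoverflow.com/questions/614126/…).
--i will equal 0 when it reaches the first character. Your condition should be while (--i >= 0), otherwise you lose the first character of the input string.
@AliGangji, right, it should be i--; adjusted the answer.
This sample seems flawed for Unicode characters; see stackoverflow.com/a/69588382/1432181 for a better encodeHTML() option.

【Solution 7】:

With the help of bucabay and the suggestion to create my own function, I created this one, which works for me. What do you think, is there a better solution somewhere?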

if(typeof escapeHtmlEntities == 'undefined') {
        escapeHtmlEntities = function (text) {
            return text.replace(/[\u00A0-\u2666<>\&]/g, function(c) {
                return '&' + 
                (escapeHtmlEntities.entityTable[c.charCodeAt(0)] || '#'+c.charCodeAt(0)) + ';';
            });
        };

        // all HTML4 entities as defined here: http://www.w3.org/TR/html4/sgml/entities.html
        // added: amp, lt, gt, quot and apos
        escapeHtmlEntities.entityTable = {
            34 : 'quot', 
            38 : 'amp', 
            39 : 'apos', 
            60 : 'lt', 
            62 : 'gt', 
            160 : 'nbsp', 
            161 : 'iexcl', 
            162 : 'cent', 
            163 : 'pound', 
            164 : 'curren', 
            165 : 'yen', 
            166 : 'brvbar', 
            167 : 'sect', 
            168 : 'uml', 
            169 : 'copy', 
            170 : 'ordf', 
            171 : 'laquo', 
            172 : 'not', 
            173 : 'shy', 
            174 : 'reg', 
            175 : 'macr', 
            176 : 'deg', 
            177 : 'plusmn', 
            178 : 'sup2', 
            179 : 'sup3', 
            180 : 'acute', 
            181 : 'micro', 
            182 : 'para', 
            183 : 'middot', 
            184 : 'cedil', 
            185 : 'sup1', 
            186 : 'ordm', 
            187 : 'raquo', 
            188 : 'frac14', 
            189 : 'frac12', 
            190 : 'frac34', 
            191 : 'iquest', 
            192 : 'Agrave', 
            193 : 'Aacute', 
            194 : 'Acirc', 
            195 : 'Atilde', 
            196 : 'Auml', 
            197 : 'Aring', 
            198 : 'AElig', 
            199 : 'Ccedil', 
            200 : 'Egrave', 
            201 : 'Eacute', 
            202 : 'Ecirc', 
            203 : 'Euml', 
            204 : 'Igrave', 
            205 : 'Iacute', 
            206 : 'Icirc', 
            207 : 'Iuml', 
            208 : 'ETH', 
            209 : 'Ntilde', 
            210 : 'Ograve', 
            211 : 'Oacute', 
            212 : 'Ocirc', 
            213 : 'Otilde', 
            214 : 'Ouml', 
            215 : 'times', 
            216 : 'Oslash', 
            217 : 'Ugrave', 
            218 : 'Uacute', 
            219 : 'Ucirc', 
            220 : 'Uuml', 
            221 : 'Yacute', 
            222 : 'THORN', 
            223 : 'szlig', 
            224 : 'agrave', 
            225 : 'aacute', 
            226 : 'acirc', 
            227 : 'atilde', 
            228 : 'auml', 
            229 : 'aring', 
            230 : 'aelig', 
            231 : 'ccedil', 
            232 : 'egrave', 
            233 : 'eacute', 
            234 : 'ecirc', 
            235 : 'euml', 
            236 : 'igrave', 
            237 : 'iacute', 
            238 : 'icirc', 
            239 : 'iuml', 
            240 : 'eth', 
            241 : 'ntilde', 
            242 : 'ograve', 
            243 : 'oacute', 
            244 : 'ocirc', 
            245 : 'otilde', 
            246 : 'ouml', 
            247 : 'divide', 
            248 : 'oslash', 
            249 : 'ugrave', 
            250 : 'uacute', 
            251 : 'ucirc', 
            252 : 'uuml', 
            253 : 'yacute', 
            254 : 'thorn', 
            255 : 'yuml', 
            402 : 'fnof', 
            913 : 'Alpha', 
            914 : 'Beta', 
            915 : 'Gamma', 
            916 : 'Delta', 
            917 : 'Epsilon', 
            918 : 'Zeta', 
            919 : 'Eta', 
            920 : 'Theta', 
            921 : 'Iota', 
            922 : 'Kappa', 
            923 : 'Lambda', 
            924 : 'Mu', 
            925 : 'Nu', 
            926 : 'Xi', 
            927 : 'Omicron', 
            928 : 'Pi', 
            929 : 'Rho', 
            931 : 'Sigma', 
            932 : 'Tau', 
            933 : 'Upsilon', 
            934 : 'Phi', 
            935 : 'Chi', 
            936 : 'Psi', 
            937 : 'Omega', 
            945 : 'alpha', 
            946 : 'beta', 
            947 : 'gamma', 
            948 : 'delta', 
            949 : 'epsilon', 
            950 : 'zeta', 
            951 : 'eta', 
            952 : 'theta', 
            953 : 'iota', 
            954 : 'kappa', 
            955 : 'lambda', 
            956 : 'mu', 
            957 : 'nu', 
            958 : 'xi', 
            959 : 'omicron', 
            960 : 'pi', 
            961 : 'rho', 
            962 : 'sigmaf', 
            963 : 'sigma', 
            964 : 'tau', 
            965 : 'upsilon', 
            966 : 'phi', 
            967 : 'chi', 
            968 : 'psi', 
            969 : 'omega', 
            977 : 'thetasym', 
            978 : 'upsih', 
            982 : 'piv', 
            8226 : 'bull', 
            8230 : 'hellip', 
            8242 : 'prime', 
            8243 : 'Prime', 
            8254 : 'oline', 
            8260 : 'frasl', 
            8472 : 'weierp', 
            8465 : 'image', 
            8476 : 'real', 
            8482 : 'trade', 
            8501 : 'alefsym', 
            8592 : 'larr', 
            8593 : 'uarr', 
            8594 : 'rarr', 
            8595 : 'darr', 
            8596 : 'harr', 
            8629 : 'crarr', 
            8656 : 'lArr', 
            8657 : 'uArr', 
            8658 : 'rArr', 
            8659 : 'dArr', 
            8660 : 'hArr', 
            8704 : 'forall', 
            8706 : 'part', 
            8707 : 'exist', 
            8709 : 'empty', 
            8711 : 'nabla', 
            8712 : 'isin', 
            8713 : 'notin', 
            8715 : 'ni', 
            8719 : 'prod', 
            8721 : 'sum', 
            8722 : 'minus', 
            8727 : 'lowast', 
            8730 : 'radic', 
            8733 : 'prop', 
            8734 : 'infin', 
            8736 : 'ang', 
            8743 : 'and', 
            8744 : 'or', 
            8745 : 'cap', 
            8746 : 'cup', 
            8747 : 'int', 
            8756 : 'there4', 
            8764 : 'sim', 
            8773 : 'cong', 
            8776 : 'asymp', 
            8800 : 'ne', 
            8801 : 'equiv', 
            8804 : 'le', 
            8805 : 'ge', 
            8834 : 'sub', 
            8835 : 'sup', 
            8836 : 'nsub', 
            8838 : 'sube', 
            8839 : 'supe', 
            8853 : 'oplus', 
            8855 : 'otimes', 
            8869 : 'perp', 
            8901 : 'sdot', 
            8968 : 'lceil', 
            8969 : 'rceil', 
            8970 : 'lfloor', 
            8971 : 'rfloor', 
            9001 : 'lang', 
            9002 : 'rang', 
            9674 : 'loz', 
            9824 : 'spades', 
            9827 : 'clubs', 
            9829 : 'hearts', 
            9830 : 'diams', 
            338 : 'OElig', 
            339 : 'oelig', 
            352 : 'Scaron', 
            353 : 'scaron', 
            376 : 'Yuml', 
            710 : 'circ', 
            732 : 'tilde', 
            8194 : 'ensp', 
            8195 : 'emsp', 
            8201 : 'thinsp', 
            8204 : 'zwnj', 
            8205 : 'zwj', 
            8206 : 'lrm', 
            8207 : 'rlm', 
            8211 : 'ndash', 
            8212 : 'mdash', 
            8216 : 'lsquo', 
            8217 : 'rsquo', 
            8218 : 'sbquo', 
            8220 : 'ldquo', 
            8221 : 'rdquo', 
            8222 : 'bdquo', 
            8224 : 'dagger', 
            8225 : 'Dagger', 
            8240 : 'permil', 
            8249 : 'lsaquo', 
            8250 : 'rsaquo', 
            8364 : 'euro'
        };
    }

Usage example:

var text = "Übergroße Äpfel mit Würmern";
alert(escapeHtmlEntities (text));

Result:

&Uuml;bergro&szlig;e &Auml;pfel mit W&uuml;rmern

Update 1: Thanks again to bucabay for the || hint.
Update 2: Updated the entity table with amp, lt, gt, apos and quot; thanks to richardtallent for the hint.
Update 3 (2014): Mathias Bynens created a lib called 'he'; maybe it serves your needs.
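
The function above reads one UTF-16 code unit at a time and its range stops at U+2666, so characters at U+10000 and above pass through unchanged. A minimal sketch of the same "name if known, number otherwise" idea that walks the string by code point could look like the following. NAMES is a deliberately tiny, hypothetical stand-in for the full entityTable, escapeWithNames is a made-up name, and an ES2015+ engine is assumed.

```javascript
// Hypothetical, deliberately tiny subset of the entity table above.
var NAMES = { 38: 'amp', 60: 'lt', 62: 'gt', 196: 'Auml', 220: 'Uuml', 223: 'szlig', 252: 'uuml' };

// Hypothetical helper: named entity when the table has one, decimal reference otherwise.
function escapeWithNames(text) {
  var out = '';
  // for...of iterates by code point, so a surrogate pair arrives as one string.
  for (var ch of text) {
    var cp = ch.codePointAt(0);
    if (Object.prototype.hasOwnProperty.call(NAMES, cp)) {
      out += '&' + NAMES[cp] + ';';
    } else if (cp > 0x7F) {
      out += '&#' + cp + ';';
    } else {
      out += ch; // plain ASCII stays as it is
    }
  }
  return out;
}

console.log(escapeWithNames('\u00DCbergro\u00DFe \u00C4pfel mit W\u00FCrmern'));
// &Uuml;bergro&szlig;e &Auml;pfel mit W&uuml;rmern
console.log(escapeWithNames('\uD834\uDF06'));
// &#119558;
```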

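【Comments】:

Looks good. I'd go for: escapeHtmlEntities.entityTable[c.charCodeAt(0)] || '#'+c.charCodeAt(0), so that you catch the char codes that are not in entityTable.
This is a great solution, striking a good balance between catching all extended Unicode characters and still giving named entities for the most common ones. You should probably add amp, gt and lt to the entityTable. One small caveat: some older browsers may not support all of the named entities you have in that dictionary.
@Chris How about turning this into a library? :)
Another caveat: the code as written does not handle Unicode characters U+10000 and above correctly. To handle those, it is necessary to add code that combines each UTF-16 surrogate pair into a single value.
@Mathias Bynens I added a link to your library to the answer.

【Solution 8】:

Just reposting @bucabay's answer as a "bookmarklet", since it is sometimes easier than using those lookup pages: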

alert(prompt('Enter characters to htmlEncode', '').replace(/[\u00A0-\u2666]/g, function(c) {
   return '&#'+c.charCodeAt(0)+';';
}));

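【Comments】:

This does not work for astral symbols (e.g. try 𝌆) and has some other issues too. Consider using my online HTML entity encoder/decoder instead of this bookmarklet.
@MathiasBynens You have clearly thought about this more and are right, but it's still not a bookmarklet ;)
P.S. Linking to astral symbols, because I had no idea what @MathiasBynens was talking about.
IMHO, it is much easier to add http://mothereff.in/html-entities#%s as a custom search engine to my browser, but if you insist: javascript:void (function(){location='http://mothereff.in/html-entities#'+encodeURIComponent(prompt('Enter text to HTML-encode:',''))}())

【Solution 9】:

The best solution is posted at phpjs.org, an implementation of the PHP function htmlentities.

The format is htmlentities(string, quote_style, charset, double_encode). Full documentation on the PHP function can be read here.

【Comments】:

【Solution 10】:

Having a lookup table with a bunch of replace() calls is slow and not maintainable.

Fortunately, the built-in escape() function encodes most of the same characters, and puts them in a consistent format (%XX, where XX is the hex value of the character).

So you can let the escape() method do most of the work for you and just change its output to HTML entities instead of URL-escaped characters: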

htmlescaped = escape(mystring).replace(/%(..)/g,"&#x$1;");

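This uses the hex format for the escaped values rather than named entities, but for storing and displaying the values it works just as well as named entities.

Of course, escape also escapes characters that do not need to be escaped in HTML (spaces, for instance), but you can unescape them with a few replace calls.

Edit: I like bucabay's answer better... it handles a larger range of characters and needs no hacking afterwards to get spaces, slashes, etc. unescaped.

【Comments】:

This approach does not take into account the character reference overrides in the HTML Standard. For example, htmlEncode('\x80') should not return &#x80; or &#128;. In fact, it should not return an HTML entity at all. There is no valid way to represent that character in HTML. See my answer for more information, and for a better solution.
Actually, this does not work for characters outside the \u00FF range.

【Solution 11】:

Using escape() should work for the character code range 0x00 to 0xFF (the Latin-1 range).

If you go beyond 0xFF (255), such as 0x100 (256), then escape() will not work: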

escape("\u0100"); // %u0100

and:

text = "\u0100"; // Ā
html = escape(text).replace(/%(..)/g,"&#x$1;"); // &#xu0;100
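
The garbled result comes from the pattern %(..) grabbing "u0" out of "%u0100". As a minimal sketch, the replacement can instead match both forms that escape() produces, %XX and %uXXXX. The name escapeToEntities is hypothetical, and escape() is a legacy function, so this is shown only to explain the failure. Astral symbols still come out as two separate references, one per surrogate, which is the broken behaviour described in Solution 2.

```javascript
// Hypothetical helper: handles both %XX and %uXXXX sequences from escape().
function escapeToEntities(text) {
  return escape(text).replace(/%u([0-9A-F]{4})|%([0-9A-F]{2})/g, function (match, wide, narrow) {
    // Exactly one of the two groups is set; both are already hex digits.
    return '&#x' + (wide || narrow) + ';';
  });
}

console.log(escapeToEntities('\u0100')); // &#x0100;
console.log(escapeToEntities('\u00FC')); // &#xFC;
```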

So, if you want to cover all the Unicode characters defined at http://www.w3.org/TR/html4/sgml/entities.html, then you can use something like:

var html = text.replace(/[\u00A0-\u00FF]/g, function(c) {
   return '&#'+c.charCodeAt(0)+';';
});

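Note that the range here is \u00A0-\u00FF.

This is the first character code range defined at http://www.w3.org/TR/html4/sgml/entities.html, which is the same as what escape() covers.

You will also need to add the other ranges you want to cover, or all of them.

Example: the Latin-1 range plus general punctuation (\u00A0-\u00FF and \u2022-\u2135)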

var html = text.replace(/[\u00A0-\u00FF\u2022-\u2135]/g, function(c) {
   return '&#'+c.charCodeAt(0)+';';
});

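Edit:

By the way: \u00A0-\u2666 should blindly convert every Unicode character code from U+00A0 through U+2666 to an HTML entity (characters above U+2666, including astral symbols, are left untouched):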

var html = text.replace(/[\u00A0-\u2666]/g, function(c) {
   return '&#'+c.charCodeAt(0)+';';
});

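【Comments】:

Very nice, bucabay... I was dealing with the simplest UTF8 case and hacking quickly, but this is definitely a more robust solution. Nice use of passing a function to handle the regex replacement; I had forgotten you could do that. Upvoted. It does need a quick fix to add ampersand, less-than and greater-than to the character range, though, so that it can fully replace my code.
This is much easier than those htmlencode lookup services. Try it as a bookmarklet? alert(prompt('Enter characters to htmlEncode', '').replace(/[\u00A0-\u2666]/g, function(c) { return '&#'+c.charCodeAt(0)+';'; }));
This approach does not take into account the character reference overrides in the HTML Standard. For example, htmlEncode('\x80') should not return &#x80; or &#128;. In fact, it should not return an HTML entity at all. There is no valid way to represent that character in HTML. See my answer for more information, and for a better solution.
The last edit saved the day. Thank you. I was getting some HTML back from the server and trying to open it in a popup. This came in handy.

【Solution 12】:

I adapted one of the answers from the referenced question, but added the ability to define an explicit mapping for character names.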

var char_names = {
    160:'nbsp',
    161:'iexcl',
    220:'Uuml',
    223:'szlig',
    196:'Auml',
    252:'uuml',
    };

function HTMLEncode(str){
     var aStr = str.split(''),
         i = aStr.length,
         aRet = [];

     while (--i >= 0) {
      var iC = aStr[i].charCodeAt();
       if (iC < 32 || (iC > 32 && iC < 65) || iC > 127 || (iC>90 && iC<97)) {
        if(char_names[iC]!=undefined) {
         aRet.push('&'+char_names[iC]+';');
        }
        else {
         aRet.push('&#'+iC+';');
        }
       } else {
        aRet.push(aStr[i]);
       }
    }
    return aRet.reverse().join('');
   }

var text = "Übergroße Äpfel mit Würmer";

alert(HTMLEncode(text));

【Comments】: