How to convert NCR (Numeric Character Reference) sequences into actual characters

Date: 2019-12-04


During development I ran into a strange-looking encoding format:

&#27599;&#26085;&#19968;&#33394;|&#34013;&#30333;~

Decoding it with decode/unescape/decodeURI had no effect. After some digging, here is a summary of what I found.
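A quick check of why those functions do nothing here: decodeURI, decodeURIComponent and unescape only rewrite percent escapes such as %E4 or %u4E2D, and the sample string contains none. The variable names below are only illustrative.

```javascript
// The sample string from above: it contains no percent escapes at all.
var ncrSample = '&#27599;&#26085;&#19968;&#33394;|&#34013;&#30333;~';

// Each of these functions only rewrites %-escapes,
// so each one hands the input back unchanged.
var afterDecodeURI = decodeURI(ncrSample);
var afterDecodeURIComponent = decodeURIComponent(ncrSample);
var afterUnescape = unescape(ncrSample); // deprecated, shown only because the text mentions it

console.log(afterDecodeURI === ncrSample);          // true
console.log(afterDecodeURIComponent === ncrSample); // true
console.log(afterUnescape === ncrSample);           // true
```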

In fact, the strange format above is not an encoding at all. It is a markup construct called NCR (Numeric Character Reference).

Numeric Character Reference

Here is how Wikipedia explains it:

A numeric character reference (NCR) is a common markup construct used in SGML and other SGML-related markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represent a single character from the Universal Character Set (UCS) of Unicode.

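In short: an NCR is a common markup construct used in SGML and SGML-like markup languages such as HTML and XML. It is a short sequence of characters that stands for a single character from the universal character set.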

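An NCR consists of an ampersand (&) followed by a number sign (#), then the character's Unicode code point value, and finally a semicolon. The first two forms below are NCRs. The third, a named reference with no number sign, is listed for comparison:

&#dddd;
&#xhhhh;
&name;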

Here dddd is the character's code point written in decimal, and hhhh is the same code point written in hexadecimal.

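Taking HTML as an example, all three of these escape sequences are called character references:
The first two are numeric character references (NCR). The number is the Unicode code point of the target character: a reference that starts with "&#" is followed by decimal digits, and one that starts with "&#x" is followed by hexadecimal digits.
The last one is a character entity reference. It is followed by the name of a predefined entity, and the entity declares which character it stands for.
Since HTML 4, NCRs are based on Unicode and are independent of the document encoding.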

The two characters of 「中国」 are the Unicode characters U+4E2D and U+56FD. The hexadecimal code point values "4E2D" and "56FD" are "20013" and "22269" in decimal. Therefore:

&#x4e2d;&#x56fd;
&#20013;&#22269;

Both of these NCR forms are rendered as 「中国」 when displayed.
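The relationship between a character, its code point and the two NCR spellings can be checked with a few lines of JavaScript. This is only a sketch: toNcr is a hypothetical helper name, not something from the original post.

```javascript
// Hypothetical helper: build both NCR spellings for every character of a string.
function toNcr(text) {
  var decimal = '';
  var hex = '';
  // for...of iterates by code point, so characters outside the BMP stay whole.
  for (var ch of text) {
    var codePoint = ch.codePointAt(0);
    decimal += '&#' + codePoint + ';';
    hex += '&#x' + codePoint.toString(16) + ';';
  }
  return { decimal: decimal, hex: hex };
}

var china = toNcr('中国');
console.log(china.decimal); // &#20013;&#22269;
console.log(china.hex);     // &#x4e2d;&#x56fd;

// The hexadecimal and decimal values name the same code points.
console.log(parseInt('4E2D', 16), parseInt('56FD', 16)); // 20013 22269
```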

How to convert NCR sequences into actual characters

The method is as follows:

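var regex_num_set = /&#(\d+);/g; // matches decimal NCRs and captures the digits
var str = "Here is some text: &#27599;&#26085;&#19968;&#33394;|&#34013;&#30333;~";

// Replace each NCR with the character for the captured code value.
// Note: String.fromCharCode() only covers code values up to 0xFFFF.
str = str.replace(regex_num_set, function(_, $1) {
  return String.fromCharCode(Number($1));
});

document.write('<pre>' + JSON.stringify(str, null, 3) + '</pre>');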

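The example above uses String.prototype.replace() and String.fromCharCode(). The idea is to go through the NCRs in the string one by one, capture the Unicode code value between "&#" and ";" for each, and then use String.fromCharCode() to turn that code value into the actual character.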
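The same idea can be extended to cover the hexadecimal form and code points above 0xFFFF. The following is a minimal sketch under a few assumptions: decodeNcr is a hypothetical name, String.fromCodePoint() is swapped in for String.fromCharCode(), and the full HTML parsing rules for character references (named entities, special-cased values) are not implemented.

```javascript
// Hypothetical helper: decodes decimal (&#dddd;) and hexadecimal (&#xhhhh;) NCRs.
function decodeNcr(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/g, function (match, hex, dec) {
    // Only one of the two capture groups participates in any given match.
    var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave the reference untouched if it is outside the Unicode range,
    // because String.fromCodePoint() would throw a RangeError.
    if (codePoint > 0x10FFFF) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNcr('&#27599;&#26085;&#19968;&#33394;|&#34013;&#30333;~')); // 每日一色|蓝白~
console.log(decodeNcr('&#x4e2d;&#x56fd;'));                                   // 中国
```

Unlike the decimal-only regex, this version also accepts an uppercase X and mixed-case hex digits, and it passes through anything it cannot decode instead of guessing.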

Blog post: http://joebon.cc/convert-numeric-chracter-reference-to-actual-character
