ASCII, Unicode and Base64

Date: 2019-11-17


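ASCII

As everyone knows, all data in a computer is stored in binary. The world's text includes English letters, English punctuation, Arabic numerals, the scripts of other languages, and symbols. How does a computer represent all of this internally? The original ASCII code covers English letters, English punctuation, and Arabic numerals. Each character occupies one byte but uses only the low 7 bits, giving 2**7 = 128 characters. When I got to this point I counted the keys on my keyboard: 52 letters (upper and lower case), 10 digits, 32 punctuation marks, and 1 space, for a total of 95 printable characters. The remaining 33 are control characters. You can look them up on the wiki, but they add little to understanding character encoding, so I will skip them here.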
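As a quick check of these counts, the sketch below walks through all 128 ASCII codes and sorts them into categories. The function name `classifyAscii` is only illustrative and does not come from the original post.

```javascript
// Count the 128 ASCII codes by category.
// Codes 0x00-0x1F and 0x7F are control characters; 0x20-0x7E are printable.
function classifyAscii() {
  var counts = { letters: 0, digits: 0, punctuation: 0, space: 0, control: 0 };
  for (var code = 0; code < 128; code++) {
    var ch = String.fromCharCode(code);
    if (code < 0x20 || code === 0x7F) counts.control++;
    else if (ch === ' ') counts.space++;
    else if (/[A-Za-z]/.test(ch)) counts.letters++;
    else if (/[0-9]/.test(ch)) counts.digits++;
    else counts.punctuation++;
  }
  return counts;
}

console.log(classifyAscii());
// { letters: 52, digits: 10, punctuation: 32, space: 1, control: 33 }
```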

Unicode

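Clearly, such a character set cannot handle the vast range of the world's languages. This led to Unicode, a unified coded character set: a single worldwide standard in which each code corresponds to exactly one character. It builds on ASCII and adds codes for the scripts of all kinds of languages and even newer symbols such as emoji, and it is still being extended. A Unicode character is usually written as "U+" followed by a group of hexadecimal digits. The Unicode code space runs from U+0000 to U+10FFFF and is divided into 17 groups (0 to 16), each called a plane. Plane 0, also called the Basic Multilingual Plane (BMP), covers U+0000 to U+FFFF; the original post shows the remaining planes in a figure. The character charts at unicode.org and a table of Han characters are also worth opening.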
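Since each plane holds 0x10000 code points, the plane number is simply the part of the code point above its low 16 bits. A minimal sketch, with the helper names `planeOf` and `toUPlus` chosen here purely for illustration:

```javascript
// Return the plane number (0-16) of a Unicode code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  return codePoint >> 16;
}

// Format a code point in the conventional "U+XXXX" notation
// (at least four hexadecimal digits).
function toUPlus(codePoint) {
  var hex = codePoint.toString(16).toUpperCase();
  while (hex.length < 4) hex = '0' + hex;
  return 'U+' + hex;
}

console.log(toUPlus('回'.codePointAt(0)), planeOf(0x56DE)); // U+56DE 0
console.log(toUPlus(0x10437), planeOf(0x10437));            // U+10437 1
console.log(planeOf(0x10FFFF));                             // 16
```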

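If you opened the links above, you will have seen that most characters are written as U+xxxx, that is, 2 bytes or 16 bits. The character "回", for example, has the Unicode code point U+56DE. With a code for each character settled, how does a computer store and process a sequence of characters? In other words, how are the codes actually turned into bytes? There are three encoding forms, UTF-8, UTF-16, and UTF-32, of which UTF-8 and UTF-16 are the most commonly used.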

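While reading about Unicode I kept running into the terms UCS-2 and UCS-4. The English Wikipedia article on Unicode says: "UCS-2 is outdated, though still widely used in software". In other words, UCS-2 is an obsolete encoding whose modern successor is UTF-16; the 2 refers to 2 bytes and the 16 to 16 bits. The two are not identical: UCS-2 is fixed at 2 bytes per character and can only represent the BMP, whereas UTF-16 adds surrogate pairs to reach the other planes. The following passage is quoted from the wiki: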

Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Coded Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code values. All UTF encodings map all code points (except surrogates) to a unique sequence of bytes. The numbers in the names of the encodings indicate the number of bits per code value (for UTF encodings) or the number of bytes per code value (for UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.

UTF-8

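UTF-8 is a variable-length encoding. Its original design allowed 1 to 6 bytes per character, but because Unicode code points stop at U+10FFFF, at most 4 bytes are needed. The encoding rules are:

1. Take the character's Unicode code point and find which range in the first column of the table below it falls into.
1. Look up the binary template that corresponds to that range.
1. Convert the code point to binary and fill the x positions from right to left, padding any remaining x positions with 0.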

   Code point range    | UTF-8 encoding
   (hexadecimal)       | (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

That completes the encoding. Try it on "回": the UTF-8 result should be 11100101 10011011 10011110, or e59b9e in hexadecimal. Finally, you can verify this in an editor.
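The three steps and the table above can be sketched directly in code. This is a minimal illustration that works on a single code point; the names `utf8Bytes` and `toHex` are made up for this example.

```javascript
// Encode one Unicode code point into an array of UTF-8 bytes,
// following the four ranges of the table.
function utf8Bytes(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0x7F) {
    return [codePoint];                               // 0xxxxxxx
  }
  if (codePoint <= 0x7FF) {
    return [
      0xC0 | (codePoint >> 6),                        // 110xxxxx
      0x80 | (codePoint & 0x3F)                       // 10xxxxxx
    ];
  }
  if (codePoint <= 0xFFFF) {
    return [
      0xE0 | (codePoint >> 12),                       // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3F),               // 10xxxxxx
      0x80 | (codePoint & 0x3F)                       // 10xxxxxx
    ];
  }
  return [
    0xF0 | (codePoint >> 18),                         // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3F),                // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3F),                 // 10xxxxxx
    0x80 | (codePoint & 0x3F)                         // 10xxxxxx
  ];
}

// Render bytes as a lowercase hexadecimal string.
function toHex(bytes) {
  return bytes.map(function (b) {
    return (b < 16 ? '0' : '') + b.toString(16);
  }).join('');
}

console.log(toHex(utf8Bytes(0x41)));    // 41
console.log(toHex(utf8Bytes(0xE9)));    // c3a9
console.log(toHex(utf8Bytes(0x56DE)));  // e59b9e
console.log(toHex(utf8Bytes(0x10437))); // f09090b7
```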

UTF-16

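UTF-16 encodes each character as one or two unsigned 16-bit integers. The encoding rules are:

1. Take the character's Unicode code point U. If it is less than 0x10000, the encoded result is U itself, as a single 16-bit value.
1. If U is in the range 0x10000-0x10FFFF, let U' = U - 0x10000. U' never exceeds 20 bits, so split it into two 10-bit halves and place them in the low 10 bits of W1 = 0xD800 and W2 = 0xDC00 respectively, like this: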

U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx

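Quite neat. Try computing the UTF-16 encoding of U+10437: the answer is 1101 1000 0000 0001 1101 1100 0011 0111, or d801 dc37 in hexadecimal. Byte order also matters here. Endianness describes the order in which the bytes of a multi-byte value are written to memory: when a 16-bit value is stored as two 8-bit bytes, placing the high-order byte at the lower memory address is called big-endian, and placing it at the higher address is called little-endian. So d801 dc37 is big-endian and 01d8 37dc is little-endian. The UTF-16 result can likewise be verified in an editor.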
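The surrogate-pair rule and the two byte orders can be checked with a short sketch. The helper names `utf16Units`, `unitsToBytes`, and `hexBytes` are assumptions made for this example only.

```javascript
// Encode one code point as one or two 16-bit UTF-16 code units.
function utf16Units(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint];
  }
  var u = codePoint - 0x10000;                 // U', at most 20 bits
  return [
    0xD800 | (u >> 10),                        // W1: high 10 bits of U'
    0xDC00 | (u & 0x3FF)                       // W2: low 10 bits of U'
  ];
}

// Serialize 16-bit code units to bytes in the requested byte order.
function unitsToBytes(units, littleEndian) {
  var bytes = [];
  units.forEach(function (unit) {
    var high = unit >> 8;
    var low = unit & 0xFF;
    if (littleEndian) bytes.push(low, high);
    else bytes.push(high, low);
  });
  return bytes;
}

// Render bytes as space-separated two-digit hexadecimal values.
function hexBytes(bytes) {
  return bytes.map(function (b) {
    return (b < 16 ? '0' : '') + b.toString(16);
  }).join(' ');
}

var units = utf16Units(0x10437);
console.log(units.map(function (u) { return u.toString(16); }).join(' ')); // d801 dc37
console.log(hexBytes(unitsToBytes(units, false))); // d8 01 dc 37 (big-endian)
console.log(hexBytes(unitsToBytes(units, true)));  // 01 d8 37 dc (little-endian)
```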

JavaScript

JavaScript uses UTF-16 internally for strings, so let's look at the encoding-related APIs.

`String.prototype.charCodeAt`

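This method returns the integer value of the UTF-16 code unit at the given index. For characters in the BMP this is the same as the Unicode code point. For example: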

var sentence = '回家吧！';

var index = 0;

console.log('The character code ' + sentence.charCodeAt(index) + ' is equal to ' + sentence.charAt(index));
// expected output: "The character code 22238 is equal to 回"

As a variation, here is the hexadecimal form of "回":

var sentence = '回家吧！';

var index = 0;

console.log('The character code ' + sentence.charCodeAt(index).toString(16) + ' is equal to ' + sentence.charAt(index));
// expected output: "The character code 56de is equal to 回"

`String.fromCharCode`

This method builds a string from one or more UTF-16 code unit values.

console.log(String.fromCharCode(22238));
// expected output: "回"
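Because both methods work on 16-bit code units, a character outside the BMP occupies two positions in a string. The sketch below uses U+10437 from the UTF-16 section and contrasts these methods with the standard ES2015 additions `String.prototype.codePointAt` and `String.fromCodePoint`, which work on whole code points.

```javascript
// U+10437 written as its surrogate pair d801 dc37.
var s = '\uD801\uDC37';

console.log(s.length);                        // 2 (two code units, one character)
console.log(s.charCodeAt(0).toString(16));    // d801 (high surrogate only)
console.log(s.charCodeAt(1).toString(16));    // dc37 (low surrogate only)
console.log(s.codePointAt(0).toString(16));   // 10437 (the full code point)

// Both calls produce the same two-unit string.
console.log(String.fromCharCode(0xD801, 0xDC37) === s); // true
console.log(String.fromCodePoint(0x10437) === s);       // true
```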

Base64

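Base64 takes the input in units of three 8-bit bytes and converts each unit into four 6-bit groups. Each 6-bit group is padded with two leading zero bits to form an 8-bit index, so there are 2**6 = 64 possible values. Each value is looked up in the Base64 index table to find a printable character, and these characters form the Base64 output. The input length may not be a multiple of 3: if two bytes are left over, one "=" is appended to the result; if one byte is left over, two "=" are appended. One point to keep clearly in mind: Base64 operates on bytes, so the same string gives a different Base64 result under a different character encoding.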
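To make the grouping, the padding, and the dependence on the character encoding concrete, here is a minimal sketch that encodes an array of byte values. The names `TABLE` and `base64FromBytes` are illustrative, and the code favors clarity over speed.

```javascript
var TABLE = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Base64-encode an array of byte values (each 0-255).
function base64FromBytes(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i += 3) {
    var hasB2 = i + 1 < bytes.length;
    var hasB3 = i + 2 < bytes.length;
    var b1 = bytes[i];
    var b2 = hasB2 ? bytes[i + 1] : 0;   // missing bytes contribute zero bits
    var b3 = hasB3 ? bytes[i + 2] : 0;

    out += TABLE[b1 >> 2];
    out += TABLE[((b1 & 0x03) << 4) | (b2 >> 4)];
    out += hasB2 ? TABLE[((b2 & 0x0F) << 2) | (b3 >> 6)] : '=';
    out += hasB3 ? TABLE[b3 & 0x3F] : '=';
  }
  return out;
}

console.log(base64FromBytes([0x4D, 0x61, 0x6E])); // TWFu ("Man", no padding)
console.log(base64FromBytes([0x4D, 0x61]));       // TWE= (two bytes left over)
console.log(base64FromBytes([0x4D]));             // TQ== (one byte left over)

// The same character gives different output under different encodings.
console.log(base64FromBytes([0xE5, 0x9B, 0x9E])); // 5Zue ("回" as UTF-8)
console.log(base64FromBytes([0x56, 0xDE]));       // Vt4= ("回" as big-endian UTF-16)
```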

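JavaScript has two built-in functions, btoa() and atob(), for Base64 encoding and decoding. They only accept strings in which every character fits in a single byte (code values 0 to 255), so a string containing any other Unicode character makes btoa() throw. What about Unicode strings? The solution given by MDN is: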

// ucs-2 string to base64 encoded ascii
function utoa(str) {
    return window.btoa(unescape(encodeURIComponent(str)));
}
// base64 encoded ascii to ucs-2 string
function atou(str) {
    return decodeURIComponent(escape(window.atob(str)));
}
// Usage:
utoa('✓ à la mode'); // 4pyTIMOgIGxhIG1vZGU=
atou('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

utoa('I \u2661 Unicode!'); // SSDimaEgVW5pY29kZSE=
atou('SSDimaEgVW5pY29kZSE='); // "I ♡ Unicode!"
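`escape` and `unescape` are deprecated, so an alternative may be worth knowing. The sketch below is not MDN's code from above; it assumes an environment that provides `TextEncoder`, `TextDecoder`, `btoa`, and `atob` as globals (current browsers and recent Node.js releases do), and the names `utoaModern` and `atouModern` are made up for this example.

```javascript
// Unicode string -> UTF-8 bytes -> Base64.
function utoaModern(str) {
  var bytes = new TextEncoder().encode(str);   // always UTF-8
  var binary = '';
  for (var i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);   // one character per byte
  }
  return btoa(binary);
}

// Base64 -> UTF-8 bytes -> Unicode string.
function atouModern(b64) {
  var binary = atob(b64);
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder().decode(bytes);      // defaults to UTF-8
}

console.log(utoaModern('✓ à la mode'));          // 4pyTIMOgIGxhIG1vZGU=
console.log(atouModern('4pyTIMOgIGxhIG1vZGU=')); // ✓ à la mode
```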

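At this point you may wonder: will the result still decode correctly on the other side, whether that is a client or a server? I believe the answer is yes. First, look at another solution found online: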

var Base64 = {
	// Base64 index table
	table : [
			'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
			'I', 'J', 'K', 'L', 'M', 'N', 'O' ,'P',
			'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
			'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f',
			'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
			'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
			'w', 'x', 'y', 'z', '0', '1', '2', '3',
			'4', '5', '6', '7', '8', '9', '+', '/'
	],
	UTF16ToUTF8 : function(str) {
		var res = [], len = str.length;
		for (var i = 0; i < len; i++) {
			var code = str.charCodeAt(i);
			// A high surrogate followed by a low surrogate encodes one
			// code point above U+FFFF, so combine the pair first
			if (code >= 0xD800 && code <= 0xDBFF && i + 1 < len) {
				var low = str.charCodeAt(i + 1);
				if (low >= 0xDC00 && low <= 0xDFFF) {
					code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00);
					i++;
				}
			}
			if (code <= 0x007F) {
				// One byte
				// U+0000 - U+007F 	0xxxxxxx
				res.push(String.fromCharCode(code));
			} else if (code <= 0x07FF) {
				// Two bytes
				// U+0080 - U+07FF 	110xxxxx 10xxxxxx
				var byte1 = 0xC0 | ((code >> 6) & 0x1F);
				var byte2 = 0x80 | (code & 0x3F);
				res.push(
					String.fromCharCode(byte1),
					String.fromCharCode(byte2)
				);
			} else if (code <= 0xFFFF) {
				// Three bytes
				// U+0800 - U+FFFF 	1110xxxx 10xxxxxx 10xxxxxx
				var byte1 = 0xE0 | ((code >> 12) & 0x0F);
				var byte2 = 0x80 | ((code >> 6) & 0x3F);
				var byte3 = 0x80 | (code & 0x3F);
				res.push(
					String.fromCharCode(byte1),
					String.fromCharCode(byte2),
					String.fromCharCode(byte3)
				);
			} else {
				// Four bytes
				// U+10000 - U+10FFFF 	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
				var byte1 = 0xF0 | ((code >> 18) & 0x07);
				var byte2 = 0x80 | ((code >> 12) & 0x3F);
				var byte3 = 0x80 | ((code >> 6) & 0x3F);
				var byte4 = 0x80 | (code & 0x3F);
				res.push(
					String.fromCharCode(byte1),
					String.fromCharCode(byte2),
					String.fromCharCode(byte3),
					String.fromCharCode(byte4)
				);
			}
			// Unicode ends at U+10FFFF, so the five- and six-byte forms
			// of the original UTF-8 design are never needed
		}

		return res.join('');
	},
	UTF8ToUTF16 : function(str) {
		var res = [], len = str.length;
		for (var i = 0; i < len; i++) {
			var code = str.charCodeAt(i);
			// Inspect the leading byte to find the length of the sequence
			if (((code >> 7) & 0xFF) == 0x0) {
				// One byte
				// 0xxxxxxx
				res.push(str.charAt(i));
			} else if (((code >> 5) & 0xFF) == 0x6) {
				// Two bytes
				// 110xxxxx 10xxxxxx
				var code2 = str.charCodeAt(++i);
				var utf16 = ((code & 0x1F) << 6) | (code2 & 0x3F);
				res.push(String.fromCharCode(utf16));
			} else if (((code >> 4) & 0xFF) == 0xE) {
				// Three bytes
				// 1110xxxx 10xxxxxx 10xxxxxx
				var code2 = str.charCodeAt(++i);
				var code3 = str.charCodeAt(++i);
				var utf16 = ((code & 0x0F) << 12) | ((code2 & 0x3F) << 6) | (code3 & 0x3F);
				res.push(String.fromCharCode(utf16));
			} else if (((code >> 3) & 0xFF) == 0x1E) {
				// Four bytes
				// 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
				var code2 = str.charCodeAt(++i);
				var code3 = str.charCodeAt(++i);
				var code4 = str.charCodeAt(++i);
				var point = ((code & 0x07) << 18) | ((code2 & 0x3F) << 12) |
					((code3 & 0x3F) << 6) | (code4 & 0x3F);
				// Code points above U+FFFF become a surrogate pair
				point -= 0x10000;
				res.push(String.fromCharCode(0xD800 | (point >> 10), 0xDC00 | (point & 0x3FF)));
			}
			// Any other leading byte is not valid UTF-8 and is skipped
		}

		return res.join('');
	},
	encode : function(str) {
		if (!str) {
			return '';
		}
		var utf8    = this.UTF16ToUTF8(str); // Convert to a UTF-8 byte string
		var i = 0; // Read index
		var len = utf8.length;
		var res = [];
		while (i < len) {
			var c1 = utf8.charCodeAt(i++) & 0xFF;
			res.push(this.table[c1 >> 2]);
			// One byte left over: pad with two "="
			if (i == len) {
				res.push(this.table[(c1 & 0x3) << 4]);
				res.push('==');
				break;
			}
			var c2 = utf8.charCodeAt(i++);
			// Two bytes left over: pad with one "="
			if (i == len) {
				res.push(this.table[((c1 & 0x3) << 4) | ((c2 >> 4) & 0x0F)]);
				res.push(this.table[(c2 & 0x0F) << 2]);
				res.push('=');
				break;
			}
			var c3 = utf8.charCodeAt(i++);
			res.push(this.table[((c1 & 0x3) << 4) | ((c2 >> 4) & 0x0F)]);
			res.push(this.table[((c2 & 0x0F) << 2) | ((c3 & 0xC0) >> 6)]);
			res.push(this.table[c3 & 0x3F]);
		}

		return res.join('');
	},
	decode : function(str) {
		if (!str) {
			return '';
		}

		var len = str.length;
		var i   = 0;
		var res = [];

		while (i < len) {
			// "=" padding is not in the table, so indexOf returns -1 for it
			var code1 = this.table.indexOf(str.charAt(i++));
			var code2 = this.table.indexOf(str.charAt(i++));
			var code3 = this.table.indexOf(str.charAt(i++));
			var code4 = this.table.indexOf(str.charAt(i++));

			var c1 = (code1 << 2) | (code2 >> 4);
			res.push(String.fromCharCode(c1));

			if (code3 != -1) {
				var c2 = ((code2 & 0xF) << 4) | (code3 >> 2);
				res.push(String.fromCharCode(c2));
			}
			if (code4 != -1) {
				var c3 = ((code3 & 0x3) << 6) | code4;
				res.push(String.fromCharCode(c3));
			}

		}

		return this.UTF8ToUTF16(res.join(''));
	}
};

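Comparing the two approaches, they output the same Base64 string. The second approach first converts the UTF-16 string to a UTF-8 byte string and then converts that byte string to Base64. The results match because encodeURIComponent in the first approach also encodes as UTF-8. So I believe that as long as the encoding side and the decoding side agree on one character encoding, the recovered information will be the same; how each side implements the decoding is up to them.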

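Finally

To close, here is a formula that I keep forgetting; while writing this article I kept flipping back to where I had written it in my notebook:

1 byte = 8 bits

1 character = one or more bytes

References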

tools.ietf.org/html/rfc362…

tools.ietf.org/html/rfc278…

en.wikipedia.org/wiki/Unicod…

my.oschina.net/goal/blog/2…

developer.mozilla.org/en-US/docs/…

www.ruanyifeng.com/blog/2007/1…

zh.wikipedia.org/wiki/Base64

www.ruanyifeng.com/blog/2008/0…