The Relationship Between Unicode, UTF-8, and UTF-16

Date: 2019-11-07


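1. Why Unicode is needed

In the early days, the computing world had only ASCII. Over time more control characters, punctuation marks, and other symbols were added, and today a single document can mix many languages, for example: English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ. Characters from still more languages may appear in the future, and computers need to display all of them. A single character set that covers every language is therefore necessary, and that is the reason Unicode was created.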

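2. A brief introduction to Unicode

Unicode is a character set that aims to cover the characters of all the world's languages. It assigns each character a unique number, officially called a code point. One major advantage of Unicode is that its first 128 code points are identical to ASCII and its first 256 are identical to ISO-8859-1. How many bytes a character occupies depends on the encoding: in UTF-16 most commonly used characters fit in two bytes, and in UTF-8 they take one to three.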
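As a small illustration of code points, here is a Python sketch. The helper name `code_point` is made up for this example; `ord`, `chr`, and the `latin-1` codec are standard Python.

```python
def code_point(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

# Each character maps to exactly one integer, and chr() maps it back.
print(code_point("A"))       # U+0041
print(code_point("\u6C49"))  # U+6C49 (the character 汉)
print(chr(0x20AC))           # the euro sign, code point U+20AC

# The first 256 code points coincide with ISO-8859-1 (Latin-1):
# decoding any single byte as Latin-1 yields the code point of the same value.
latin1_matches = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(256)
)
print(latin1_matches)        # True
```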

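3. Why encodings such as UTF-8 or UTF-16 are needed

Unicode assigns every character a number, but a code point is only an abstract integer. It does not say how that number is stored as bytes in memory, in a file, or on the network. The rules that map code points to byte sequences, and back again, are called an encoding. The commonly used UTF-8 and UTF-16 are different encodings of the same Unicode code points. GBK, also in common use, encodes its own character set rather than Unicode, although its characters can be converted to and from Unicode code points.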
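To make the idea of an encoding concrete, the following Python sketch encodes one code point under several encodings. The helper `encoded_hex` is a name introduced here for illustration, and `bytes.hex` with a separator assumes Python 3.8 or later.

```python
def encoded_hex(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as space-separated hex."""
    return text.encode(encoding).hex(" ")

# One code point (U+20AC, the euro sign), three different byte sequences.
euro = "\u20AC"
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, encoded_hex(euro, encoding))
# utf-8 e2 82 ac
# utf-16-be 20 ac
# utf-32-be 00 00 20 ac

# Decoding with the matching encoding recovers the same code point.
print(bytes.fromhex("e2 82 ac").decode("utf-8") == euro)  # True
```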

4. Differences between UTF-8 and UTF-16

(1) Comparison by memory use:

UTF-8:
1 byte: standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: the rest of the BMP
4 bytes: all remaining Unicode characters (those outside the BMP)

UTF-16:
2 bytes: the BMP
4 bytes: all remaining Unicode characters (those outside the BMP)
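The size classes above can be checked with a short Python sketch. `encoded_sizes` is a hypothetical helper, and the sample characters are chosen here to hit one row of each list.

```python
def encoded_sizes(ch: str):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # utf-16-le is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

samples = [
    "$",            # U+0024, ASCII
    "\u00A2",       # U+00A2, cent sign (Latin-1 range)
    "\u10D0",       # U+10D0, a Georgian letter
    "\u20AC",       # U+20AC, euro sign
    "\U00024B62",   # U+24B62, outside the BMP
]
for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
# U+0024: UTF-8 1 bytes, UTF-16 2 bytes
# U+00A2: UTF-8 2 bytes, UTF-16 2 bytes
# U+10D0: UTF-8 3 bytes, UTF-16 2 bytes
# U+20AC: UTF-8 3 bytes, UTF-16 2 bytes
# U+24B62: UTF-8 4 bytes, UTF-16 4 bytes
```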

Examples:

UTF-8 encoding:
00100100 for "$" (one 8-bit unit)
11000010 10100010 for "¢" (two 8-bit units)
11100010 10000010 10101100 for "€" (three 8-bit units)

UTF-16 encoding (bytes shown in big-endian order):
00000000 00100100 for "$" (one 16-bit unit)
11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units)
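The bit patterns in these examples can be reproduced with the Python sketch below. The helpers `bits` and `surrogate_pair` are names introduced here. The surrogate arithmetic follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)

def surrogate_pair(cp: int):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

print(bits("\u20AC".encode("utf-8")))
# 11100010 10000010 10101100
print(bits("\U00024B62".encode("utf-16-be")))
# 11011000 01010010 11011111 01100010
high, low = surrogate_pair(0x24B62)
print(f"{high:04X} {low:04X}")
# D852 DF62
```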

5. Advantages and disadvantages of UTF-8 and UTF-16

Both UTF-8 and UTF-16 are variable-length encodings. The smallest unit in UTF-8 is 8 bits; in UTF-16 it is 16 bits.

Advantages of UTF-8:
1. It is backward compatible with ASCII (US-ASCII): any ASCII text is already valid UTF-8.
2. It produces no null bytes (other than for U+0000 itself), which allows null-terminated strings and brings a great deal of backward compatibility.
3. It is independent of byte order, so there is no big-endian / little-endian issue to worry about.
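The three points above can be checked with a short Python sketch. The sample strings are arbitrary choices for this example.

```python
# 1. ASCII text encodes to the same bytes in ASCII and in UTF-8.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)   # True

# 2. Non-ASCII text never produces a zero byte in UTF-8, because every byte
#    of a multi-byte sequence has its high bit set.
mixed = "\u6C49\u8BED $\u00A2\u20AC"
has_null = b"\x00" in mixed.encode("utf-8")
print(has_null)     # False

# 3. UTF-8 has a single byte order; UTF-16 has two serializations.
utf16_le = "A".encode("utf-16-le")
utf16_be = "A".encode("utf-16-be")
print(utf16_le, utf16_be)   # b'A\x00' b'\x00A'
```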

Disadvantages of UTF-8:

1. Many common characters have different lengths, which makes indexing by code point and counting code points slow, since both require scanning the text.
2. Even though byte order doesn't matter, UTF-8 text sometimes still carries a BOM (byte order mark) to signal that it is encoded in UTF-8. The BOM breaks compatibility with ASCII software even if the text contains only ASCII characters. Microsoft software (such as Notepad) is especially prone to adding a BOM to UTF-8.
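Both drawbacks can be sketched in Python. `count_code_points` is a hypothetical helper that counts every byte that is not a continuation byte (`10xxxxxx`), and it assumes its input is valid UTF-8. `utf-8-sig` is Python's name for the codec that writes and strips the BOM.

```python
import codecs

def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    # There is no shortcut: every byte has to be inspected.
    return sum(1 for b in data if (b & 0xC0) != 0x80)

data = "a\u00A2\u20AC".encode("utf-8")
print(len(data), count_code_points(data))   # 6 3

# A BOM in front of pure-ASCII text makes the bytes non-ASCII.
with_bom = "hi".encode("utf-8-sig")
print(with_bom == codecs.BOM_UTF8 + b"hi")   # True
print(with_bom.decode("utf-8") == "\ufeffhi")  # True: plain utf-8 keeps the BOM
print(with_bom.decode("utf-8-sig") == "hi")    # True: utf-8-sig strips it
try:
    with_bom.decode("ascii")
except UnicodeDecodeError:
    print("not valid ASCII any more")
```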

Advantages of UTF-16:
1. BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory), and most Japanese, can be represented with 2 bytes. This speeds up indexing and counting code points when the text contains no supplementary characters.
2. Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, so the total length is still divisible by two and a 16-bit char can serve as the primitive component of the string.

Disadvantages of UTF-16:
1. US-ASCII strings contain lots of null bytes, which rules out null-terminated byte strings and wastes a lot of memory.
2. Treating it as a fixed-length encoding "mostly works" in many common scenarios (especially in the US, the EU, countries with Cyrillic alphabets, Israel, Arab countries, Iran, and many others), which often leads to broken support where it doesn't. Programmers have to be aware of surrogate pairs and handle them properly where it matters.
3. It is variable length, so counting or indexing code points is costly, though less so than in UTF-8.
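The first two pitfalls show up in a short Python sketch. `utf16_code_units` is a helper name introduced here. Python's own `len` counts code points, so the two numbers diverge as soon as a supplementary character appears.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# ASCII text in UTF-16: every other byte is zero.
print("Hi".encode("utf-16-le"))   # b'H\x00i\x00'

bmp_only = "abc"
with_supplementary = "a\U00024B62b"

# For BMP-only text, code units and code points agree ...
print(len(bmp_only), utf16_code_units(bmp_only))                       # 3 3
# ... but one supplementary character takes two code units, so code that
# assumes "one unit = one character" is off by one from here on.
print(len(with_supplementary), utf16_code_units(with_supplementary))   # 3 4
```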

In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.

Reference examples:

"A" in ASCII is hex 0x41; in UTF-8 it is also 0x41; in UTF-16 it is 0x0041 "À" in Latin-1 is 0xC0; in UTF-8 it is 0xC3 0x80; in UTF-16 it is 0x00C0, The Tibetan letter ཨ in UTF-8 is 0xE0 0xBD 0xA8; it UTF-16 it is 0x0F68, This character*: http://www.fileformat.info/info/... in UTF-8 is 0xF0 0xA0 0x80 0x8B; in UTF-16 it is 0xD840 0xDC0B
