How Is UTF-8 Encoded?

Date: 2019-11-21


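As everyone knows, a computer stores binary 0s and 1s. So how does a String become binary 0s and 1s?

Each character is mapped to a number (its code point), which is usually written in hexadecimal. Hexadecimal is just a compact notation for binary, so that number corresponds to the 0s and 1s stored on the computer.

Different character sets use different encodings, and these store a character in different numbers of bytes. The rest of this article uses UTF-8 as the example of how characters are encoded into binary: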

String a = "A";

a.getBytes().length is 1

byte array is [65]


String a = "ë";

a.getBytes().length is 2

byte array is [-61, -85]

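As shown above, the character A takes one byte and the character ë takes two bytes.

This assumes that the default encoding is UTF-8: the no-argument getBytes() uses the platform default charset.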
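To avoid depending on the platform default, the charset can be passed explicitly. The following is a minimal sketch; the class name GetBytesDemo and the helper utf8 are made up for illustration, and ë is written as the escape \u00EB so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GetBytesDemo {
    // Encode a string as UTF-8 regardless of the platform default charset.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(utf8("A")));      // [65]
        System.out.println(Arrays.toString(utf8("\u00EB"))); // [-61, -85]  (ë)
    }
}
```

Unlike getBytes("UTF-8"), the StandardCharsets overload does not throw a checked exception.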

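Some characters take one byte, others take two or more. So how does a decoder know where each character ends?

How does UTF-8 encode characters? Wikipedia gives the following rules: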

if the first byte starts with 0, it is a single-byte character

if the first byte starts with 110, the character takes 2 bytes

if the first byte starts with 1110, the character takes 3 bytes

if the first byte starts with 11110, the character takes 4 bytes

if the first byte starts with 111110, the character takes 5 bytes

if the first byte starts with 1111110, the character takes 6 bytes

Every byte after the first in a multi-byte character starts with 10. The 5- and 6-byte forms come from the original UTF-8 design; since RFC 3629 (2003), UTF-8 is limited to 4 bytes, which is enough to reach U+10FFFF.
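The lead-byte rules can be checked with bit masks. This is a rough sketch, not a validator: the class and method names are invented here, and it assumes the modern four-byte limit, so the 5- and 6-byte patterns are reported as unsupported.

```java
class Utf8LeadByte {
    // Returns the sequence length announced by a lead byte, or -1 if the byte
    // is a continuation byte (10xxxxxx) or matches none of the 1-4 byte patterns.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;               // view the signed byte as 0..255
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1
        System.out.println(sequenceLength((byte) 0xC3)); // 2
        System.out.println(sequenceLength((byte) 0xE2)); // 3
        System.out.println(sequenceLength((byte) 0xAB)); // -1 (continuation byte)
    }
}
```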

So decoding simply applies the rules in reverse:

if the first byte starts with 0, it is a single-byte character, so decode only that byte

if the first byte starts with 110, the character takes 2 bytes, so decode 2 consecutive bytes

if the first byte starts with 1110, the character takes 3 bytes, so decode 3 consecutive bytes

if the first byte starts with 11110, the character takes 4 bytes, so decode 4 consecutive bytes

if the first byte starts with 111110, the character takes 5 bytes, so decode 5 consecutive bytes

if the first byte starts with 1111110, the character takes 6 bytes, so decode 6 consecutive bytes
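The decoding rules above can be sketched as a small loop that reads the lead byte, takes the announced number of bytes, and joins their payload bits into one code point. This is an illustrative sketch with invented names, not a production decoder: it handles 1 to 4 bytes only and does not reject overlong encodings or surrogate code points, which a conforming decoder must do.

```java
import java.util.ArrayList;
import java.util.List;

class Utf8Decoder {
    // Decodes UTF-8 bytes into a list of Unicode code points.
    static List<Integer> decode(byte[] in) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            int len, cp;
            if ((b & 0x80) == 0x00)      { len = 1; cp = b; }        // 0xxxxxxx
            else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; } // 110xxxxx
            else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; } // 1110xxxx
            else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; } // 11110xxx
            else throw new IllegalArgumentException("bad lead byte at index " + i);
            if (i + len > in.length) {
                throw new IllegalArgumentException("truncated sequence at index " + i);
            }
            for (int k = 1; k < len; k++) {
                int c = in[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) {                            // must be 10xxxxxx
                    throw new IllegalArgumentException("bad continuation byte at index " + (i + k));
                }
                cp = (cp << 6) | (c & 0x3F);                         // append 6 payload bits
            }
            out.add(cp);
            i += len;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0x41, (byte) 0xC3, (byte) 0xAB};      // "A" followed by "ë"
        System.out.println(decode(bytes));                           // [65, 235]
    }
}
```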

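The table below shows the relationship between Unicode code points (in hexadecimal) and the number of bytes used, for the forms valid in current UTF-8:

Code point range (hex)    Bytes    Byte pattern
0000 - 007F               1        0xxxxxxx
0080 - 07FF               2        110xxxxx 10xxxxxx
0800 - FFFF               3        1110xxxx 10xxxxxx 10xxxxxx
10000 - 10FFFF            4        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx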

Worked examples

ë 110 xxxxx 10 xxxxxx

110 00011 10 101011

00011 101011 → binary form of the hex code point for ë (000 1110 1011 = EB)

ɟ 110 xxxxx 10 xxxxxx

110 01001 10 011111

01001 011111 → binary form of the hex code point for ɟ (010 0101 1111 = 25F)
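Going the other way, the two examples above can be reproduced by splitting a code point into payload bits and adding the prefix bits. The sketch below uses invented names and assumes the four-byte limit; it does not reject surrogate code points (U+D800 to U+DFFF), which a strict encoder would.

```java
import java.util.Arrays;

class Utf8Encoder {
    // Encodes one Unicode code point as UTF-8 (1 to 4 bytes).
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(0xEB)));  // [-61, -85]  = 11000011 10101011 (ë)
        System.out.println(Arrays.toString(encode(0x25F))); // [-55, -97]  = 11001001 10011111 (ɟ)
    }
}
```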

How is 11100000 10101101 10011111 decoded? The leading 1110 means three bytes form one character: 1110xxxx 10xxxxxx 10xxxxxx

11100000 10101101 10011111

so 0000 101101 011111 is the binary to be decoded.

Regrouped into four-bit groups, this is 0000 1011 0101 1111, which is B5F in hexadecimal (a binary-to-hex converter can do this step). In the Unicode code charts, B5F (U+0B5F) is ୟ.
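The same three-byte arithmetic can be written with masks and shifts. This is a small sketch with hypothetical names; it assumes the three inputs are already known to form a valid 1110xxxx 10xxxxxx 10xxxxxx sequence and does no checking.

```java
class ThreeByteExample {
    // Joins the payload bits of a three-byte UTF-8 sequence into a code point.
    static int decode3(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12)   // 4 payload bits from 1110xxxx
             | ((b1 & 0x3F) << 6)    // 6 payload bits from 10xxxxxx
             |  (b2 & 0x3F);         // 6 payload bits from 10xxxxxx
    }

    public static void main(String[] args) {
        int cp = decode3(0b11100000, 0b10101101, 0b10011111);
        System.out.println(Integer.toHexString(cp).toUpperCase()); // B5F
        System.out.println(new String(Character.toChars(cp)));     // the character U+0B5F
    }
}
```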

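Exercise: decode 01000010 01000001 11000011 10110000 11100010 10001011 10110011

1. First character: 01000010 is a single-byte character. 0100 0010 is 42, which maps to the character B.

2. Second character: 01000001 is a single-byte character. 0100 0001 is 41, which maps to the character A.

3. Third character: 11000011 10110000 is one character. The payload bits are 00011 110000, regrouped as 000 1111 0000, which is F0 and maps to the character ð.

4. Fourth character: 11100010 10001011 10110011 is one character. The payload bits are 0010 001011 110011, regrouped as 0010 0010 1111 0011, which is 22F3 and maps to the character ⋳.

Conclusion: 01000010 01000001 11000011 10110000 11100010 10001011 10110011 decoded as UTF-8 is BAð⋳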
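The exercise result can be checked against the JDK's own decoder. The sketch below (class and method names are illustrative) parses the space-separated bit groups into bytes and decodes them as UTF-8.

```java
import java.nio.charset.StandardCharsets;

class ExerciseCheck {
    // Turns "01000010 01000001 ..." into bytes and decodes them as UTF-8.
    static String decodeBits(String bits) {
        String[] groups = bits.trim().split("\\s+");
        byte[] bytes = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            bytes[i] = (byte) Integer.parseInt(groups[i], 2); // 0..255 narrowed to a signed byte
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeBits(
            "01000010 01000001 11000011 10110000 11100010 10001011 10110011");
        System.out.println(s);                       // B, A, U+00F0, U+22F3
        s.codePoints().forEach(cp ->
            System.out.println(Integer.toHexString(cp))); // 42, 41, f0, 22f3
    }
}
```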

What does String's getBytes("UTF-8") do?

String s = "ABCDEF⋳";

getBytes("UTF-8") encodes ABCDEF⋳ as UTF-8. What do the resulting bytes look like?

A - 01000001

B - 01000010

C - 01000011

D - 01000100

E - 01000101

F - 01000110

⋳ - 11100010 10001011 10110011

Note: the above is how the encoded text is laid out in memory, byte by byte.

So getBytes("UTF-8") returns exactly these bytes, in order.
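The bit patterns listed above can be printed directly. In this sketch the helper names are made up, and ⋳ is written as the escape \u22F3 so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class BinaryDump {
    // Formats one byte as eight binary digits, keeping leading zeros.
    static String toBinary(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return String.format("%8s", s).replace(' ', '0');
    }

    // Encodes the text as UTF-8 and joins the bytes as space-separated bit groups.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(toBinary(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 01000001 01000010 01000011 01000100 01000101 01000110 11100010 10001011 10110011
        System.out.println(dump("ABCDEF\u22F3"));
    }
}
```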

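How do these bytes appear as numbers in Java?

01000001 is the positive number 65, but 11100010 is the negative number -30, because a Java byte is a signed 8-bit two's-complement value (226 - 256 = -30).

So the byte values are:

01000001 - 65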

01000010 - 66

01000011 - 67

01000100 - 68

01000101 - 69

01000110 - 70

11100010 - -30

10001011 - -117

10110011 - -77
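The negative values come from Java's signed byte type. A short sketch (the class and method names are illustrative) shows the signed and unsigned readings of the same bits, using the standard Byte.toUnsignedInt helper.

```java
class SignedBytes {
    // Reads the same 8 bits as an unsigned value in the range 0..255.
    static int unsigned(byte b) {
        return Byte.toUnsignedInt(b); // equivalent to b & 0xFF
    }

    public static void main(String[] args) {
        byte[] tail = {(byte) 0b11100010, (byte) 0b10001011, (byte) 0b10110011};
        for (byte b : tail) {
            // prints: -30 -> 226, -117 -> 139, -77 -> 179
            System.out.println(b + " -> " + unsigned(b));
        }
    }
}
```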

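The code below confirms this:

String s = "ABCDEF⋳";
byte[] bs = s.getBytes("UTF-8"); // may throw UnsupportedEncodingException; the caller must catch or declare it
for (byte b : bs)
    System.out.print(b + ","); // signed value of each byte, comma-terminated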

Output:

65,66,67,68,69,70,-30,-117,-77,
