Code point, code unit

Date: 2019-11-10


Let's start from a piece of API documentation. The Javadoc for the length method of String reads as follows:

length

public int length()
Returns the length of this string. The length is equal to the number of 16-bit Unicode characters in the string.

Specified by:
length in interface CharSequence

Returns:
the length of the sequence of characters represented by this object.

One sentence in it stands out:

The length is equal to the number of 16-bit Unicode characters in the string.

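Taken literally: the value of length() equals the number of 16-bit Unicode characters in the string. In today's terminology, those 16-bit values are UTF-16 code units.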

1. Why 16 bits?

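Unicode is a character set that covers almost all of the world's writing systems, and it assigns every character a unique number. The most commonly used part of Unicode is the BMP (Basic Multilingual Plane), which holds the characters numbered U+0000 to U+FFFF and includes the vast majority of characters in everyday use. The Unicode version referred to here is 8.0. Unicode was later extended well beyond the BMP: the supplementary planes run from U+10000 up to U+10FFFF.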

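As you can see, every character in the BMP range U+0000 to U+FFFF has a number written with four hexadecimal digits. Each hexadecimal digit takes 4 bits, so one BMP character number takes 16 bits.

Every BMP character therefore fits in a single 16-bit value, and for a string made up only of BMP characters the number of 16-bit values equals the number of characters. This does not hold for characters outside the BMP: in UTF-16 each of them is stored as two 16-bit values (a surrogate pair), so length() counts such a character twice.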
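To make the 16-bit argument concrete, here is a minimal sketch that asks the JDK how many 16-bit values a given character needs. The class name `UnitsPerChar` and the helper `unitsFor` are made up for this illustration; `Character.charCount` and `Character.toChars` are standard JDK methods. U+1F600 is used only as a sample character outside the BMP.

```java
class UnitsPerChar {
    // Number of 16-bit UTF-16 values needed to store one Unicode character number.
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // U+738B (王) lies inside the BMP, U+1F600 lies outside it.
        System.out.println(unitsFor(0x738B));   // 1
        System.out.println(unitsFor(0x1F600));  // 2

        String bmp = "王";
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(bmp.length());            // 1
        System.out.println(supplementary.length());  // 2
    }
}
```

So "one character, one 16-bit value" is a property of the BMP, not of Unicode as a whole.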

[Unicode BMP](https://en.wikipedia.org/wiki/Plane_(Unicode))

(Image: mapping between Unicode and UTF-8)

2. What does codePoint mean in the String API?

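A code point is the number that Unicode assigns to a character. For a BMP character, the code point fits in one 16-bit value, so one 16-bit value is one code point. Code points outside the BMP need two 16-bit values.

On the relationship between code point and code unit, see the Wikipedia article on code point.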
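The difference between counting 16-bit values and counting code points can be checked with a short sketch. The class name `CodePointDemo` and the helper `countCodePoints` are hypothetical; `String.codePointCount` and `String.codePoints` (Java 8 and later) are standard JDK methods. The sample string mixes two BMP characters with one character outside the BMP.

```java
class CodePointDemo {
    // Number of Unicode code points in s, as opposed to s.length().
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A", then U+1F600 (outside the BMP), then "王".
        String s = "A" + new String(Character.toChars(0x1F600)) + "王";

        System.out.println(s.length());          // 4: 16-bit values
        System.out.println(countCodePoints(s));  // 3: code points

        // Print each code point in the usual U+XXXX notation.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

For a string that contains only BMP characters the two counts are the same.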

3. What is a code unit?

The code unit size is equivalent to the bit measurement for the particular encoding:

A code unit in US-ASCII consists of 7 bits;
A code unit in UTF-8, EBCDIC and GB18030 consists of 8 bits;
A code unit in UTF-16 consists of 16 bits;
A code unit in UTF-32 consists of 32 bits.
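The code unit sizes quoted above can be observed by encoding the same text in two encodings and counting units. This is only a sketch: the class name `CodeUnitCount` and its two helpers are invented for the example, while `String.getBytes(Charset)` and `StandardCharsets` are standard JDK APIs. UTF_16BE is chosen because it does not prepend a byte order mark, which would distort the count.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitCount {
    // UTF-8 code units are 8 bits, so the unit count is the byte count.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16 code units are 16 bits, so the unit count is bytes / 2.
    static int utf16Units(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length / 2;
    }

    public static void main(String[] args) {
        String s = "π王A23";
        System.out.println(utf8Units(s));   // 8: π takes 2 units, 王 takes 3, A, 2 and 3 take 1 each
        System.out.println(utf16Units(s));  // 5: one unit per character here
        System.out.println(s.length());     // 5: length() counts UTF-16 code units
    }
}
```

The same five characters take a different number of code units in each encoding, which is why the code unit is a property of the encoding and not of the character.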

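Summary:

A code point is a concept defined by Unicode: the number assigned to a character, for example U+0041 for A. A code unit is a concept defined by an encoding: the fixed-size piece the encoding stores, which is 16 bits in UTF-16. For BMP characters one code point takes exactly one 16-bit code unit, so for such strings the number of code units is also the number of characters.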

For example:

public class Demo {
    public static void main(String[] args) {
        String s = "π王A23";
        // π, 王, A, 2 and 3 are all BMP characters:
        // each one is a single code point stored in one 16-bit code unit.
        System.out.println("Length of string s: " + s.length());
        // codePointAt takes a char index; index 2 is the third character here.
        System.out.println("Code point at index 2: " + s.codePointAt(2));
    }
}

Output:

Length of string s: 5
Code point at index 2: 65

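Here 5 means the string holds five Unicode characters, each stored as one 16-bit code unit. 65 is the code point of the letter A (U+0041), which is the third character.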
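One caveat about the example: the argument of codePointAt is an index into the 16-bit code units, not "the n-th code point". The two agree only while every character is in the BMP. A possible way to ask for the n-th code point is sketched below; the class name `NthCodePoint` and the helper `nthCodePoint` are hypothetical, while `String.offsetByCodePoints` is a standard JDK method.

```java
class NthCodePoint {
    // Code point number n (0-based) of s, counting code points rather than chars.
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "π王A23";
        System.out.println(s.codePointAt(2));    // 65
        System.out.println(nthCodePoint(s, 2));  // 65, same result for BMP-only text

        // U+1F600 occupies char indexes 0 and 1, so 王 sits at index 2 and A at index 3.
        String t = new String(Character.toChars(0x1F600)) + "王A";
        System.out.println(t.codePointAt(2));    // 29579, which is 王 (U+738B)
        System.out.println(nthCodePoint(t, 2));  // 65, the third code point is A
    }
}
```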

The best way to learn about Unicode is to read the relevant Wikipedia articles.