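A couple of days ago I came across an online discussion about encodings and took the time to study the definitions of Unicode, UTF-8, and UTF-16. This post aims to help readers really understand what they are.
Before reading on, I recommend this article: http://www.freebuf.com/articles/others-articles/25623.html. If you don't have the patience to finish it, that's fine; you only need to take away three points: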
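1. There is no such thing as plain text. To read a string you must know its encoding; if you don't know how a byte stream is encoded, you can never know what it contains.
2. Unicode is a simple standard that maps characters to numbers. The Unicode Consortium takes care of everything behind the scenes, including assigning numbers to new characters. Every character we use has a mapping in Unicode, and each such number is called a code point (http://en.wikipedia.org/wiki/Code_point).
3. Unicode does not tell you how characters are encoded as bytes. That is decided by an encoding scheme, specified through one of the UTFs.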
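To make the first point concrete, here is a minimal sketch that decodes the same three bytes under two different encodings. The class and method names are my own, chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class SameBytesDifferentText {
    // Three bytes on disk. Their meaning depends entirely on the encoding we assume.
    static final byte[] BYTES = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};

    static String decodeAsUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    static String decodeAsLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Read as UTF-8, the bytes form a single character, U+20AC.
        System.out.println(decodeAsUtf8().length());   // prints 1
        // Read as ISO-8859-1, every byte becomes its own, unrelated character.
        System.out.println(decodeAsLatin1().length()); // prints 3
    }
}
```

Neither result is "wrong" as far as the bytes are concerned; only knowledge of the encoding tells you which one the writer meant.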
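After reading that article you should understand how a binary stream becomes characters on screen:
binary stream -> decode code points according to the encoding -> look up the character for each Unicode code point -> the system renders the character
And how text characters are saved on a computer:
input character -> find its code point -> encode the code point into a binary stream according to the encoding -> save the binary stream to disk
From this we can see that whether characters can be recovered from a binary stream hinges on knowing the stream's encoding. Once you know the encoding, you can decode by running the corresponding inverse process.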
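The two pipelines above can be sketched with Java's standard library, assuming UTF-8 as the agreed encoding. The `save` and `load` names are hypothetical and work in memory only; nothing is written to disk here.

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // "Save" direction: characters -> code points -> bytes.
    static byte[] save(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // "Read" direction: bytes -> code points -> characters.
    static String load(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u20AC";               // the euro sign
        int codePoint = text.codePointAt(0);  // character -> code point
        byte[] bytes = save(text);            // code point -> bytes
        System.out.printf("U+%04X -> %d bytes%n", codePoint, bytes.length);
        // Decoding with the same encoding restores the original text.
        System.out.println(load(bytes).equals(text));
    }
}
```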
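At this point some readers will surely ask: why encode at all? Wouldn't it be simpler to store the code point directly as a binary number?
The reason is that a well-designed encoding gives us many extra capabilities, including error detection and correction (especially important in network communication) and self-synchronization (decoding can start somewhere other than the beginning of the text). From an information-theory point of view, encoding adds redundant information, and it is this redundancy that provides the extra capabilities.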
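Let's look at how UTF-8 and UTF-16 actually encode:
UTF-8 has the following properties:
1. It is a variable-length encoding; the first byte determines how many bytes the character occupies.
2. It is backward compatible with ASCII (which is why an ASCII text file opens perfectly when read as UTF-8).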
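The ASCII compatibility claim is easy to check. The sketch below (the helper name is my own) encodes a string both ways and compares the resulting bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiCompatibility {
    // True when the text produces identical bytes as US-ASCII and as UTF-8.
    static boolean sameBytesInBothEncodings(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Pure ASCII text: both encodings agree byte for byte.
        System.out.println(sameBytesInBothEncodings("Hello, $24")); // prints true
        // U+00A2 is outside ASCII, so the two encodings diverge.
        System.out.println(sameBytesInBothEncodings("\u00A2"));     // prints false
    }
}
```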
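UTF-8 encoding rules:
A single-byte (ASCII) character starts with a 0 bit. In a multi-byte sequence, the lead byte starts with as many 1 bits as the sequence has bytes, followed by a single 0, so reading the lead byte is enough to know how many bytes the character occupies. Every following byte starts with 10.
A few concrete examples: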
Character | Code point | Binary code point | Binary UTF-8 | Hex UTF-8
---|---|---|---|---
$ | U+0024 | 0100100 | 00100100 | 24
¢ | U+00A2 | 000 10100010 | 11000010 10100010 | C2 A2
€ | U+20AC | 00100000 10101100 | 11100010 10000010 10101100 | E2 82 AC
𤭢 | U+24B62 | 00010 01001011 01100010 | 11110000 10100100 10101101 10100010 | F0 A4 AD A2
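The bit prefixes are also what makes UTF-8 self-synchronizing: a byte of the form 10xxxxxx can never start a character, so a reader dropped into the middle of a stream just skips such bytes until it reaches a lead byte. Here is a minimal sketch of both ideas. The names are hypothetical, and the code only inspects prefixes; it does not validate the sequences.

```java
class Utf8Boundaries {
    // Length announced by a lead byte, or -1 if the byte cannot start a sequence.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx or an unused pattern
    }

    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;        // 10xxxxxx
    }

    // Index of the first character boundary at or after the given offset.
    static int nextBoundary(byte[] bytes, int offset) {
        int i = offset;
        while (i < bytes.length && isContinuation(bytes[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // The four example characters back to back: 24 | C2 A2 | E2 82 AC | F0 A4 AD A2
        byte[] bytes = {0x24, (byte) 0xC2, (byte) 0xA2, (byte) 0xE2, (byte) 0x82,
                (byte) 0xAC, (byte) 0xF0, (byte) 0xA4, (byte) 0xAD, (byte) 0xA2};
        // Index 4 is in the middle of the euro sign; the next character starts at index 6.
        System.out.println(nextBoundary(bytes, 4));  // prints 6
        System.out.println(sequenceLength(bytes[6])); // prints 4
    }
}
```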
    import java.util.Scanner;

    public class Utf8 {
        // Mask that keeps the low six bits of a value.
        private static final long RightSix = (1 << 6) - 1;
        // Every continuation byte has the form 10xxxxxx.
        private static final long PrefixForContinuasByte = 1 << 7;

        /** Parses a hexadecimal string such as "20AC" into a number. */
        private static long HexStringToLong(String s) {
            return Long.parseLong(s.trim(), 16);
        }

        /**
         * @param codePoint in unicode
         * @return corresponding utf8 bytes, packed into a long with the first byte most significant
         * @throws Exception if the value is not an encodable code point
         */
        public static long EncodeToUtf8(long codePoint) throws Exception {
            // Unicode ends at U+10FFFF, and the surrogate range U+D800..U+DFFF is not encodable.
            if (codePoint < 0 || codePoint > 0x10FFFF
                    || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
                throw new Exception("Illegal code point!");
            if (codePoint <= 0x007F) {
                return codePoint; // ascii character
            } else if (codePoint <= 0x07FF) {
                long byte1 = (6 << 5) + (codePoint >> 6); // 110xxxxx
                long byte2 = PrefixForContinuasByte + (codePoint & RightSix);
                return (byte1 << 8) + byte2;
            } else if (codePoint <= 0xFFFF) {
                long byte1 = (14 << 4) + (codePoint >> 12); // 1110xxxx
                long byte2 = PrefixForContinuasByte + ((codePoint >> 6) & RightSix);
                long byte3 = PrefixForContinuasByte + (codePoint & RightSix);
                return (byte1 << 16) + (byte2 << 8) + byte3;
            } else {
                long byte1 = (30 << 3) + (codePoint >> 18); // 11110xxx
                long byte2 = PrefixForContinuasByte + ((codePoint >> 12) & RightSix);
                long byte3 = PrefixForContinuasByte + ((codePoint >> 6) & RightSix);
                long byte4 = PrefixForContinuasByte + (codePoint & RightSix);
                return (byte1 << 24) + (byte2 << 16) + (byte3 << 8) + byte4;
            }
        }

        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            try {
                while (true) {
                    System.out.print("Input a number in Hex format:");
                    if (!sc.hasNextLine()) break; // stop cleanly at end of input
                    String s = sc.nextLine();
                    long utf8 = EncodeToUtf8(HexStringToLong(s));
                    String hexString = Long.toHexString(utf8);
                    System.out.println("You input " + s
                            + " in Hex format and we encode it to utf8 character " + hexString);
                }
            } catch (Exception e) {
                // Any invalid input ends the program with a message.
                System.out.println(e.getLocalizedMessage());
            }
        }
    }
Sample run:
    Input a number in Hex format:24
    You input 24 in Hex format and we encode it to utf8 character 24
    Input a number in Hex format:A2
    You input A2 in Hex format and we encode it to utf8 character c2a2
    Input a number in Hex format:20AC
    You input 20AC in Hex format and we encode it to utf8 character e282ac
    Input a number in Hex format:24B62
    You input 24B62 in Hex format and we encode it to utf8 character f0a4ada2
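Since decoding is just the inverse process, here is a rough sketch of going the other way: reading one UTF-8 sequence and recovering its code point. This is my own illustrative helper, not a complete decoder. It checks the bit prefixes but does not reject overlong forms or out-of-range values, and it assumes the array is not empty.

```java
class Utf8Decoder {
    // Decodes the single UTF-8 sequence that starts at bytes[0].
    static int decodeFirst(byte[] bytes) {
        int lead = bytes[0] & 0xFF;
        int length;
        int codePoint;
        if (lead < 0x80) {
            return lead;                 // 0xxxxxxx: plain ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2;
            codePoint = lead & 0x1F;     // payload bits of 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3;
            codePoint = lead & 0x0F;     // payload bits of 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4;
            codePoint = lead & 0x07;     // payload bits of 11110xxx
        } else {
            throw new IllegalArgumentException("Not a lead byte");
        }
        if (bytes.length < length) {
            throw new IllegalArgumentException("Truncated sequence");
        }
        for (int i = 1; i < length; i++) {
            int b = bytes[i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                throw new IllegalArgumentException("Expected a continuation byte");
            }
            // Each continuation byte contributes its low six bits.
            codePoint = (codePoint << 6) | (b & 0x3F);
        }
        return codePoint;
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        System.out.println(Integer.toHexString(decodeFirst(euro))); // prints 20ac
    }
}
```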
For the UTF-16 encoding rules, readers can refer to this article: http://en.wikipedia.org/wiki/UTF-16
Here is the code for encoding to UTF-16BE:
    import java.util.Scanner;

    public class Utf16 {
        private static final long Substracted = 0x10000;
        private static final long AddToHigh = 0xD800;
        private static final long AddToLow = 0xDC00;

        /** Parses an uppercase hexadecimal string; an empty string yields 0. */
        private static long HexStringToLong(String s) {
            if (s.length() == 0)
                return 0;
            long ans = 0;
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c >= '0' && c <= '9')
                    ans = (ans << 4) + (c - '0');
                else if (c >= 'A' && c <= 'F')
                    ans = (ans << 4) + (c - 'A' + 10);
                else
                    throw new NumberFormatException();
            }
            return ans;
        }

        /**
         * @param codePoint in unicode
         * @return corresponding utf16 code units in big-endian order, packed into a long
         * @throws NumberFormatException if the value is a surrogate or outside the Unicode range
         */
        public static long EncodeToUtf16BE(long codePoint) throws Exception {
            if (codePoint < 0 || (codePoint <= 0xDFFF && codePoint >= 0xD800) || codePoint > 0x10FFFF)
                throw new NumberFormatException();
            // Basic Multilingual Plane: a single 16-bit code unit. Surrogates were rejected
            // above, so this also covers U+E000..U+FFFF.
            if (codePoint <= 0xFFFF) {
                return codePoint;
            } else {
                long sub = codePoint - Substracted; // 20-bit value
                long high = sub >> 10;              // top ten bits
                long low = sub & 0x3FF;             // bottom ten bits
                long word1 = AddToHigh + high;      // high surrogate
                long word2 = AddToLow + low;        // low surrogate
                return (word1 << 16) + word2;
            }
        }

        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            while (true) {
                System.out.print("Input a number in hex format:");
                if (!sc.hasNextLine()) break; // stop cleanly at end of input
                String s = sc.nextLine();
                try {
                    String utf16 = Long.toHexString(EncodeToUtf16BE(HexStringToLong(s)));
                    System.out.println("You input " + s + " we encode it to utf16-BE " + utf16);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
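The "BE" in UTF-16BE stands for big-endian: within each 16-bit unit the more significant byte is written first. As a quick cross-check, the sketch below asks Java's built-in charsets for the bytes in both byte orders and reverses the surrogate arithmetic. The class and helper names are my own.

```java
import java.nio.charset.StandardCharsets;

class Utf16Check {
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes one code point with the JDK's own UTF-16 charsets (no byte order mark).
    static String encodeHex(int codePoint, boolean bigEndian) {
        String text = new String(Character.toChars(codePoint));
        return toHex(text.getBytes(
                bigEndian ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE));
    }

    // Inverse of the surrogate arithmetic: rebuilds the code point from a pair.
    static int combine(int high, int low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        System.out.println(encodeHex(0x24B62, true));  // prints D852DF62
        System.out.println(encodeHex(0x24B62, false)); // prints 52D862DF
        System.out.println(Integer.toHexString(combine(0xD852, 0xDF62))); // prints 24b62
    }
}
```

The big-endian result, D852DF62, is the value the surrogate arithmetic gives for U+24B62: 0x24B62 - 0x10000 = 0x14B62, whose top ten bits are 0x52 and bottom ten bits are 0x362.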