Encoding in Python

Date: 2019-11-08

Tags: python, encoding. Category: Python


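Lately I have been so tangled up in Python's encoding problems that I could barely function. Yesterday I finally sat down and read the documentation, sorted out how encoding works in Python 3, and wrote this post to share and record what I learned (including how UTF-8 works).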

Note:

All examples below use Python 3, since Python 3 is steadily becoming mainstream.
When I do use Python 2, I usually write it in a compatible style:
>>> from __future__ import print_function, unicode_literals

How encoding differs between Python 2 and Python 3 (you can skip this and come back to it at the end):

Quoted from the book Effective Python:
_
In Python3:
1. bytes: sequences of 8-bit values.
2. str: sequences of Unicode characters.
bytes and str instances can't be used together with operators (like > or +).
_
In Python 2:
1. str: contains sequences of 8-bit values.
2. unicode: contains sequences of Unicode characters.
str and unicode can be used together with operators if the str only contains 7-bit ASCII characters.
_
To be honest, until today my understanding of that passage went no further than "Python 3 has two types (str and bytes)" 😓.
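To make the Python 3 half of that quote concrete, here is a minimal sketch of what happens when bytes and str meet an operator. The helper name `try_concat` is made up for this illustration and is not from the book.

```python
def try_concat(left, right):
    """Hypothetical helper: return left + right, or the name of the error raised."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__

# bytes + str is rejected in Python 3
mixed = try_concat(b'abc', 'def')
# bytes + bytes works once the str has been encoded explicitly
same = try_concat(b'abc', 'def'.encode('ascii'))
# equality between bytes and str does not raise, it is simply False
equal = (b'abc' == 'abc')

print(mixed, same, equal)
```

The point is that Python 3 never converts between the two types implicitly; you have to call encode or decode yourself.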

1. The Python 3 str type (Unicode)

In Python 3, a str is by default a sequence of Unicode characters.

 
     In [1]: s = '哈哈哈'

     In [2]: type(s)
     Out[2]: str

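So the question is: what exactly is Unicode?
Everyone knows ASCII, which uses 7 bits to represent 128 characters.
When that turned out not to be enough, many clever people invented many extended character sets.
That created a new problem: a computer might work perfectly well in the United States, but as soon as it received an email from Japan, it was game over, because the two computers used different encodings.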

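So later, even cleverer people came up with Unicode:
it collects every character in the world and maps each one to a code point (think of it simply as a unique number), so text can be exchanged worldwide without turning into garbage. Lovely.
That is why one Chinese name for Unicode is 万国码 ("the code of ten thousand nations").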
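The "character maps to a unique number" idea can be checked directly with the built-ins `ord` and `chr`. This is only a small sketch; the variable names are arbitrary.

```python
s = '哈'

code_point = ord(s)              # character -> code point (a plain int)
print(code_point, hex(code_point))

round_trip = chr(code_point)     # code point -> character
print(round_trip == s)

# len() on a str counts characters (code points), not bytes
length = len('哈哈哈')
print(length)
```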

2. The Python 3 bytes type

Like str, bytes is a built-in type:

 
     In [7]: s = b'haha'

     In [8]: type(s)
     Out[8]: bytes

As I understand it, a bytes object is binary data stored in units of bytes, i.e. a blob of bytes.
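One way to see that a bytes object really is "a blob of bytes" is to look at its elements: each one is an integer between 0 and 255. A short sketch, with arbitrary variable names:

```python
data = b'haha'

first = data[0]              # indexing a bytes object yields an int
as_list = list(data)         # every element is an int in range(256)
rebuilt = bytes(as_list)     # and a list of such ints builds bytes again

print(first, as_list, rebuilt)
```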

3. Encoding/decoding:

Once the str and bytes types are clear, this question more or less answers itself.

Encoding:
str → bytes
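A str is just a sequence of Unicode characters (numbers).
So, simply put, encoding means taking those numbers and storing them in the computer as bytes, according to some encoding algorithm X (for example utf-8).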
Decoding:
bytes → str
An example:

 
      In [9]: s = '哈哈'

      In [10]: s.encode('utf-8')
      Out[10]: b'\xe5\x93\x88\xe5\x93\x88'

      In [11]: s.encode().decode('utf-8')
      Out[11]: '哈哈'
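The "encoding algorithm X" matters: the same str becomes different bytes under different codecs, and decoding with the wrong codec either fails or silently gives the wrong text. The sketch below uses `utf-16-le`, `ascii` and `latin-1` purely as comparison codecs; they are my choice for illustration and do not appear in the original post.

```python
text = '哈哈'

utf8 = text.encode('utf-8')          # 3 bytes per character here
utf16 = text.encode('utf-16-le')     # 2 bytes per character here
print(len(text), len(utf8), len(utf16))

# Decoding with a codec that cannot represent these bytes raises an error
try:
    utf8.decode('ascii')
    outcome = 'decoded'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

# latin-1 accepts any byte, so this "succeeds" but produces garbage text
wrong = utf8.decode('latin-1')
print(outcome, wrong != text)
```

This is the "email from Japan" problem in miniature: the bytes are fine, but the reader has to know which codec produced them.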

4. UTF-8 encoding

Here is a brief look at how Unicode is turned into bytes by the UTF-8 encoding, to help make clearer what encoding actually is.
UTF-8 is a variable-length encoding.

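Take a simple example of a variable-length encoding. Suppose we have this binary sequence:
10010001, 10000001, 10110010, 10110010
We can use the last bit of each byte as a flag (1 means continue, 0 means stop) to decide how many bytes to read. Here the first three bytes form one unit and the fourth byte forms another.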
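The toy scheme above can be sketched in a few lines. The function name `split_groups` is invented for this illustration, and the scheme itself is only the teaching example from the text, not a real encoding.

```python
def split_groups(raw):
    """Group bytes by the toy rule: last bit 1 = keep reading, last bit 0 = end."""
    groups, current = [], []
    for byte in raw:
        current.append(byte)
        if byte & 1 == 0:        # flag bit is 0, so this unit is complete
            groups.append(current)
            current = []
    # a trailing unit with no terminating byte is ignored in this sketch
    return groups

raw = bytes([0b10010001, 0b10000001, 0b10110010, 0b10110010])
groups = split_groups(raw)
readable = [[format(b, '08b') for b in g] for g in groups]
print(readable)
```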

UTF-8 follows a similar idea, but unlike the scheme above, it uses the leading bits of each byte as the flag, as shown in the table below:

1st Byte	2nd Byte	3rd Byte	4th Byte	Usable bits	Maximum value
0xxxxxxx				7	007F hex (127)
110xxxxx	10xxxxxx			(5+6)=11	07FF hex (2047)
1110xxxx	10xxxxxx	10xxxxxx		(4+6+6)=16	FFFF hex (65535)
11110xxx	10xxxxxx	10xxxxxx	10xxxxxx	(3+6+6+6)=21	10FFFF hex (1,114,111)
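As a rough sketch of how the table translates into code, the function below builds the UTF-8 bytes for a single code point by hand and compares the result with the built-in `str.encode`. The name `utf8_encode` is hypothetical, and the sketch does not reject surrogate code points the way the real codec does.

```python
def utf8_encode(code_point):
    """Hand-rolled UTF-8 for one code point, following the table row by row."""
    if code_point <= 0x7F:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError('code point out of range')

# One sample character per table row
samples = ['A', 'é', '嘿', '😓']
results = [utf8_encode(ord(ch)) == ch.encode('utf-8') for ch in samples]
print(results)
```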

(For a concrete encoding example, see the diagram below ↓)

Summary

I drew a diagram specifically to sum all of this up:

 
                'unicode: 01010110 00111111'
           +--- _str = '嘿' <---+
           |                    |
  encoding |                    | decoding
           |                    |
           +---> _bytes = b'\xe5\x98\xbf' ----+
                'utf-8: (1110)0101 (10)011000 (10)111111'

Note: look at the parts of the UTF-8 line that I put in parentheses, and compare them with the table above (third row).
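The bit patterns in the diagram can be reproduced with `format`. In this small sketch, stripping the flag bits (the parenthesised parts) from each UTF-8 byte and joining what is left gives back the code point's bits; the variable names are arbitrary.

```python
ch = '嘿'

# 16-bit binary form of the code point
code_bits = format(ord(ch), '016b')

# 8-bit binary form of each UTF-8 byte
byte_bits = [format(b, '08b') for b in ch.encode('utf-8')]

# Drop the flag bits: 4 from the first byte (1110), 2 from each later byte (10)
payload = byte_bits[0][4:] + byte_bits[1][2:] + byte_bits[2][2:]

print(code_bits)
print(byte_bits)
print(payload == code_bits)
```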
