Character encoding in Python 2 and Python 3

Date: 2019-11-08

Tags: python, character encoding. Category: Python


Python 2 and Python 3 still handle character encoding somewhat differently. This post tries to sort the issue out once and for all.

Where the character encoding problem comes from:

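It starts with the history of computing. Early on, computers in the English-speaking world used ASCII: a 7-bit code with 128 characters, stored one character per 8-bit byte. That covered every basic character of English, but nobody had planned for languages with far more characters than a single ASCII table can hold.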

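As computers spread around the world, ASCII could no longer cope, and text written in one regional encoding turned into garbage when read with another. Unicode ("the universal code") was created to fix this: it assigns one number, a code point, to every character of every script, so mojibake caused by mismatched encodings goes away. Its early fixed-width form, UCS-2, stored each character in 2 bytes.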
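Unicode itself only assigns numbers to characters; how many bytes a character occupies depends on the encoding chosen later. As a minimal Python 3 sketch (the three sample characters are an arbitrary choice for illustration), `ord()` and `chr()` expose those code points:

```python
# Each character maps to exactly one Unicode code point, whatever the script.
samples = "A字€"                       # arbitrary sample characters (illustration only)
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    print(ch, f"U+{cp:04X}")

# chr() is the inverse of ord().
print(chr(0x5B57))
```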

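But encoding everything at a fixed width brings a new problem: waste. Text that is mostly English now takes two bytes per character where one was enough.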
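The waste is easy to measure. The sketch below, written for Python 3, encodes the same English word with ASCII and with two fixed-width Unicode encodings; the `-le` codec variants are used only so that no byte-order mark is added to the count.

```python
text = "hello"   # arbitrary ASCII-only sample (illustration only)

# Byte count of the same five characters under each encoding.
sizes = {codec: len(text.encode(codec)) for codec in ("ascii", "utf-16-le", "utf-32-le")}
print(sizes)
```

With a fixed width of 2 or 4 bytes per character, the same five letters take two or four times the space.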

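To avoid that waste, the variable-length encoding UTF-8 was introduced. UTF-8 encodes each Unicode code point in 1 to 4 bytes depending on its value (the original design allowed up to 6 bytes, but RFC 3629 limits it to 4). Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4.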
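The variable width can be observed directly in Python 3. This is a small sketch; the sample characters are chosen here for illustration and are written as escapes so the source stays unambiguous.

```python
# One sample per UTF-8 length class (characters chosen for illustration).
samples = ["A", "\u00e9", "\u5b57", "\U00020BB7", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

The Latin letter needs one byte, the accented letter two, the common Chinese character three, and the rare ideograph and the emoji four each.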

UTF-8 has an extra benefit: ASCII is effectively a subset of UTF-8, so a large body of legacy software that only understands ASCII keeps working with UTF-8.
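A short Python 3 check of that compatibility, as a sketch with an arbitrary sample string: ASCII-only text produces identical bytes under both encodings, and the bytes of a multi-byte UTF-8 character all have the high bit set, so they cannot be mistaken for ASCII.

```python
text = "plain ASCII text"            # arbitrary ASCII-only sample
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)     # same bytes either way
print(ascii_bytes.decode("utf-8"))   # ASCII data is already valid UTF-8

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, outside the ASCII range.
high_bit_set = all(b >= 0x80 for b in "字".encode("utf-8"))
print(high_bit_set)
```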

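As the diagram in the original post illustrates, converting UTF-8 to GBK in Python 2 has to go through unicode as an intermediate step. decode turns encoded bytes into unicode, and encode turns unicode into encoded bytes.

utf8 --> unicode --> gbk

or gbk --> unicode --> utf8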

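Python 2 demo:

    # -*- coding: utf-8 -*-
    # Python 2; assumes this source file is saved as UTF-8
    s = "字符编码问题"                    # str holding UTF-8 encoded bytes
    s_to_unicode = s.decode("utf-8")      # UTF-8 bytes -> unicode
    print(s_to_unicode)                   # 字符编码问题 (unicode)
    print(s)                              # 字符编码问题 (UTF-8 bytes)
    print(type(s_to_unicode), type(s), s_to_unicode, s)
    # prints a single tuple:
    # (<type 'unicode'>, <type 'str'>, u'\u5b57\u7b26\u7f16\u7801\u95ee\u9898', '\xe5\xad\x97\xe7\xac\xa6\xe7\xbc\x96\xe7\xa0\x81\xe9\x97\xae\xe9\xa2\x98')

    unicode_to_gbk = s_to_unicode.encode("gbk")   # unicode -> GBK bytes
    print(unicode_to_gbk)   # GBK bytes; they show up as garbled characters on a UTF-8 terminal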

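The garbled output of the last print in the Python 2 demo comes from GBK bytes being interpreted as UTF-8. The following is a Python 3 sketch of the same effect, not a reproduction of the Python 2 run: strict decoding with the wrong codec fails, `errors="replace"` substitutes U+FFFD for the undecodable bytes, and decoding with the matching codec recovers the text.

```python
gbk_bytes = "字符编码问题".encode("gbk")

# Strict decoding with the wrong codec raises an error.
try:
    gbk_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" swaps undecodable bytes for U+FFFD, which is what a
# terminal shows as replacement glyphs.
garbled = gbk_bytes.decode("utf-8", errors="replace")

# Decoding with the codec that was used for encoding recovers the text.
recovered = gbk_bytes.decode("gbk")

print(strict_failed)
print(garbled)
print(recovered)
```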

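Python 3 simplifies this. A str is always Unicode text, so there is no separate decode step when you start from a string: you encode the str directly to GBK or to UTF-8. (Converting UTF-8 bytes to GBK bytes, or the reverse, still passes through str in the middle.)

str --> gbk

str --> utf8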
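When the data is already a bytes object rather than a str, the conversion in Python 3 still has the decode-then-encode shape. A minimal sketch, with variable names chosen here for illustration:

```python
utf8_bytes = "字符编码问题".encode("utf-8")   # stand-in for data read from a UTF-8 source

text = utf8_bytes.decode("utf-8")     # UTF-8 bytes -> str (Unicode)
gbk_bytes = text.encode("gbk")        # str -> GBK bytes

# And back again: GBK bytes -> str -> UTF-8 bytes.
round_trip = gbk_bytes.decode("gbk").encode("utf-8")

print(gbk_bytes)
print(round_trip == utf8_bytes)
```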

Python 3 demo:

    s = "字符编码问题python 3"      # str: Unicode text
    s_to_gbk = s.encode("gbk")
    print(s_to_gbk)    # b'\xd7\xd6\xb7\xfb\xb1\xe0\xc2\xeb\xce\xca\xcc\xe2python 3'

    s_to_utf8 = s.encode("utf8")
    print(s_to_utf8)   # b'\xe5\xad\x97\xe7\xac\xa6\xe7\xbc\x96\xe7\xa0\x81\xe9\x97\xae\xe9\xa2\x98python 3'


Python 3 uses UTF-8 as its default encoding. The output above makes the size difference plain: each Chinese character takes 3 bytes in UTF-8 and 2 bytes in GBK.
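The byte counts can be confirmed with `len()`. A small Python 3 sketch:

```python
import sys

text = "字符编码问题"                  # six Chinese characters
char_count = len(text)
utf8_len = len(text.encode("utf-8"))  # 3 bytes per character here
gbk_len = len(text.encode("gbk"))     # 2 bytes per character here

default_encoding = sys.getdefaultencoding()
print(char_count, utf8_len, gbk_len, default_encoding)
```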

Looping over a str yields bytes in Python 2, but characters in Python 3.

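Demo (Python 2). Note: the file encoding, the string's encoding and the terminal's encoding must all agree before the string displays as intended. Here, print(i) on its own will not display properly, because each i is a single byte of a multi-byte character.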

    # -*- coding: utf-8 -*-
    # Python 2
    name = "字符"
    for i in name:
        # print(i)            # a lone byte does not display properly
        print(type(i), i)
    # Output:
    # (<type 'str'>, '\xe5')
    # (<type 'str'>, '\xad')
    # (<type 'str'>, '\x97')
    # (<type 'str'>, '\xe7')
    # (<type 'str'>, '\xac')
    # (<type 'str'>, '\xa6')

Demo (Python 3):

    # Python 3
    s = "字符编码问题python 3"   # str: Unicode text
    for i in s:
        print(i)
    # Output:
    # 字
    # 符
    # 编
    # 码
    # 问
    # 题
    # p
    # y
    # t
    # h
    # o
    # n
    #   (a single space)
    # 3
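To get at the individual bytes in Python 3, encode first and loop over the resulting bytes object. As a sketch: iterating a str gives one-character strings, while iterating bytes gives integers in range(256).

```python
name = "字符"

chars = list(name)                         # iterating str: characters
byte_values = list(name.encode("utf-8"))   # iterating bytes: ints

print(chars)
print([hex(b) for b in byte_values])
print(type(byte_values[0]))
```

The six integers are the same six byte values that the Python 2 loop printed as '\xe5', '\xad', and so on.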

Python 3 is friendlier and more modern than Python 2 in this respect, so the rest of this post uses Python 3.

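Probably the most important new feature of Python 3 is a much clearer distinction between text and binary data. Text is always Unicode and is represented by the str type; binary data is represented by the bytes type. Python 3 never mixes str and bytes implicitly, and that is what makes the distinction so clear. You cannot concatenate a str with bytes, search for a str inside bytes (or the reverse), or pass a str to a function that expects bytes (or the reverse).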
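Each of those restrictions shows up as a TypeError. The sketch below tries three mixed operations and records which ones Python 3 rejects; the labels and the choice of `startswith` as the "function expecting bytes" are illustrative assumptions.

```python
text = "abc"      # str: Unicode text
data = b"abc"     # bytes: binary data

# Operations that mix the two types (chosen for illustration).
operations = {
    "concatenate": lambda: text + data,
    "search": lambda: text in data,
    "pass str where bytes expected": lambda: data.startswith(text),
}

rejected = []
for label, operation in operations.items():
    try:
        operation()
    except TypeError as exc:
        rejected.append(label)
        print(f"{label}: TypeError: {exc}")

# Comparison does not raise, but a str never equals a bytes object.
equal = (text == data)
# Mixing works once one side is converted explicitly.
joined = text + data.decode("ascii")
print(equal, joined)
```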

A str can be encoded into bytes, and bytes can be decoded into a str.

Demo:

    res1 = '€20'.encode('utf-8')
    res2 = b'\xe2\x82\xac20'.decode('utf-8')
    print(res1, res2)
    # Output:
    # b'\xe2\x82\xac20' €20


First, two built-ins: bytes() and str()

bytes():

Purpose: convert a string to bytes.

    def __init__(self, value=b'', encoding=None, errors='strict'): # known special case of bytes.__init__
        """
        bytes(iterable_of_ints) -> bytes
        bytes(string, encoding[, errors]) -> bytes
        bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
        bytes(int) -> bytes object of size given by the parameter initialized with null bytes
        bytes() -> empty bytes object

        Construct an immutable array of bytes from:
          - an iterable yielding integers in range(256)
          - a text string encoded using the specified encoding
          - any object implementing the buffer API.
          - an integer
        # (copied from class doc)
        """
        pass

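The constructor forms listed in that docstring can be tried one by one. A minimal Python 3 sketch, with variable names chosen here for illustration:

```python
from_text = bytes("字", encoding="utf-8")   # string plus an explicit encoding
from_ints = bytes([0xE5, 0xAD, 0x97])        # iterable of ints in range(256)
copied = bytes(from_text)                    # copy of an existing bytes object
zeros = bytes(3)                             # three null bytes
empty = bytes()                              # empty bytes object

print(from_text, from_ints, copied, zeros, empty)

# A string without an encoding is rejected.
try:
    bytes("字")
    missing_encoding_rejected = False
except TypeError:
    missing_encoding_rejected = True
print(missing_encoding_rejected)
```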

str():

    def __init__(self, value='', encoding=None, errors='strict'): # known special case of str.__init__
        """
        str(object='') -> str
        str(bytes_or_buffer[, encoding[, errors]]) -> str
        
        Create a new string object from the given object. If encoding or
        errors is specified, then the object must expose a data buffer
        that will be decoded using the given encoding and error handler.
        Otherwise, returns the result of object.__str__() (if defined)
        or repr(object).
        encoding defaults to sys.getdefaultencoding().
        errors defaults to 'strict'.
        # (copied from class doc)
        """
        pass

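The two str() signatures behave quite differently when given bytes, which is easy to trip over. A short Python 3 sketch: with an encoding the bytes are decoded, and without one the result is the printable representation of the bytes object.

```python
data = b"\xe2\x82\xac20"

decoded = str(data, encoding="utf-8")   # decodes, same as data.decode("utf-8")
as_repr = str(data)                     # no encoding given: falls back to repr(data)
from_int = str(42)                      # ordinary object -> its string form

print(decoded)
print(as_repr)
print(from_int)
```

In practice `data.decode("utf-8")` is the more common spelling of the first form.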

A simple exercise: print your name in binary (Python 3)

    name = "张三"
    for ch in name:
        ch_bytes = bytes(ch, encoding='utf8')   # UTF-8 bytes of one character
        for b in ch_bytes:                      # each b is an int in range(256)
            print(bin(b))
    # Output:
    # 0b11100101
    # 0b10111100
    # 0b10100000
    # 0b11100100
    # 0b10111000
    # 0b10001001

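An alternative sketch of the same exercise encodes the whole name at once and uses `format()` to get fixed-width 8-bit strings without the `0b` prefix. It also converts the bits back, to show that nothing is lost.

```python
name = "张三"

# One zero-padded 8-bit string per UTF-8 byte.
bits = [format(b, "08b") for b in name.encode("utf-8")]
print(" ".join(bits))

# Reverse direction: bit strings -> ints -> bytes -> str.
restored = bytes(int(chunk, 2) for chunk in bits).decode("utf-8")
print(restored)
```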