Python plain str strings and unicode strings, with string encoding detection and conversion

Date: 2019-11-09


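The environment used for this article: CentOS release 6.4, kernel version 2.6.32-358.el6.x86_64, Python 2.6.6.

Contents: the two string-related magic methods __str__() and __unicode__(); the two functions str() and unicode(); type conversion with encode and decode; and encoding detection with chardet and cchardet.

First, a look at the two magic methods of an object.

First: object.__str__(self)

Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object. The return value must be a string object.

In other words, str() and the print statement call this method to get an informal description of the object. The "string object" it must return presumably means a byte string here (an 8-bit str).

Second: object.__unicode__(self)

Called to implement unicode() built-in; should return a Unicode object. When this method is not defined, string conversion is attempted, and the result of string conversion is converted to Unicode using the system default encoding.

In other words, the built-in unicode() calls this method, which should return a Unicode object. If the method is missing, the object is first converted to an 8-bit string, and that string is then decoded to a Unicode string with the system default encoding.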

str() 和 unicode()

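str(object='')

Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. If no argument is given, returns the empty string, ''.

That is, str() returns a printable description of the object passed in (by calling the object's __str__() method). A string object (a byte string) is returned as-is, and with no argument the result is the empty string.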

unicode(object='')

unicode(object[, encoding[, errors]])

If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

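Without the optional parameters, unicode() mimics str() but returns a Unicode string rather than an 8-bit string. More precisely, if the object passed in is a Unicode string or a subclass of it, no decoding is performed and the object itself is returned.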

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.

For an object that provides a __unicode__() method, that method is called without arguments. For every other object, its 8-bit string version or representation is requested and then decoded to a Unicode string with the codec for the system default encoding, in 'strict' mode.

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.

Put simply: the 8-bit string or buffer passed in is decoded with the specified encoding to produce a Unicode string, and the errors parameter specifies what to do with input that cannot be decoded.
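The three error modes described above can be sketched with the decode method, which takes the same encoding and errors arguments as unicode(). This is a minimal illustration, not code from the original article: try_decode is a hypothetical helper, the sample bytes are made up, and the snippet is written so that it should run on Python 3 (where bytes plays the role of the 8-bit string) as well as on Python 2.6.

```python
def try_decode(data, encoding, errors='strict'):
    """Hypothetical helper: return the decoded text, or the name of the
    exception class raised while decoding."""
    try:
        return data.decode(encoding, errors)
    except (LookupError, ValueError) as exc:
        # UnicodeDecodeError is a subclass of ValueError.
        return type(exc).__name__


# Made-up sample: "caf" followed by a Latin-1 e-acute, which is not valid UTF-8.
data = b'caf\xe9'

strict_result = try_decode(data, 'utf-8')               # 'strict' raises
ignore_result = try_decode(data, 'utf-8', 'ignore')     # bad byte dropped
replace_result = try_decode(data, 'utf-8', 'replace')   # bad byte -> U+FFFD
unknown_result = try_decode(data, 'no-such-codec')      # unknown codec name

print(strict_result)
print(unknown_result)
```

Note that 'ignore' loses data silently, so 'replace' is usually easier to debug because the U+FFFD markers stay visible in the output.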

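Summary:

str(): calls the object's __str__() method and produces an 8-bit string.

unicode(): calls the object's __unicode__() method and returns a Unicode string. If the object has no __unicode__() method, __str__() is called to produce an 8-bit string (this step is skipped when the argument already is an 8-bit string), which is then decoded with the system default encoding to produce a Unicode string.

So for classes we write ourselves, it is best to provide both __str__() and __unicode__(); they are very useful for logging, debugging, and so on.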
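A class that provides both methods might look like the sketch below. The Host class and its fields are hypothetical, chosen only for illustration. The version check is an assumption added so that the same file should also run on Python 3, where __unicode__() is no longer called automatically and str is already text.

```python
import sys

PY2 = sys.version_info[0] == 2


class Host(object):
    """Hypothetical example class; not from the original article."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner

    def __unicode__(self):
        # Build the text (Unicode) description in one place.
        return u'Host(%s, owner=%s)' % (self.name, self.owner)

    def __str__(self):
        text = self.__unicode__()
        # Python 2: __str__ must return an 8-bit string, so encode explicitly.
        # Python 3: str is already text, so return it unchanged.
        return text.encode('utf-8') if PY2 else text


host = Host(u'web01', u'\u5f20\u4e09')   # the owner is "张三"
text_description = host.__unicode__()     # what unicode(host) returns on Python 2
str_description = str(host)               # UTF-8 bytes on Python 2, text on Python 3
```

Encoding explicitly inside __str__() keeps the result independent of the system default encoding; UTF-8 is an arbitrary choice here.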

Note: to check Python's system default encoding (the default that unicode() uses in the text above):

import sys
print sys.getdefaultencoding()

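Setting the default encoding in interactive mode:

>>> reload(sys) # important: site.py removes sys.setdefaultencoding at startup, so without the reload the next line raises AttributeError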
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf8')
>>> sys.getdefaultencoding()
'utf8'
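Changing the process-wide default affects every implicit conversion in the program, so an alternative is to leave it alone and decode explicitly wherever bytes enter the program. A minimal sketch under that assumption; to_text is a hypothetical helper, not part of the original article, and it should run on Python 3 as well as on Python 2.6, where bytes is an alias of the plain str type.

```python
def to_text(value, encoding='utf-8', errors='strict'):
    """Hypothetical helper: return Unicode text for either kind of string.

    8-bit strings are decoded with the encoding named by the caller;
    values that are already Unicode text are returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


from_bytes = to_text(b'\xe4\xbd\xa0\xe5\xa5\xbd')   # UTF-8 bytes of "你好"
from_text = to_text(u'\u4f60\u597d')                # already Unicode text
```

Because the encoding is named at the call site, the result does not depend on sys.getdefaultencoding().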

The difference between plain strings and Unicode strings

First confirm that the system locale is set to UTF-8:
echo $LANG
en_US.UTF-8

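Then confirm that the terminal in use is also set to utf-8; otherwise some of the experiments will give confusing results.

1. A plain string can be understood as a string in the everyday sense: a buffer that holds the characters, whose content may be in any of a variety of encodings. (I think of it as a C char array.)

>>> str_string = '你好，编码'

The content of str_string might be: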

utf8: '\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe7\xbc\x96\xe7\xa0\x81'

utf16: '\xff\xfe`O}Y\x0c\xff\x16\x7f\x01x'

utf32: '\xff\xfe\x00\x00`O\x00\x00}Y\x00\x00\x0c\xff\x00\x00\x16\x7f\x00\x00\x01x\x00\x00'

gb2312: '\xc4\xe3\xba\xc3\xa3\xac\xb1\xe0\xc2\xeb'

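2. unicode is an object type that stores the string in the Unicode character set. (Internally it is stored as UCS2 or UCS4; at a lower level, depending on the platform, it uses wchar_t, unsigned short, or unsigned long. For details, search the Python documentation for "Encodings and Unicode" and Py_UNICODE.)

>>> unicode_string = u'你好，字符编码'
>>> unicode_string
u'\u4f60\u597d\uff0c\u5b57\u7b26\u7f16\u7801'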

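In the case below, the repr of the unicode object should contain \uxxxx escapes inside u'...', but it contains \x escapes instead. This is probably caused by a mismatch between the system encoding and the terminal encoding (I use Xshell). In fact u'\xe4' == u'\u00e4', so each UTF-8 byte has been stored as a separate character. This easily produces mojibake, so avoid such an environment.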

>>> unicode_string = u'你好，字符编码'
>>> unicode_string
u'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe5\xad\x97\xe7\xac\xa6\xe7\xbc\x96\xe7\xa0\x81'

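Reference: http://stackoverflow.com/questions/9845842/bytes-in-a-unicode-python-string

Summary:

Plain string (8-bit string, byte string):

Stores the string as a sequence of consecutive 8-bit bytes, so its content depends on the encoding used.

Unicode string:

An object that stores the string as UCS2 (or UCS4, decided at build time). It is independent of any encoding, and it can be used to produce plain strings in any encoding.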
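The summary can be checked by comparing the length of one Unicode string with the length of its byte form in several encodings. This is a small sketch with a sample string written as escapes; it should run on Python 3 as well as Python 2.6.

```python
text = u'\u4f60\u597d\uff0c\u7f16\u7801'   # "你好，编码": 5 code points

sizes = {}
for codec in ('utf-8', 'utf-16', 'utf-32', 'gb2312'):
    data = text.encode(codec)           # the 8-bit string in this encoding
    sizes[codec] = len(data)            # length in bytes, not characters
    assert data.decode(codec) == text   # decoding with the same codec round-trips

print(len(text))
print(sorted(sizes.items()))
```

The Unicode string always has length 5, while the byte strings take 15 (utf-8), 12 (utf-16, including a 2-byte BOM), 24 (utf-32, including a 4-byte BOM), and 10 (gb2312) bytes.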

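Using encode and decode

Once the difference between the two kinds of string is clear, these two methods are easy to understand.

encode means encoding; decode means decoding.

To turn a Unicode string into a plain string, it has to be encoded with some encoding (encode).

For a plain string, you need to know which encoding it is in, and decode it with that encoding (decode) to get a Unicode string object.

So:

Using encode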

>>> a = u'你好，世界！'
>>> a
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8')
'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81'
>>> a.encode('utf16')
'\xff\xfe`O}Y\x0c\xff\x16NLu\x01\xff'
>>> a.encode('gb2312')
'\xc4\xe3\xba\xc3\xa3\xac\xca\xc0\xbd\xe7\xa3\xa1'
>>> a.encode('ascii') # the ascii character set cannot represent Chinese, so this raises an error; the string u'Hello,World!' would work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
>>>

Using decode

>>> b = a.encode('utf8') # first use a to produce a plain string in some encoding, then decode it; the encodings must match!
>>> b.decode('utf8')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> b = a.encode('utf16')
>>> b.decode('utf16')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> b = a.encode('gb2312')
>>> b.decode('gb2312')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> b = a.encode('utf8')
>>> b.decode('gb2312') # mismatched encodings: b is a utf8-encoded string and cannot be decoded as gb2312
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence
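>>> b.decode('gb2312','ignore') # this is where the second argument comes in: 'ignore' skips undecodable bytes and 'replace' substitutes U+FFFD, so no exception is raised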
u'\u6d63\u30bd\u951b'
>>> print b.decode('gb2312','ignore')
浣ソ锛
>>> b.decode('gb2312','replace')
u'\u6d63\ufffd\u30bd\u951b\ufffd\ufffd\ufffd\ufffd\ufffd'
>>> print b.decode('gb2312','replace')
浣�ソ锛�����
>>>
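When the encoding of incoming bytes is uncertain but the plausible candidates are known, one option is to try them in order and fall back to 'replace' only at the end. The sketch below is an illustration under that assumption; decode_with_fallback and its candidate list are hypothetical, not from the original article, and the code should run on Python 3 as well as Python 2.6.

```python
def decode_with_fallback(data, candidates=('utf-8', 'gb2312')):
    """Hypothetical helper: return (text, encoding) for the first candidate
    that decodes without error, or (text_with_U+FFFD, None) if none does."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep going with U+FFFD markers instead of raising.
    return data.decode(candidates[0], 'replace'), None


greeting = u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'   # "你好，世界！"

from_utf8 = decode_with_fallback(greeting.encode('utf8'))
from_gb = decode_with_fallback(greeting.encode('gb2312'))
from_junk = decode_with_fallback(b'\xff\xff')
```

The order of the candidates matters: strict encodings such as utf-8 belong first, because a permissive single-byte codec such as latin-1 accepts any byte sequence and would hide every later candidate.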

And a mixed example :-)

>>> a = u'你好，世界！'
>>> a
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8').decode('utf8')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8').decode('utf8').encode('gb2312')
'\xc4\xe3\xba\xc3\xa3\xac\xca\xc0\xbd\xe7\xa3\xa1'
>>> a.encode('utf8').decode('utf8').encode('gb2312').decode('gb2312')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
>>> a.encode('utf8').decode('utf8').encode('gb2312').decode('gb2312').encode('utf16')
'\xff\xfe`O}Y\x0c\xff\x16NLu\x01\xff'
>>> a.encode('utf8').decode('utf8').encode('gb2312').decode('gb2312').encode('utf16').decode('utf16')
u'\u4f60\u597d\uff0c\u4e16\u754c\uff01'
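The chain above shows that converting a plain string from one encoding to another always passes through Unicode. That step can be wrapped in a small function; transcode is a hypothetical name used only for this sketch, which should run on Python 3 as well as Python 2.6.

```python
def transcode(data, source, target):
    """Hypothetical helper: re-encode an 8-bit string from source to target.

    The bytes are first decoded to Unicode, then encoded again; there is no
    direct bytes-to-bytes shortcut between two text encodings.
    """
    return data.decode(source).encode(target)


# UTF-8 bytes of "你好，世界！", as printed by a.encode('utf8') above.
utf8_bytes = (b'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8c'
              b'\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x81')

gb_bytes = transcode(utf8_bytes, 'utf-8', 'gb2312')
round_trip = transcode(gb_bytes, 'gb2312', 'utf-8')
```

transcode raises UnicodeDecodeError if the source encoding is wrong and UnicodeEncodeError if the target encoding cannot represent a character, the same failures shown in the sessions above.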

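Detecting the encoding of plain Python strings

Plain strings differ by encoding. We often read strings from the network or from a file, and if the other end is not our own program, it is hard to say which encoding was used. This is where string encoding detection comes in.

A few caveats up front:

1. In theory no detector can be 100% accurate, because encodings overlap: the same byte sequence can be valid in several encodings while standing for different characters. A detector can only report which encoding the string most likely uses. (For example, 'abc' encoded as utf8 will be detected as ascii, because ascii and utf8 encode a, b, and c identically. This is utf8's compatibility with ascii, and many other encodings are designed to be ASCII-compatible as well. That is why, on a screen full of mojibake, letters, digits, and underscores still display correctly: abc123 decodes correctly under any of these ASCII-compatible encodings.)

2. Only encodings supported by the detection library can be detected; unsupported ones are out of reach. (If you invent a private encoding of your own, how could anyone else detect it?)

3. The more characters in the sample, the more accurate the detection; too few will not do. A gb2312 string of two or three Chinese characters cannot be detected correctly.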
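The ASCII-compatibility point in the first caveat can be checked directly. This is a small sketch: the list of codecs is an arbitrary selection of ASCII-compatible encodings, the sample strings are made up, and the code should run on Python 3 as well as Python 2.6.

```python
ascii_text = u'abc_123'
codec_names = ('ascii', 'utf-8', 'gb2312', 'latin-1', 'euc-jp')

# Every ASCII-compatible codec produces exactly the same bytes for pure
# ASCII text, so the bytes alone cannot reveal which one was used.
distinct_encodings = set(ascii_text.encode(name) for name in codec_names)

# Decoding UTF-8 bytes with the wrong codec garbles the Chinese part,
# but the ASCII prefix survives unchanged.
mixed = u'id_42: \u4f60\u597d'.encode('utf-8')
garbled = mixed.decode('gb2312', 'replace')

print(len(distinct_encodings))
print(garbled.startswith(u'id_42: '))
```

Encodings that are not ASCII-compatible, such as utf-16 or utf-32, are the exception: they use two or four bytes even for plain letters.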

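Back to the point: let's start decoding!

First download an encoding detection library from PyPI: chardet, or cchardet (its documentation says it is faster, but it depends on another library).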

https://pypi.python.org/pypi?%3Aaction=search&term=chardet&submit=search

An example from the chardet documentation:

The easiest way to use the Universal Encoding Detector library is with the detect function.

>>> import urllib
>>> rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}

A more advanced example:

If you’re dealing with a large amount of text, you can call the Universal Encoding Detector library incrementally, and it will stop as soon as it is confident enough to report its results.

import urllib
from chardet.universaldetector import UniversalDetector

usock = urllib.urlopen('http://yahoo.co.jp/')
detector = UniversalDetector()
for line in usock.readlines():
    detector.feed(line)
    if detector.done: break
detector.close()
usock.close()
print detector.result
{'encoding': 'EUC-JP', 'confidence': 0.99}

One more:

If you want to detect the encoding of multiple texts (such as separate files)

import glob
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
for filename in glob.glob('*.xml'):
    print filename.ljust(60),
    detector.reset()
    for line in file(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print detector.result
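Once a detection result is available, the remaining step is to decode with it. The sketch below assumes only the result shape shown in the examples above, a dict with 'encoding' and 'confidence' keys. The decode_detected helper, the 0.5 threshold, and the utf-8 default are assumptions made for illustration, not chardet recommendations. The result dicts are written by hand so that the snippet runs without chardet installed.

```python
def decode_detected(data, result, min_confidence=0.5, default='utf-8'):
    """Hypothetical helper: decode data using a detection result dict.

    Falls back to `default` when no encoding was reported, when the
    confidence is below the (arbitrary) threshold, or when Python has
    no codec for the reported name.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    if not encoding or confidence < min_confidence:
        encoding = default
    try:
        return data.decode(encoding, 'replace')
    except LookupError:
        return data.decode(default, 'replace')


# Hand-written results in the shape shown above; with chardet installed
# they would come from chardet.detect(data) instead.
confident = decode_detected(b'\xc4\xe3\xba\xc3',
                            {'encoding': 'GB2312', 'confidence': 0.99})
undetected = decode_detected(b'\xe4\xbd\xa0\xe5\xa5\xbd',
                             {'encoding': None, 'confidence': 0.0})
```

Using 'replace' keeps the function from raising on a wrong guess; the U+FFFD characters in the output then signal that the detection was probably off.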

Appendix: a fairly good article on Python encodings. It is in English, but easy to follow.

Making Sense of Python Unicode
