Python Unicode and Chinese Text Handling (Repost)

Date: 2019-11-11



Unicode in Python is confusing and hard to understand. This article tries to resolve those problems once and for all.

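1. The relationship between Unicode, GBK, GB2312, and UTF-8

The article at http://www.pythonclub.org/python-basic/encode-detail explains this well. UTF-8 is one way of implementing (encoding) Unicode, while Unicode, GBK, and GB2312 are coded character sets.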
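As a small illustration of the distinction above, the sketch below encodes the same two characters with three codecs. It is written so that it should run on both Python 2 and Python 3; the variable names are illustrative and not from the original article.

```python
# -*- coding: utf-8 -*-
# The same Unicode text produces different byte sequences under different encodings.
text = u'中国'

utf8_bytes = text.encode('utf-8')      # three bytes per character for this text
gbk_bytes = text.encode('gbk')         # two bytes per character
gb2312_bytes = text.encode('gb2312')   # GBK extends GB2312, so these bytes match

print(repr(utf8_bytes))
print(repr(gbk_bytes))
print(gbk_bytes == gb2312_bytes)

# Decoding with the matching codec recovers the original text.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text
```

The text itself never changes; only its byte representation does, which is why the codec has to be known whenever bytes are turned back into text.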

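2. Chinese encoding issues in Python

2.1 Encoding declarations in .py files

By default Python treats script files as ASCII. When a file contains characters outside the ASCII range, an encoding declaration is needed to correct this. If the .py file that defines a module contains Chinese characters (strictly speaking, any non-ASCII character), the encoding must be declared on the first or second line:

# -*- coding=utf-8 -*- or #coding=utf-8. Other encodings such as gbk or gb2312 also work. Without a declaration you get an error like: SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details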
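To see how such a declaration is picked up, here is a minimal sketch using `tokenize.detect_encoding`. Note the assumptions: this function exists only in the Python 3 standard library, whereas the article targets Python 2, and the helper name `declared_encoding` and the sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source code."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _first_lines = tokenize.detect_encoding(readline)
    return encoding


gbk_source = b"# -*- coding=gbk -*-\ns = 1\n"
plain_source = b"s = 1\n"

print(declared_encoding(gbk_source))    # the codec named in the declaration
print(declared_encoding(plain_source))  # Python 3 falls back to UTF-8, not ASCII
```

The different fallback is the main behavioral change between versions: Python 2 assumes ASCII and raises the SyntaxError quoted above, while Python 3 assumes UTF-8.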

2.2 Encoding and decoding in Python

First, the string types. Python has two string types, str and unicode, both subclasses of basestring. A str is a sequence in which "characters represent (at least) 8-bit bytes", while each unit of a unicode object is a Unicode character. Therefore:

len(u'中国') is 2, and len('ab') is also 2.
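The difference becomes visible once non-ASCII text is encoded. The following sketch, which should run on Python 2 and Python 3 alike, compares the character count with the byte counts; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
text = u'中国'
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

print(len(text))        # counts Unicode characters
print(len(utf8_bytes))  # counts bytes in the UTF-8 form
print(len(gbk_bytes))   # counts bytes in the GBK form
```

In Python 2, a plain '中国' literal in a UTF-8 source file is the byte form, so its len() is the byte count rather than 2.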

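The str documentation contains this sentence: The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file. In other words, when you read the contents of a file or read data from the network, the object you get is a str. To convert a str to a particular encoding, first convert it to unicode, then encode the unicode object to the target encoding, such as UTF-8 or GB2312.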
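The decode-then-encode route described above can be sketched as follows. The `transcode` helper and the temporary file are assumptions made for illustration; the code uses binary file mode so that it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import os
import tempfile


def transcode(raw, source_encoding, target_encoding):
    """Decode bytes to Unicode text, then encode that text to the target encoding."""
    return raw.decode(source_encoding).encode(target_encoding)


# Write GB2312 bytes to a temporary file to stand in for external data.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(u'abc中文'.encode('gb2312'))

with open(path, 'rb') as f:
    raw = f.read()  # bytes exactly as stored on disk

utf8_bytes = transcode(raw, 'gb2312', 'utf-8')
print(repr(utf8_bytes))
```

Unicode acts as the pivot: there is no direct GB2312-to-UTF-8 step, only bytes to text and text to bytes.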

Python provides the following conversion functions.

Converting unicode to GB2312, UTF-8, and so on:

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    s = u'中国'
    s_gb = s.encode('gb2312')

To convert UTF-8 or GBK to unicode, use the function unicode(s, encoding) or the method s.decode(encoding):

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    s = u'中国'

    # s is unicode; encode it to UTF-8 first
    s_utf8 = s.encode('UTF-8')

    # decoding the UTF-8 bytes gives back the original unicode object
    assert(s_utf8.decode('utf-8') == s)

Converting a plain str to unicode:

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    s = '中国'
    su = u'中国'

    # s is a str holding UTF-8 bytes, because the .py file it lives in
    # (# -*- coding=UTF-8 -*-) is encoded as UTF-8
    s_unicode = s.decode('UTF-8')
    assert(s_unicode == su)

    # to convert s to gb2312, decode to unicode first, then encode as gb2312
    s.decode('utf-8').encode('gb2312')

# -*- coding=UTF-8 -*-

if __name__ == '__main__':
    s = '中国'

    # What happens if we call s.encode('gb2312') directly?
    s.encode('gb2312')

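This raises an exception:

Python automatically decodes s to unicode first and then encodes the result as gb2312. Because the decoding step is implicit and we never said which encoding to use, Python uses the default encoding reported by sys.getdefaultencoding(). In many cases that default is ASCII, and if s is not ASCII the call fails.
In the example above, my default encoding is ascii, while s is encoded the same way as the file, in UTF-8, so it fails with: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)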
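The implicit step can be made explicit. The sketch below spells out what the paragraph above says Python 2 does behind the scenes, using an explicit 'ascii' decode so that it should behave the same on Python 2 and Python 3. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
raw = u'中国'.encode('utf-8')  # the bytes a UTF-8 source file stores for '中国'

# Equivalent of the implicit behaviour: decode with the default codec (ASCII),
# then encode to the requested target.
try:
    raw.decode('ascii').encode('gb2312')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)

# Naming the real source encoding makes the same conversion succeed.
gb_bytes = raw.decode('utf-8').encode('gb2312')
print(repr(gb_bytes))
```

Note that the failure is a UnicodeDecodeError even though the code called encode, which is the usual clue that an implicit decode happened first.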
There are two ways to fix this error.
The first is to state the encoding of s explicitly:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb2312')
The second is to change the default encoding to match the file's encoding:

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import sys
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization, so sys has to be reloaded
sys.setdefaultencoding('utf-8')

s = '中文'  # renamed from str to avoid shadowing the built-in type
s.encode('gb2312')

File encoding and the print statement
Create a file Test.txt, saved in ANSI format, with the content:
abc中文
Read it with Python:
# coding=gbk
print open("Test.txt").read()
Result: abc中文
Now change the file format to UTF-8:
Result: abc涓枃
Clearly, the data needs to be decoded here:
# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")
Result: abc中文
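The garbled output above is what happens when UTF-8 bytes are interpreted as GBK. A minimal sketch of the same effect without a console or a file, assuming nothing beyond the standard codecs, and intended to run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
raw = u'abc中文'.encode('utf-8')  # what the UTF-8 file contains

right = raw.decode('utf-8')
# 'replace' keeps this demonstration from raising on byte pairs GBK cannot map.
wrong = raw.decode('gbk', 'replace')

print(right == u'abc中文')
print(wrong == right)
```

The ASCII prefix survives under either codec, which is why such bugs often go unnoticed until the first non-ASCII character appears.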
I edited Test.txt above with EditPlus. But when I edited it with the Notepad that ships with Windows and saved it as UTF-8,
running the script failed:
Traceback (most recent call last):
File "ChineseTest.py", line 3, in <module>
print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

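It turns out that some programs, such as Notepad, insert three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the start of a file when saving it as UTF-8.
So we have to strip these bytes ourselves when reading. Python's codecs module defines a constant for this:
# coding=gbk
import codecs
data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
Result: abc中文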
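As an alternative to slicing the BOM off by hand, the standard 'utf-8-sig' codec (available since Python 2.5) skips a leading BOM when one is present. The sketch below uses in-memory bytes instead of Test.txt so that it is self-contained; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import codecs

data = codecs.BOM_UTF8 + u'abc中文'.encode('utf-8')  # what Notepad would write

with_bom = data.decode('utf-8')         # the BOM survives as u'\ufeff'
without_bom = data.decode('utf-8-sig')  # the BOM is stripped

# Input without a BOM passes through 'utf-8-sig' unchanged.
plain = u'abc中文'.encode('utf-8').decode('utf-8-sig')

print(repr(with_bom[0]))
print(without_bom == plain)
```

Because 'utf-8-sig' accepts input with or without the BOM, it can be used for files whose origin is unknown.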

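(4) A remaining question
In part two we used the unicode function and the decode method to convert str to unicode. Why did both calls take "gbk" as the argument?
The first guess is that it is because our encoding declaration uses gbk (# coding=gbk), but is that really the reason?
Modify the source file: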
# coding=utf-8
s = "中文"
print unicode(s, "utf-8")
Running it raises an error:
Traceback (most recent call last):
File "ChineseTest.py", line 3, in <module>
s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
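Clearly, if the earlier example worked because both sides used gbk, then this one, where both sides use utf-8, should work as well rather than fail.
A further example: what if the conversion still uses gbk?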
# coding=utf-8
s = "中文"
print unicode(s, "gbk")
Result: 中文
I found an English article that broadly explains how print works in Python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

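Put simply, print in Python hands the string straight to the operating system, so you need to convert the str into a form consistent with what the operating system expects. Windows uses CP936 (almost identical to GBK), so gbk works here.
A final test: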
# coding=utf-8
s = "中文"
print unicode(s, "cp936")
Result: 中文
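One way to act on the quoted advice is to encode Unicode text explicitly for the display program before writing it. The following is a minimal sketch under stated assumptions: `encode_for_console` is a hypothetical helper, not part of any library, and falling back to UTF-8 when the console encoding is unknown is a choice made here for illustration.

```python
# -*- coding: utf-8 -*-
import sys


def encode_for_console(text, encoding=None):
    """Encode Unicode text to the bytes a display program expects.

    `encoding` names the console codec. When omitted, sys.stdout.encoding is
    used, then UTF-8 as a last resort. Unencodable characters become '?'.
    """
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(target, 'replace')


cp936_bytes = encode_for_console(u'中文', 'cp936')
ascii_bytes = encode_for_console(u'abc中文', 'ascii')
print(repr(cp936_bytes))
print(repr(ascii_bytes))
```

Keeping text as unicode inside the program and encoding only at the output boundary avoids depending on how the source file happened to be saved.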

Special recommendation:

Encoding detection in Python

chardet makes it easy to detect the encoding of a string or a file.

Example:

>>> import urllib
>>> rawdata = urllib.urlopen('http://www.google.cn/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
>>>

chardet download: http://chardet.feedparser.org/

Special note:

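At work I often run into this situation: data read from a file or fetched from a web page looks like GB2312, yet decode keeps failing. In that case, decode('gb18030') can solve the problem. If it still fails, remember that decode takes a second argument.

For example, to convert a string s from GBK to UTF-8 you can write s.decode('gbk').encode('utf-8'). In practice, however, I found that this often raises an exception: UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence

The cause is illegal characters. In particular, in some programs written in C/C++ the full-width space is produced in several different ways, such as \xa3\xa0 or \xa4\x57. These all look like full-width spaces, but they are not the "legal" full-width space (the real one is \xa1\xa1), so the conversion raises an exception. This is a real headache, because a single illegal character anywhere in a string, which is sometimes an entire article, makes the whole string impossible to convert.

The fix: s.decode('gbk', 'ignore').encode('utf-8')

The prototype of decode is decode([encoding], [errors='strict']), and the second argument controls the error-handling strategy. The default, strict, raises an exception on illegal characters. ignore skips illegal characters. replace substitutes a replacement marker (U+FFFD when decoding, ? when encoding). xmlcharrefreplace uses XML character references and applies to encoding only.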
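The error-handling strategies described above can be compared side by side. This sketch uses a made-up byte string containing 0xFF, which is not a valid GBK lead byte, rather than the full-width-space bytes mentioned in the text; it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
bad = b'ab\xffcd'  # hypothetical input; 0xFF cannot start a GBK character

# errors='strict' is the default and raises on the first illegal byte.
try:
    bad.decode('gbk')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = bad.decode('gbk', 'ignore')    # illegal byte dropped silently
replaced = bad.decode('gbk', 'replace')  # illegal byte becomes U+FFFD

# xmlcharrefreplace is only meaningful in the encoding direction.
xml_bytes = u'中'.encode('ascii', 'xmlcharrefreplace')

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
print(repr(xml_bytes))
```

Both ignore and replace lose information, so they suit display or logging better than data that will be written back to its source.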