Handling decode('utf-8') errors when processing garbled Chinese text in Python

Date: 2019-11-13

When a Python script processes Chinese text that contains garbled characters, calling decode('utf-8') on it keeps raising an error:

>>> txt_from = open('/home/love/ex130705.log')
>>> txt_from_iter= iter(txt_from)
>>> txt_proc = txt_from_iter.next().decode('utf-8')

Traceback (most recent call last):
  File "/tmp/py4049kjX", line 41, in <module>
    txt_proc = txt_from_iter.next().decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 84-85: invalid continuation byte

Parts of the source file being processed display as garbled characters:

2013-07-05 04:20:10 192.168.1.5 GET /Portals/0/鏁欒偛淇℃伅鏂囦欢澶校园<E5><BF> 80 - 25.XXX.10.99 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+5.1;+Trident/4.0;+Alexa+Toolbar) 404 0 2 234

2013-07-05 04:20:24 192.168.1.5 GET /Portals/0/鏁欒偛淇℃伅鏂囦欢澶校园<E5><BF> 80 - 25.XXX.10.99 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+5.1;+Trident/4.0;+Alexa+Toolbar) 404 0 2 296

These garbled Chinese characters were introduced by IIS while it was writing the log. When Python decodes such a line from UTF-8 with decode('utf-8'), it raises a UnicodeDecodeError.

Solution: use decode('utf-8', 'ignore'), which skips the bytes that cannot be decoded instead of raising:

>>>
>>> txt_proc = txt_from_iter.next().decode('utf-8', 'ignore')
>>>
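The same fix can be applied to a whole file rather than to one line at a time. The following is only a minimal sketch, assuming the log is mostly valid UTF-8; the helper name `read_lines_lenient` is hypothetical, and `io.open` is used here because it accepts `encoding` and `errors` on both Python 2.7 and Python 3:

```python
import io


def read_lines_lenient(path, encoding='utf-8'):
    """Yield decoded lines from path, dropping undecodable bytes.

    read_lines_lenient is a hypothetical helper name, not a library function.
    """
    # errors='ignore' makes the file object skip invalid byte sequences
    # instead of raising UnicodeDecodeError.
    with io.open(path, 'r', encoding=encoding, errors='ignore') as handle:
        for line in handle:
            # Strip only the trailing newline; keep other whitespace intact.
            yield line.rstrip(u'\n')
```

Note that 'ignore' discards the bad bytes silently, so the decoded lines can be shorter than the originals.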

The help text for decode:

>>> help("".decode)
decode(...)
    S.decode([encoding[,errors]]) -> object
    
    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
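To make the difference between the error handlers described in the help text concrete, here is a small sketch. The byte string is made up for illustration and is not taken from the real log: it is the UTF-8 encoding of two Chinese characters followed by a truncated three-byte sequence (E5 BF), similar to the `<E5><BF>` shown in the log lines above.

```python
# Illustrative bytes (an assumption, not the real log content): valid UTF-8
# for two Chinese characters, then a truncated sequence E5 BF, then " 80".
raw = b'/Portals/0/\xe6\xa0\xa1\xe5\x9b\xad\xe5\xbf 80'

try:
    raw.decode('utf-8')  # errors defaults to 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the invalid bytes silently.
ignored = raw.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD (the replacement character) for them.
replaced = raw.decode('utf-8', 'replace')

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

'replace' may be preferable when it matters to see where data was lost, since 'ignore' leaves no trace of the dropped bytes.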

Reference: http://blog.sina.com.cn/s/blog_8af1069601015et3.html