UnicodeEncodeError caused by Python's str()

Date: 2019-11-22

Background

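As everyone knows, UnicodeEncodeError and UnicodeDecodeError are among the trickier problems in Python 2. When one shows up, it often leaves you baffled. The author of "Fluent Python" even presents a so-called "sandwich model" to help deal with this class of problem (it really doesn't have to be that complicated, as discussed later).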

Today I ran into a small, related problem in production. It was interesting enough that I'm jotting it down in this quick post.

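When the bug landed on my desk, the symptom was, naturally, the cryptic message UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). I went through the logs and quickly found the offending line, which looked roughly like this: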

thrift_obj = ThriftKeyValue(key=str(xx_obj.name))  # failing line; xx_obj.name is a string

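At first glance I couldn't tell whether str(xx_obj.name) was a slip or deliberate; either way, it's not something I would write (every project probably has a bit of magic code like this).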

Analysis

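Taken literally, the exception says that some string is being encoded by the ASCII codec, but it contains characters outside the ASCII range, so the encoding fails. From that I inferred two things: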

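Somewhere there must be a unicode string (its exact content doesn't matter, as long as it contains non-ASCII characters); here that should be xx_obj.name.
Somewhere an encoding step is taking place, and it is happening silently (implicit conversions like this are the most annoying kind, and Python 2 has plenty of them); the code doesn't make it obvious where.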

After looking the line over, the culprit had to be the built-in str(), so I tried a quick experiment:

In [5]: u = u'中国'

In [6]: str(u)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-6-b3b94fb7b5a0> in <module>()
----> 1 str(u)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

In [7]: b = u.encode('utf-8')

In [8]: str(b)
Out[8]: '\xe4\xb8\xad\xe5\x9b\xbd'

Sure enough. I checked the documentation, but it says nothing useful on this point; the description is too vague:

class str(object='')
Return a string containing a nicely printable representation of an object. For strings, this returns the string itself. The difference with repr(object) is that str(object) does not always attempt to return a string that is acceptable to eval(); its goal is to return a printable string. If no argument is given, returns the empty string, ''.

For more information on strings see Sequence Types — str, unicode, list, tuple, bytearray, buffer, xrange which describes sequence functionality (strings are sequences), and also the string-specific methods described in the String Methods section. To output formatted strings use template strings or the % operator described in the String Formatting Operations section. In addition see the String Services section. See also unicode().
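
The excerpt doesn't spell it out, but when Python 2's str() receives a unicode object, it ends up encoding it with the interpreter's default encoding, which is 'ascii' on a stock Python 2. The snippet below is only a sketch of that hidden step: implicit_py2_str is a hypothetical name introduced for illustration, and the explicit .encode() call stands in for the implicit conversion so that the snippet behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def implicit_py2_str(text, default_encoding='ascii'):
    """Hypothetical stand-in for what Python 2's str() does to a unicode object.

    Assumption: the default encoding is 'ascii' (the stock Python 2 setting).
    """
    return text.encode(default_encoding)


try:
    implicit_py2_str(u'中国')
except UnicodeEncodeError as exc:
    # The exception records which codec failed and the slice that could not
    # be encoded; the message shows the last index inclusively ("0-1").
    print(exc.encoding, exc.start, exc.end, exc.reason)
```

The codec name in the exception ('ascii') is the clue that an implicit conversion is involved, since nothing in the failing line asks for ASCII.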

In our codebase (Python 2), every .py file contains this line:

from __future__ import unicode_literals, absolute_import

So I guessed that xx_obj.name was a unicode string. I logged it to check, and it was.

Fix

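At this point, one option was to convert xx_obj.name into something str() can handle: it must not be unicode, it should be bytes. I didn't do that because it's too ugly. Instead I changed the line to: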

thrift_obj = ThriftKeyValue(key=xx_obj.name)  # no need to call str() here; it probably worked before only because name happened to always be ASCII

The bug was fixed, and everything else kept working normally.
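
To see why the old line could run for a long time before failing, here is a rough comparison of the two paths. Both function names are hypothetical, the Thrift object is left out entirely, and the old path is emulated with an explicit .encode('ascii') (an assumption standing in for Python 2's implicit str() conversion) so that the sketch also runs on Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def build_key_old(name):
    # Emulates key=str(name) on Python 2 when name is a unicode string.
    return name.encode('ascii')


def build_key_new(name):
    # The fixed version passes the text through untouched.
    return name


for name in (u'user_42', u'中国'):
    try:
        build_key_old(name)
        outcome = 'ok'
    except UnicodeEncodeError:
        outcome = 'UnicodeEncodeError'
    print(repr(name), '| old path:', outcome,
          '| new path keeps text:', build_key_new(name) == name)
```

An ASCII-only name passes through the old path without complaint, so the problem stays hidden until the first non-ASCII name arrives.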

Summary

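As mentioned above, Python 2 performs many implicit conversions like this, with little documentation. Add a Windows environment and print, and the error messages become even more baffling. "Fluent Python" describes the so-called "sandwich model" for dealing with this, and it is quite instructive.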
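
As a rough illustration of the sandwich idea (decode bytes as early as possible, work only with text in the middle, encode as late as possible), here is a minimal sketch. The function greet and its behavior are made up for this example, and UTF-8 is assumed on both sides.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def greet(raw_name, encoding='utf-8'):
    """Hypothetical handler: bytes in, bytes out, text everywhere in between."""
    name = raw_name.decode(encoding)      # bottom slice: bytes -> text on input
    message = '你好，' + name.strip()      # filling: pure text processing
    return message.encode(encoding)       # top slice: text -> bytes on output


reply = greet('  中国 '.encode('utf-8'))
```

Because every conversion is explicit and sits at the boundary, there is no place left for an implicit ASCII encode or decode to sneak in.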

That said, the rule I usually follow is: use Unicode only, and make everything Unicode everywhere. Here is how:

Every .py file must start with the following header:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#

from __future__ import unicode_literals, absolute_import

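Byte strings received from the outside world (network, files, and so on) are converted to Unicode first. It's best to factor this into a function to avoid repeating the conversion code: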

The API names are a bit verbose, mainly so that each name says what it does.


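# Assumed definition: the original post uses this exception but does not show it.
class DecodeBytesFailedException(Exception):
    """Raised when none of the candidate encodings can decode the bytes."""


class UnicodeUtils(object):
    @classmethod
    def get_unicode_str(cls, bytes_str, try_decoders=('utf-8', 'gbk', 'utf-16')):
        """Convert to a text string (normally unicode)."""

        if not bytes_str:
            return u''

        if isinstance(bytes_str, (unicode,)):
            return bytes_str

        # Try each candidate encoding in order; the first one that decodes wins.
        for decoder in try_decoders:
            try:
                unicode_str = bytes_str.decode(decoder)
            except UnicodeDecodeError:
                pass
            else:
                return unicode_str

        raise DecodeBytesFailedException('decode bytes failed. tried decoders: %s' % list(try_decoders))

    @classmethod
    def encode_to_bytes(cls, unicode_str, encoder='utf-8'):
        """Convert to a byte string."""

        if unicode_str is None:
            return b''

        if isinstance(unicode_str, unicode):
            return unicode_str.encode(encoding=encoder)
        else:
            # Not unicode yet: normalize to unicode first, then encode.
            # (The original called cls.get_unicode, which does not exist.)
            u = cls.get_unicode_str(unicode_str)
            return u.encode(encoding=encoder)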

Everything sent to the outside world is converted to a UTF-8 encoded byte string; see the code above.
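
The decode-with-fallback approach can be tried in isolation with the sketch below. It is a standalone variant written to run on both Python 2 and Python 3, not the class above: decode_with_fallback is a hypothetical name, and it raises a plain ValueError rather than a custom exception.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def decode_with_fallback(data, encodings=('utf-8', 'gbk', 'utf-16')):
    """Return text for `data`, trying each encoding in order (hypothetical helper)."""
    if isinstance(data, type('')):
        # Already text (unicode on Python 2, str on Python 3).
        return data
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode bytes with any of: %s' % ', '.join(encodings))


# GBK bytes are not valid UTF-8, so the second candidate is used.
gbk_bytes = '中国'.encode('gbk')
print(decode_with_fallback(gbk_bytes) == '中国')

# Caveat: the first codec that accepts the bytes wins, even if it is wrong.
# BOM-less UTF-16 bytes for 'hi' are also valid UTF-8 (with NUL characters).
utf16_bytes = 'hi'.encode('utf-16-le')
print(repr(decode_with_fallback(utf16_bytes)))
```

The order of the candidate encodings therefore matters: a guess-by-trial decoder is a convenience for input of unknown origin, and a known encoding should be passed explicitly whenever the source declares one.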