Encoding issues in the Requests library, and the Python encoding issues they lead to

Date 2019-11-05


Requests encoding

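When I used requests to call a WeChat API, I found that requests sets the character encoding based solely on the information in the HTTP headers. The documentation reads as follows: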

Response.text
Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using chardet.
The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.

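In other words, we have another option: when the server does not specify a charset, set the encoding as shown below and only then read text, which comes back as unicode.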

r.encoding = 'utf-8'
r.text
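
Why setting the encoding matters can be illustrated without any network access. The sketch below assumes a response body that is UTF-8 encoded while no charset is declared. It compares decoding with ISO-8859-1 (which, as far as I know, is the fallback requests applies to text/* responses without a charset) against decoding with UTF-8. The variable names are made up for illustration and are not part of the requests API.

```python
# Bytes as a server might send them: UTF-8 encoded, with no charset declared.
body = "哈哈".encode("utf-8")

# Decoding with the wrong charset does not raise; it silently produces mojibake.
wrong = body.decode("iso-8859-1")

# Decoding with the charset we know to be correct recovers the original text.
right = body.decode("utf-8")

print(repr(wrong))
print(right)
```

With requests, assigning r.encoding = 'utf-8' before reading r.text has the same effect as choosing the second decode here.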

For discussion of this topic, see here. By contrast, urllib2.urlopen('http://www.baidu.com').read() returns a str, which in Python 2 is a byte string.

Python 2 encoding issues

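# In Python 2 a plain string literal is a byte string, i.e. type str. Here 哈哈 is stored as its UTF-8 encoded form (the terminal encoding in this session), \xe5\x93\x88\xe5\x93\x88, so len(x) == 6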
>>> x="哈哈"
>>> x
'\xe5\x93\x88\xe5\x93\x88'
>>> type(x)
<type 'str'>
# Because 哈哈 was stored in the str as UTF-8 encoded bytes, it has to be decoded to get the text back; decoding yields a unicode string
>>> x.decode('utf-8')
u'\u54c8\u54c8'
# The u prefix makes 呵呵 a unicode object when it is stored, so y is a true text string; encode converts it to a str
>>> y=u"呵呵"
>>> y
u'\u5475\u5475'
>>> type(y)
<type 'unicode'>
>>> y.encode('utf-8')
'\xe5\x91\xb5\xe5\x91\xb5'
>>> type(y.encode('utf-8'))
<type 'str'>

Python 3 encoding issues

>>> x='哈哈'
# The Python 3 str type stores what Python 2 calls a unicode string, i.e. a true text string
>>> type(x)
<class 'str'>
# As in Python 2, encode converts the unicode text into a byte string of type bytes, which corresponds to the Python 2 str type.
>>> y = x.encode('utf-8')
>>> y
b'\xe5\x93\x88\xe5\x93\x88'
>>> type(y)
<class 'bytes'>
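
The difference between the two Python 3 types shows up in their lengths and in the fact that they cannot be mixed implicitly. The following is a small sketch under Python 3; the variable names are chosen only for illustration.

```python
text = "哈哈"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: the UTF-8 encoded form of the text

print(len(text))  # counts characters: 2
print(len(data))  # counts bytes: 6

# Decoding reverses encoding.
round_trip = data.decode("utf-8")

# Python 3 refuses to concatenate str and bytes; convert explicitly instead.
mixed_error = None
try:
    text + data
except TypeError as exc:
    mixed_error = exc
    print("cannot mix str and bytes:", exc)
```

Python 2 would attempt an implicit ASCII conversion when the two types are mixed, so the explicit TypeError in Python 3 tends to surface encoding mistakes earlier.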

Summary

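Python 2 str and Python 3 bytes are the same thing
Python 2 unicode and Python 3 str are the same thing
Encoding a text string yields a byte string; decoding a byte string yields a text string
When opening a file, codecs.open() lets you specify the encoding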
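
The last point can be sketched as follows, assuming Python 3. The scratch file path is hypothetical and created only for this demonstration. The example writes with codecs.open() and reads back with the built-in open(), which in Python 3 accepts the same encoding argument and is generally the preferred choice there.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# codecs.open() encodes text to the given encoding as it is written.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write("哈哈")

# The built-in open() takes the same encoding argument and decodes on read.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Reading in binary mode shows the encoded bytes actually stored on disk.
with open(path, "rb") as f:
    raw = f.read()

print(repr(text))
print(repr(raw))
```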