This document covers some of Requests' more advanced features.

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the same Session instance.

A Session object has all the methods of the main Requests API.

Let's persist some cookies across requests:
s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
Sessions can also be used to provide default data to the request methods. This is done by providing data to the properties of a Session object:
s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent
s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
Any dictionaries that you pass to a request method will be merged with the session-level values that are set. The method-level parameters override the session parameters.
Removing a value from a dict parameter

Sometimes you'll want to omit session-level keys from a dict parameter. To do this, you simply set that key's value to None in the method-level parameter, and the key will automatically be omitted.
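As a sketch of that merging behaviour, Session.prepare_request() performs the same merge without sending anything over the network, so the effect of a None-valued key can be observed directly (the httpbin.org URL below is only a placeholder; no request is actually made):

```python
import requests

s = requests.Session()
s.headers.update({'x-test': 'true', 'x-unwanted': 'yes'})

# Setting a session-level key to None at the method level omits it entirely.
req = requests.Request('GET', 'http://httpbin.org/headers',
                       headers={'x-unwanted': None})
prepped = s.prepare_request(req)

print('x-test' in prepped.headers)      # True: the session-level header survives
print('x-unwanted' in prepped.headers)  # False: the None-valued key was dropped
```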
All values that are contained within a session are directly available to you. See the Session API docs to learn more.
Whenever you call requests.*() you are doing two major things. First, you are constructing a Request object which will be sent off to a server to request or query some resource. Second, a Response object is generated once Requests gets a response back from the server. The Response object contains all of the information returned by the server and also contains the Request object you created originally. Here is a simple request to get some very important information from Wikipedia's servers:
>>> r = requests.get('http://en.wikipedia.org/wiki/Monty_Python')
If we want to access the headers the server sent back to us, we do this:
>>> r.headers
{'content-length': '56170', 'x-content-type-options': 'nosniff', 'x-cache': 'HIT from cp1006.eqiad.wmnet, MISS from cp1010.eqiad.wmnet', 'content-encoding': 'gzip', 'age': '3080', 'content-language': 'en', 'vary': 'Accept-Encoding,Cookie', 'server': 'Apache', 'last-modified': 'Wed, 13 Jun 2012 01:33:50 GMT', 'connection': 'close', 'cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'date': 'Thu, 14 Jun 2012 12:59:39 GMT', 'content-type': 'text/html; charset=UTF-8', 'x-cache-lookup': 'HIT from cp1006.eqiad.wmnet:3128, MISS from cp1010.eqiad.wmnet:80'}
However, if we want to get the headers we sent the server, we simply access the request, and then the request's headers:
>>> r.request.headers
{'Accept-Encoding': 'identity, deflate, compress, gzip', 'Accept': '*/*', 'User-Agent': 'python-requests/0.13.1'}
Whenever you receive a Response object from an API call or a Session call, the request attribute is actually the PreparedRequest that was used. In some cases you may wish to do some extra work to the body or headers (or anything else really) before sending the request. The simple recipe for this is the following:
from requests import Request, Session

s = Session()
req = Request('GET', url,
    data=data,
    headers=headers
)
prepped = req.prepare()

# do something with prepped.body
# do something with prepped.headers

resp = s.send(prepped,
    stream=stream,
    verify=verify,
    proxies=proxies,
    cert=cert,
    timeout=timeout
)

print(resp.status_code)
Since you are not doing anything special with the Request object, you prepare it immediately and modify the PreparedRequest object. You then send that with the other parameters you would have sent to requests.* or Session.*.

However, the above code will lose some of the advantages of having a Requests Session object. In particular, Session-level state such as cookies will not get applied to your request. To get a PreparedRequest with that state applied, replace the call to Request.prepare() with a call to Session.prepare_request(), like this:
from requests import Request, Session

s = Session()
req = Request('GET', url,
    data=data,
    headers=headers
)
prepped = s.prepare_request(req)

# do something with prepped.body
# do something with prepped.headers

resp = s.send(prepped,
    stream=stream,
    verify=verify,
    proxies=proxies,
    cert=cert,
    timeout=timeout
)

print(resp.status_code)
Requests can verify SSL certificates for HTTPS requests, just like a web browser. To check a host's SSL certificate, you can use the verify argument:
>>> requests.get('https://kennethreitz.com', verify=True)
requests.exceptions.SSLError: hostname 'kennethreitz.com' doesn't match either of '*.herokuapp.com', 'herokuapp.com'
I don't have SSL set up on that domain, so it fails. GitHub does though:
>>> requests.get('https://github.com', verify=True)
<Response [200]>
For private certificates, you can also pass verify the path to a CA_BUNDLE file. You can also set the REQUESTS_CA_BUNDLE environment variable.
Requests can also ignore verifying the SSL certificate if you set verify to False.
>>> requests.get('https://kennethreitz.com', verify=False)
<Response [200]>
By default, verify is set to True. The verify option only applies to host certificates.

You can also specify a local cert to use as a client-side certificate, either as a single file (containing the private key and the certificate) or as a tuple of both files' paths:
>>> requests.get('https://kennethreitz.com', cert=('/path/server.crt', '/path/key'))
<Response [200]>
If you specify a wrong path or an invalid cert:
>>> requests.get('https://kennethreitz.com', cert='/wrong_path/server.pem')
SSLError: [Errno 336265225] _ssl.c:347: error:140B0009:SSL routines:SSL_CTX_use_PrivateKey_file:PEM lib
By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute, using the stream parameter:
tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)
At this point only the response headers have been downloaded and the connection remains open, hence allowing us to make content retrieval conditional:
if int(r.headers['content-length']) < TOO_LONG:
    content = r.content
    ...
You can further control the workflow by use of the Response.iter_content and Response.iter_lines methods, or read from the underlying urllib3 urllib3.HTTPResponse at Response.raw.
If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close, which can make connections inefficient. If you find yourself partially reading response bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing, like this:
from contextlib import closing

with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    # Do things with the response here.
Good news - thanks to urllib3, keep-alive is 100% automatic within a session! Any requests that you make within a session will automatically reuse the appropriate connection!

Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.

Requests supports streaming uploads, which allow you to send large streams or files without reading them into memory. To stream an upload, simply provide a file-like object for your body:
with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)
Requests also supports chunked transfer encoding for outgoing and incoming requests. To send a chunk-encoded request, simply provide a generator (or any iterator without a length) for your body:
def gen():
    yield 'hi'
    yield 'there'

requests.post('http://some.url/chunked', data=gen())
You can send multiple files in one request. For example, suppose you want to upload image files to an HTML form with a multiple file field 'images':
<input type="file" name="images" multiple="true" required="true"/>
To do that, just set files to a list of tuples of (form_field_name, file_info):
>>> url = 'http://httpbin.org/post'
>>> multiple_files = [('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),
                      ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))]
>>> r = requests.post(url, files=multiple_files)
>>> r.text
{
  ...
  'files': {'images': 'data:image/png;base64,iVBORw ....'}
  'Content-Type': 'multipart/form-data; boundary=3131623adb2043caaeb5538cc7aa0b3a',
  ...
}
Requests has a hook system that you can use to manipulate portions of the request process, or signal event handling.

Available hooks:

response:
    The response generated from a Request.

You can assign a hook function on a per-request basis by passing a {hook_name: callback_function} dictionary to the hooks request parameter:
hooks=dict(response=print_url)
That callback_function will receive a chunk of data as its first argument.
def print_url(r):
    print(r.url)
If an error occurs while executing your callback, a warning is given.

If the callback function returns a value, it is assumed that it is to replace the data that was passed in. If the function doesn't return anything, nothing else is affected.
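The replacement behaviour can be seen in isolation by calling requests.hooks.dispatch_hook, the internal helper Requests uses to run hooks (an implementation detail, used here purely for illustration):

```python
from requests.hooks import dispatch_hook

def shout(data, **kwargs):
    # Returning a value replaces the data that was passed in.
    return data.upper()

def quiet(data, **kwargs):
    # Returning nothing leaves the data untouched.
    pass

print(dispatch_hook('response', {'response': shout}, 'hello'))  # HELLO
print(dispatch_hook('response', {'response': quiet}, 'hello'))  # hello
```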
Let's print some request method arguments at runtime:
>>> requests.get('http://httpbin.org', hooks=dict(response=print_url))
http://httpbin.org
<Response [200]>
Requests allows you to use your own authentication mechanism.

Any callable which is passed as the auth argument to a request method will have the opportunity to modify the request before it is dispatched.

Authentication implementations are subclasses of requests.auth.AuthBase, and are easy to define.

Requests provides two common authentication scheme implementations in requests.auth: HTTPBasicAuth and HTTPDigestAuth.

Let's pretend that we have a web service that will only respond if the X-Pizza header is set to a password value. Unlikely, but just go with it.
from requests.auth import AuthBase

class PizzaAuth(AuthBase):
    """Attaches HTTP Pizza Authentication to the given Request object."""
    def __init__(self, username):
        # setup any auth-related data here
        self.username = username

    def __call__(self, r):
        # modify and return the request
        r.headers['X-Pizza'] = self.username
        return r
Then, we can make a request using our PizzaAuth:
>>> requests.get('http://pizzabin.org/admin', auth=PizzaAuth('kenneth'))
<Response [200]>
With requests.Response.iter_lines() you can easily iterate over streaming APIs such as the Twitter Streaming API. Simply set stream to True and iterate over the response with iter_lines():
import json
import requests

r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(json.loads(line))
If you need to use a proxy, you can configure individual requests with the proxies argument to any request method:
import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)
You can also configure proxies with the HTTP_PROXY and HTTPS_PROXY environment variables.
$ export HTTP_PROXY="http://10.10.1.10:3128"
$ export HTTPS_PROXY="http://10.10.1.10:1080"
$ python
>>> import requests
>>> requests.get("http://example.org")
To use HTTP Basic Auth with your proxy, use the http://user:password@host/ syntax:
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
Requests is intended to be compliant with all relevant specifications and RFCs where that compliance will not cause difficulties for users. This attention to the specification can lead to some behaviour that may seem unusual to those not familiar with the relevant specification.

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you call the Response.text method. Requests will first check for an encoding in the HTTP headers, and if none is present, will use charade to attempt to guess the encoding.

The only time Requests will not guess the encoding is when no explicit charset is present in the HTTP headers and the Content-Type header contains text.
In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1, and Requests follows the specification here. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content. (See the Response Content section of the earlier installation and quickstart article for more.)
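As a small sketch of overriding the encoding by hand (the Response below is constructed manually through the internal _content attribute, purely for illustration; no request is sent):

```python
import requests

r = requests.Response()
r._content = 'Caf\xe9'.encode('iso-8859-1')  # pretend these bytes came from the server
r.headers['Content-Type'] = 'text/plain'     # no explicit charset in the header

r.encoding = 'ISO-8859-1'  # tell Requests how Response.text should decode the body
print(r.text)     # Café
print(r.content)  # b'Caf\xe9' -- the raw bytes are always available
```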
Requests provides access to almost the full range of HTTP verbs: GET, OPTIONS, HEAD, POST, PUT, PATCH and DELETE. The following provides detailed examples of using each of these verbs in Requests, with the GitHub API.

We will begin with the verb most commonly used: GET. HTTP GET is an idempotent method that returns a resource from a given URL. As a result, it is the verb you ought to use when attempting to retrieve data from a web location. An example usage would be attempting to get information about a specific commit from GitHub. Suppose we wanted commit a050faf on Requests. We would get it like so:
>>> import requests
>>> r = requests.get('https://api.github.com/repos/kennethreitz/requests/git/commits/a050faf084662f3a352dd1a941f2c7c9f886d4ad')
We should confirm that GitHub responded correctly. If it has, we want to work out what type of content it is. Do this like so:
>>> if (r.status_code == requests.codes.ok):
...     print r.headers['content-type']
...
application/json; charset=utf-8
So, GitHub has returned JSON. Great, we can use the r.json() method to parse it into Python objects.
>>> commit_data = r.json()
>>> print commit_data.keys()
[u'committer', u'author', u'url', u'tree', u'sha', u'parents', u'message']
>>> print commit_data[u'committer']
{u'date': u'2012-05-10T11:10:50-07:00', u'email': u'me@kennethreitz.com', u'name': u'Kenneth Reitz'}
>>> print commit_data[u'message']
makin' history
So far, so simple. Well, let's investigate the GitHub API a little bit. Now, we could look at the documentation, but we might have a little more fun if we use Requests instead. We can take advantage of the Requests OPTIONS verb to see what kinds of HTTP methods are supported on the url we just used.
>>> verbs = requests.options(r.url)
>>> verbs.status_code
500
Uh, what? That's unhelpful! Turns out GitHub, like many API providers, doesn't actually implement the OPTIONS method. This is an annoying oversight, but it's OK, we can just use the boring documentation. If GitHub had correctly implemented OPTIONS, however, they should return the allowed methods in the headers, e.g.
>>> verbs = requests.options('http://a-good-website.com/api/cats')
>>> print verbs.headers['allow']
GET,HEAD,POST,OPTIONS
Turning to the documentation, we see that the only other method allowed for commits is POST, which creates a new commit. As we're using the Requests repo, we should probably avoid making ham-handed POSTs to it. Instead, let's play with the Issues feature of GitHub.

This documentation was added in response to Issue #482. Given that this issue already exists, we will use it as an example. Let's start by getting it.
>>> r = requests.get('https://api.github.com/repos/kennethreitz/requests/issues/482')
>>> r.status_code
200
>>> issue = json.loads(r.text)
>>> print issue[u'title']
Feature any http verb in docs
>>> print issue[u'comments']
3
Cool, we have three comments. Let's take a look at the last of them.
>>> r = requests.get(r.url + u'/comments')
>>> r.status_code
200
>>> comments = r.json()
>>> print comments[0].keys()
[u'body', u'url', u'created_at', u'updated_at', u'user', u'id']
>>> print comments[2][u'body']
Probably in the "advanced" section
Well, that seems like a silly place. Let's post a comment telling the poster that he's silly. Who is the poster, anyway?
>>> print comments[2][u'user'][u'login']
kennethreitz
OK, so let's tell this Kenneth guy that we think this example should go in the quickstart guide instead. According to the GitHub API docs, the way to do this is to POST to the thread. Let's do it.
>>> body = json.dumps({u"body": u"Sounds great! I'll get right on it!"})
>>> url = u"https://api.github.com/repos/kennethreitz/requests/issues/482/comments"
>>> r = requests.post(url=url, data=body)
>>> r.status_code
404
Huh, that's weird. We probably need to authenticate. That'll be a pain, right? Wrong. Requests makes it easy to use many forms of authentication, including the very common Basic Auth.
>>> from requests.auth import HTTPBasicAuth
>>> auth = HTTPBasicAuth('fake@example.com', 'not_a_real_password')
>>> r = requests.post(url=url, data=body, auth=auth)
>>> r.status_code
201
>>> content = r.json()
>>> print(content[u'body'])
Sounds great! I'll get right on it.
Brilliant. Oh, wait, no! I meant to add that it would take me a while, because I had to go feed my cat. If only I could edit this comment! Happily, GitHub allows us to use another HTTP verb, PATCH, to edit this comment. Let's do that.
>>> print(content[u"id"])
5804413
>>> body = json.dumps({u"body": u"Sounds great! I'll get right on it once I feed my cat."})
>>> url = u"https://api.github.com/repos/kennethreitz/requests/issues/comments/5804413"
>>> r = requests.patch(url=url, data=body, auth=auth)
>>> r.status_code
200
Excellent. Now, just to torture this Kenneth guy, I've decided to let him sweat and not tell him that I'm the one behind this. That means I want to delete this comment. GitHub lets us delete comments using the incredibly aptly named DELETE method. Let's get rid of it.
>>> r = requests.delete(url=url, auth=auth)
>>> r.status_code
204
>>> r.headers['status']
'204 No Content'
Excellent. All gone. The last thing I want to know is how much of my ratelimit I've used. Let's find out. GitHub sends that information in the headers, so rather than download the whole page I'll send a HEAD request to get the headers.
>>> r = requests.head(url=url, auth=auth)
>>> print r.headers
...
'x-ratelimit-remaining': '4995'
'x-ratelimit-limit': '5000'
...
Excellent. Time to write a Python program that abuses the GitHub API in all kinds of exciting ways, 4995 more times.
Many HTTP APIs feature Link headers. They make APIs more self-describing and discoverable.

GitHub uses these for pagination in their API, for example:
>>> url = 'https://api.github.com/users/kennethreitz/repos?page=1&per_page=10'
>>> r = requests.head(url=url)
>>> r.headers['link']
'<https://api.github.com/users/kennethreitz/repos?page=2&per_page=10>; rel="next", <https://api.github.com/users/kennethreitz/repos?page=6&per_page=10>; rel="last"'
Requests will automatically parse these link headers and make them easily consumable:
>>> r.links["next"]
{'url': 'https://api.github.com/users/kennethreitz/repos?page=2&per_page=10', 'rel': 'next'}
>>> r.links["last"]
{'url': 'https://api.github.com/users/kennethreitz/repos?page=7&per_page=10', 'rel': 'last'}
As of v1.0.0, Requests has moved to a modular internal design. Part of the reason this was done was to implement Transport Adapters, originally described here. Transport Adapters provide a mechanism to define interaction methods for an HTTP service. In particular, they allow you to apply per-service configuration.
Requests ships with a single Transport Adapter, the HTTPAdapter. This adapter provides the default Requests interaction with HTTP and HTTPS using the powerful urllib3 library. Whenever a Requests Session is initialized, one of these is attached to the Session object for HTTP, and one for HTTPS.
Requests enables users to create and use their own Transport Adapters that provide specific functionality. Once created, a Transport Adapter can be mounted to a Session object, along with an indication of which web services it should apply to.
>>> s = requests.Session() >>> s.mount('http://www.github.com', MyAdapter())
The mount call registers a specific instance of a Transport Adapter to a prefix. Once mounted, any HTTP request made using that session whose URL starts with the given prefix will use the given Transport Adapter.
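A short sketch of that prefix matching, using Session.get_adapter (the lookup Requests performs for each request), so nothing is sent over the network; the max_retries tuning below is just a hypothetical example of per-service configuration:

```python
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
github_adapter = HTTPAdapter(max_retries=5)  # hypothetical per-service tuning
s.mount('https://github.com', github_adapter)

# The longest matching prefix wins over the default 'https://' adapter.
print(s.get_adapter('https://github.com/kennethreitz/requests') is github_adapter)  # True
print(s.get_adapter('https://example.org/') is github_adapter)                      # False
```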
Many of the details of implementing a Transport Adapter are beyond the scope of this documentation, but take a look at the next example for a simple SSL use-case. For more than that, you might look at subclassing requests.adapters.BaseAdapter.
The Requests team has made a specific choice to use whatever SSL version is default in the underlying library (urllib3). Normally this is fine, but from time to time, you might find yourself needing to connect to a service-endpoint that uses a version that isn’t compatible with the default.
You can use Transport Adapters for this by taking most of the existing implementation of HTTPAdapter, and adding a parameter ssl_version that gets passed-through to urllib3. We’ll make a TA that instructs the library to use SSLv3:
import ssl

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager

class Ssl3HttpAdapter(HTTPAdapter):
    """"Transport adapter" that allows us to use SSLv3."""

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_SSLv3)
With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests above) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.
If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python’s asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
Most requests to external servers should have a timeout attached, in case the server is not responding in a timely manner. Without a timeout, your code may hang for minutes or more.
The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a remote machine (corresponding to the connect() call on the socket). It's a good practice to set connect timeouts to slightly larger than a multiple of 3, which is the default TCP packet retransmission window.
Once your client has connected to the server and sent the HTTP request, the read timeout is the number of seconds the client will wait for the server to send a response. (Specifically, it’s the number of seconds that the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte).
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.
r = requests.get('https://github.com', timeout=None)
By default Requests bundles a set of root CAs that it trusts, sourced from the Mozilla trust store. However, these are only updated once for each Requests version. This means that if you pin a Requests version your certificates can become extremely out of date.
From Requests version 2.4.0 onwards, Requests will attempt to use certificates from certifi if it is present on the system. This allows for users to update their trusted certificates without having to change the code that runs on their system.
For the sake of security we recommend upgrading certifi frequently!
Translator's note: I translated the parts of the official documentation that had not yet been translated; the last sections are still in English because it got late and I ran out of energy, and I may translate them when I have time. Some of my sentences may read awkwardly, but they should get the meaning across. If you compare this against the official documentation and think you can translate something better, feel free to send me a private message or leave a comment.

And to anyone tempted to flame me, save it: yes, this article and the earlier one on installing and using Requests were ported from the official site, but I spent time translating part of it and a fair amount of effort on the layout. It is written in Markdown; the source md file is available on request, and this article may be shared freely.

I'm Akkuman. Fellow enthusiasts are welcome to get in touch by private message or comment. My blog: hacktech.cn | akkuman.cnblogs.com