requests库使用

介绍:

发送HTTP请求的第三方库,比起之前用到的urllib,requests模块的API更加便捷(本质就是封装了urllib3)
安装:pip3 install requests
学习requests前,可以先熟悉一下HTTP协议
http://www.cnblogs.com/linhaifeng/p/6266327.html

GET请求:

# GET request with query-string parameters.
import requests
from urllib import parse

params = {'wd': '中国'}
# requests percent-encodes the params dict onto the URL for us
resp = requests.get('http://www.baidu.com/s?', params=params)
print(resp.url)
# decode the percent-encoded URL back to readable UTF-8
print(parse.unquote(resp.url))

>>输出
http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
http://www.baidu.com/s?wd=中国

GET请求->headers

一般我们在发送请求时都需要带上请求头,请求头是将自身伪装成浏览器的关键。

# Add headers: some sites (e.g. https://www.zhihu.com/explore) inspect the
# request headers and may reject clients that do not look like a browser.
import requests

response = requests.get('https://www.zhihu.com/explore')
print(response.status_code)  # rejected without a browser User-Agent (500 per the original author's run)

# Custom headers with a browser User-Agent.
# FIX: the original misspelled the variable as `respone`.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"
}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(response.status_code)  # 200 once the header is set

GET请求->cookies

# Reading cookies from a response.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
}
loginUrl = 'https://github.com/login'

# FIX: the original read `response.cookies` without ever making a request;
# fetch the login page first so `response` is actually defined.
response = requests.get(loginUrl, headers=headers)

# the server's Set-Cookie values are exposed as a RequestsCookieJar
cookies = response.cookies
print('cookies=>', cookies)

GET请求->代理

# Routing a request through an HTTP(S) proxy.
import requests

# FIX: the original relied on a `headers` dict defined in an earlier snippet;
# define it here so the example is self-contained.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
}
proxies = {
    'http': '111.47.220.67:8080',
    'https': '111.47.220.67:8080',
}
response = requests.get('https://www.zhihu.com/explore',
                        proxies=proxies, headers=headers, verify=False)

print(response.status_code)

GET请求->超时设置

# Timeout handling.
import requests

# timeout (seconds) limits how long requests waits for the server;
# a requests.exceptions.Timeout is raised if it is exceeded
resp = requests.get('https://www.baidu.com', timeout=1)
print(resp.status_code)

response

response属性

# Commonly used Response attributes.
import requests

r = requests.get('http://www.jianshu.com')
print(r.text)                # body decoded to str
print(r.content)             # raw body as bytes
print(r.status_code)         # HTTP status code, e.g. 200
print(r.headers)             # response headers
print(r.cookies)             # cookie jar
print(r.cookies.get_dict())  # cookies as a plain dict
print(r.cookies.items())     # cookies as (name, value) pairs
print(r.url)                 # final URL after redirects
print(r.history)             # redirect chain
print(r.encoding)            # encoding used to produce .text

编码问题

# How requests picks a text encoding.
import requests

r = requests.get('http://www.autohome.com/news')
print(r.headers['Content-Type'])  # text/html -- no charset given
# without a charset in Content-Type, requests falls back to ISO-8859-1
print(r.encoding)  # ISO-8859-1; .text is decoded with this encoding
r.encoding = 'GBK'  # the page is actually GB2312-encoded; override or .text is garbled
print(r.text)

r = requests.get('https://www.jianshu.com')
print(r.headers['Content-Type'])  # text/html; charset=utf-8
# when the header carries an explicit charset, requests uses it directly
print(r.encoding)  # utf-8; .text decoded accordingly
print(r.text)  # already utf-8, no need to set r.encoding by hand

解析json

# Parsing a JSON response.
import requests
import json

resp = requests.get('http://httpbin.org/get')
manual = json.loads(resp.text)  # the long way round
direct = resp.json()            # shortcut built into requests
print(direct)
print(manual == direct)         # True

 获取二进制数据

# Downloading binary data.
import requests

response = requests.get('http://pic-bucket.nosdn.127.net/photo/0005/2018-02-26/DBIGGI954TM10005NOS.jpg')
with open('a.jpg', 'wb') as f:
    f.write(response.content)

# stream=True: fetch a little at a time -- for a huge file (say a 100 GB
# video), loading response.content into memory at once is not reasonable
response = requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',
                        stream=True)

with open('b.mp4', 'wb') as f:
    # FIX: iter_content() with no argument yields ONE byte per iteration,
    # which is painfully slow; pass a chunk size to read in 64 KiB pieces.
    for chunk in response.iter_content(chunk_size=64 * 1024):
        f.write(chunk)

 

基于POST请求

一、介绍

#GET请求
HTTP默认的请求方法就是GET
     * 没有请求体
     * 数据必须在1K以内!
     * GET请求数据会暴露在浏览器的地址栏中

GET请求常用的操作:
       1. 在浏览器的地址栏中直接给出URL,那么就一定是GET请求
       2. 点击页面上的超链接也一定是GET请求
       3. 提交表单时,表单默认使用GET请求,但可以设置为POST


#POST请求
(1). 数据不会出如今地址栏中
(2). 数据的大小没有上限
(3). 有请求体
(4). 请求体中若是存在中文,会使用URL编码!


#!!!requests.post()用法与requests.get()完全一致,特殊的是requests.post()有一个data参数,用来存放请求体数据

二、发送post请求,登陆github

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/2/27 20:42
# @Author  : hyang
# @Site    :
# @File    : request_github.py
# @Software: PyCharm

import re
import requests
import http.cookiejar as cookielib
from requests.packages import urllib3

'''
1. Target analysis
    Open https://github.com/login in a browser, submit a wrong account and
    password, and capture the traffic with Fiddler: the login is POSTed to
    https://github.com/session, the request carries the initial cookies,
    and the body contains:
        commit:Sign in
        utf8:[check mark]
        authenticity_token:lbI8IJCwGslZS8qJPnof5e7ZkCoSoMn6jmDTsL1r/m06NLyIbw7vCrpwrFAPzHMep3Tmf/TSJVoXWrvDZaVwxQ==
        login:908099665@qq.com
        password:123

2. Flow
    First GET https://github.com/login to obtain the initial cookie and the
    authenticity_token, then POST https://github.com/session with that
    cookie and a body of (authenticity_token, username, password, ...).
    The response carries the logged-in cookie.

    ps: if the password were sent encrypted, you could enter a wrong account
    with the right password and copy the encrypted value from the browser;
    GitHub sends it in the clear.
'''
import ssl
# work around "SSL: CERTIFICATE_VERIFY_FAILED" in some environments
ssl._create_default_https_context = ssl._create_unverified_context
urllib3.disable_warnings()  # silence the InsecureRequestWarning from verify=False

headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
}
loginUrl = 'https://github.com/login'
postUrl = 'https://github.com/session'

response = requests.get(loginUrl, headers=headers, verify=False)
# extract the hidden CSRF token from the login form
tokens = re.findall(r'<input name="authenticity_token" type="hidden" value="(.*?)" />', response.text)
# FIX: re.findall returns a LIST; the form expects the token string itself,
# not the list, so take the first match (empty string if none found).
authenticity_token = tokens[0] if tokens else ''
# keep the initial cookies to send back with the POST
cookies = response.cookies
print('cookies=>', cookies)
print('authenticity_token=>', authenticity_token)

email = '908099665@qq.com'
password = 'yanghaoXXXX'
post_data = {
        "commit": "Sign in",
        # NOTE(review): the captured request shows utf8 as a check mark;
        # the original posted "" -- confirm which value GitHub accepts
        "utf8": "",
        "authenticity_token": authenticity_token,
        "login": email,
        "password": password,
    }
response2 = requests.post(postUrl, data=post_data, headers=headers, verify=False, cookies=cookies)
print(response2.status_code)
print(response2.history)  # status codes of the redirects taken
print(response2.text)

 分析抓包

 

3. session的使用

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/2/26 23:31
# @Author  : hyang
# @Site    :
# @File    : request-github.py
# @Software: PyCharm


import re
import requests
import urllib3
import http.cookiejar as cookielib

import ssl
# work around "SSL: CERTIFICATE_VERIFY_FAILED" in some environments
ssl._create_default_https_context = ssl._create_unverified_context
urllib3.disable_warnings()  # silence the InsecureRequestWarning

headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
}

loginUrl = 'https://github.com/login'
postUrl = 'https://github.com/session'
profileUrl = 'https://github.com/settings/emails'
session = requests.session()  # carries cookies across requests automatically
# persist the session cookies to the file 'github_cookie'
session.cookies = cookielib.LWPCookieJar(filename='github_cookie')


def get_token():
    """Fetch the login page and return its hidden authenticity_token string."""
    response = session.get(loginUrl, headers=headers, verify=False)
    html = response.text
    tokens = re.findall(r'<input name="authenticity_token" type="hidden" value="(.*?)" />', html)
    print(tokens)
    # FIX: re.findall returns a LIST; the login form wants the token string
    # itself, so return the first match (empty string if none found).
    return tokens[0] if tokens else ''


def post_account(email, password):
    """Submit the login form and save the resulting cookies to disk."""
    post_data = {
            'commit': 'Sign in',
            'utf8': '',
            'authenticity_token': get_token(),
            'login': email,
            'password': password
        }
    response = session.post(postUrl, data=post_data, headers=headers)
    print(response.status_code)
    # persist cookies so later runs can skip the login
    session.cookies.save()


def load_cookie():
    """Load previously saved cookies; a missing or corrupt file is not fatal."""
    # FIX: was a bare `except:` that silently swallowed every error,
    # including KeyboardInterrupt; catch only the load failures.
    try:
        session.cookies.load(ignore_discard=True)
        print('cookie 获取成功')
    except (FileNotFoundError, cookielib.LoadError):
        print('cookie 获取不成功')


def isLogin():
    """Return True if the saved cookies grant access to the e-mail settings page."""
    load_cookie()
    response = session.get(profileUrl, headers=headers)
    return '908099665@qq.com' in response.text


if __name__ == "__main__":
    # fill in your own e-mail account and password
    post_account(email='908099665@qq.com', password='yanghaoXXXX')
    # verify whether login succeeded
    isLogin()

 重定向

By default Requests will perform location redirection for all verbs except HEAD.

We can use the history property of the Response object to track redirection.

The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

For example, GitHub redirects all HTTP requests to HTTPS:

>>> r = requests.get('http://github.com')

>>> r.url
'https://github.com/'

>>> r.status_code
200

>>> r.history
[<Response [301]>]
If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:

>>> r = requests.get('http://github.com', allow_redirects=False)

>>> r.status_code
301

>>> r.history
[]
If you're using HEAD, you can enable redirection as well:

>>> r = requests.head('http://github.com', allow_redirects=True)

>>> r.url
'https://github.com/'

>>> r.history
[<Response [301]>]

先看官网的解释

 

高级认证

# Certificate verification (most sites are https).
# FIX: the variable was misspelled `respone` throughout; renamed to `response`.
import requests
response = requests.get('https://www.12306.cn')  # for an ssl request the certificate is checked first; an invalid cert raises an error


# Variant 1: skip verification -- no error, but a warning is emitted
import requests
response = requests.get('https://www.12306.cn', verify=False)  # unverified; warns, returns 200
print(response.status_code)


# Variant 2: skip verification AND silence the warning
import requests
from requests.packages import urllib3
urllib3.disable_warnings()  # suppress the InsecureRequestWarning
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Variant 3: supply a client certificate
# Many https sites can be visited without a client cert (Zhihu, Baidu, ...);
# services restricted to authorized users require the issued cert/key pair.
import requests
response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))
print(response.status_code)

 文件上传

# File upload.
import requests

# FIX: the original opened the file and never closed it (leaked handle);
# a `with` block closes it once the upload finishes.
with open('a.pptx', 'rb') as fp:
    files = {'file': fp}
    response = requests.post('http://httpbin.org/post', files=files)
print(response.status_code)

 异常处理

# Exception handling.
import requests
# FIX: was `from requests.exceptions import *` (wildcard) plus a block of
# commented-out dead code; import exactly the classes that are caught.
from requests.exceptions import ReadTimeout, ConnectionError, RequestException

try:
    r = requests.get('http://www.baiduxxx.com', timeout=1)
except ReadTimeout:
    print('ReadTimeout')
except ConnectionError:  # network unreachable / DNS failure
    print('ConnectionError')
except RequestException:  # base class: any other requests error
    print('Error')
相关文章
相关标签/搜索