Web Scraping: The Requests Module

Date: 2019-12-05

Tags: crawler, requests, module. Category: web crawlers

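1. Getting started

1.1 Why use requests instead of urllib

requests is built on top of urllib3 and wraps it in a simpler API
requests exposes the same methods on Python 2 and Python 3 (current releases support Python 3 only)
requests is simple and easy to use
requests automatically decompresses gzip and deflate response bodies for us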

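1.2 What requests does

It sends network requests and returns the response data
References: the Chinese documentation and the API reference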

1.3 Example

import requests

response = requests.get('http://www.baidu.com')
# In Python 3, decode() with no arguments defaults to UTF-8
print(response.content.decode())
# response.text is decoded with the encoding that requests guesses from the HTTP headers;
# here the guess is wrong, so the output is garbled
print(response.text)
# The guessed encoding is ISO-8859-1
print(response.encoding)
# Switch to UTF-8
response.encoding = 'utf-8'
# Now the output is correct and matches response.content.decode()
print(response.text)

Differences between response.text and response.content
- response.text
  Type: str
  Decoding: uses the text encoding that requests guesses from the HTTP headers
  How to change the encoding: response.encoding = 'utf-8'
- response.content
  Type: bytes
  Decoding: none; these are the raw bytes
  How to decode: response.content.decode('utf8')
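
The garbled output in 1.3 can be reproduced without any network access. The sketch below is illustrative only: it uses a hand-written byte string (an assumption, not a real Baidu response) to show what happens when UTF-8 bytes are decoded as ISO-8859-1, the encoding requests falls back to when a text/* Content-Type header names no charset.

```python
# Hypothetical response body: UTF-8 encoded bytes, like response.content
raw = '百度一下'.encode('utf-8')

# Decoding with the wrong codec yields garbled text, like response.text in 1.3
garbled = raw.decode('iso-8859-1')

# Decoding with the right codec recovers the text
correct = raw.decode('utf-8')

print(type(raw).__name__, type(correct).__name__)
print(garbled == correct)

# ISO-8859-1 maps every byte to exactly one character, so the damage
# can be undone by re-encoding and decoding again as UTF-8
print(garbled.encode('iso-8859-1').decode('utf-8') == correct)
```

Setting response.encoding = 'utf-8' has the same effect as the last line: it tells requests which codec to apply to the unchanged bytes.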

1.4 Saving an image with requests

import requests

response = requests.get('https://www.baidu.com/img/bd_logo1.png?where=super')

# response.content is bytes, so the file is opened in binary mode
with open('img.png', 'wb') as f:
    f.write(response.content)

1.5 Commonly used response attributes

response.text
response.content
response.status_code

The status code

response.request.url

The URL that was requested

response.request.headers

The request headers

{
  'User-Agent': 'python-requests/2.19.1',
  'Accept-Encoding': 'gzip, deflate',
  'Accept': '*/*',
  'Connection': 'keep-alive'
}

response.headers

The response headers

{
  'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform',
  'Connection': 'Keep-Alive',
  'Content-Encoding': 'gzip',
  'Content-Type': 'text/html',
  'Date': 'Sun, 13 Jan 2019 02:15:14 GMT',
  'Last-Modified': 'Mon, 23 Jan 2017 13:27:32 GMT',
  'Pragma': 'no-cache',
  'Server': 'bfe/1.0.8.18',
  'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/',
  'Transfer-Encoding': 'chunked'
}

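1.6 Sending requests with headers and parameters

Headers

Headers are sent to imitate a browser, so that the server returns the same content it would give a browser

Form of headers: a dictionary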

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}

Usage: requests.get(url, headers=headers)

Parameters
- Example: www.baidu.com/s?wd=python
- Form of parameters: a dictionary
```
params = {
    "wd": "python"
}
```
- Usage: requests.get(url, params=params)

Code example:

import requests

url = 'http://www.baidu.com/s?'

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
}

params = {
  "wd": "python"
}

response = requests.get(url, headers=headers, params=params)

print(response.request.url)
print(response.status_code)

# Alternative: build the query string with str.format
url_param = 'http://www.baidu.com/s?wd={}'.format('python')
response1 = requests.get(url_param, headers=headers)
print(response1.request.url)
print(response1.status_code)
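
To see how params ends up in the URL without sending anything, a request can be built and prepared offline. This is a minimal sketch: the search term is a made-up value, and requests.Request(...).prepare() is used here only as an inspection aid, not as the way the article sends requests.

```python
import requests

# Hypothetical search term containing a space and non-ASCII characters
params = {"wd": "python 爬虫"}

# prepare() builds the request without sending it and applies the same
# URL encoding that requests.get(url, params=params) would apply
prepared = requests.Request('GET', 'http://www.baidu.com/s', params=params).prepare()

print(prepared.url)
```

Passing a dictionary leaves the percent-encoding to requests, which the str.format approach does not do by itself.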

1.7 A Tieba crawler

import requests


class TiebaSpider(object):
    def __init__(self, tieba_name):
        self.name = tieba_name
        self.url_tmp = 'https://tieba.baidu.com/f?kw=' + tieba_name + '&ie=utf-8&pn={}'
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0"
        }

    def get_url_list(self):
        # Equivalent loop form of the list comprehension below:
        # url_list = []
        # for i in range(1000):
        #     url_list.append(self.url_tmp.format(i * 50))
        # return url_list
        # For comparison, [i * 2 for i in range(3)] gives [0, 2, 4]
        return [self.url_tmp.format(i * 50) for i in range(1000)]

    def parse_url(self, url):
        print(url)
        response = requests.get(url, headers=self.headers)
        return response.content.decode()

    def save_html(self, html, page_num):
        file_path = '{}-page-{}.html'.format(self.name, page_num)
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(html)

    def run(self):
        # 1. Build the list of URLs
        url_list = self.get_url_list()
        # 2. Request each URL in turn; enumerate supplies the page number
        for page_num, url in enumerate(url_list, start=1):
            html = self.parse_url(url)
            # 3. Save the page
            self.save_html(html, page_num)


if __name__ == '__main__':
    # tieba_spider = TiebaSpider('李毅')
    tieba_spider = TiebaSpider('lol')
    tieba_spider.run()

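2. Advanced usage

2.1 Sending POST requests

When POST is needed:

Login and registration: POST keeps the submitted data out of the URL, which makes it safer than GET
Sending large amounts of text: POST puts the data in the request body, so it is not subject to URL length limits

Usage: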

response = requests.post("http://www.baidu.com/", data = data,headers=headers)

Note the difference from GET: GET takes params=data, while POST takes data=data. In both cases data is a dictionary.
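
The difference can be inspected offline by preparing both kinds of request without sending them. A minimal sketch, assuming a made-up form field kw; it is not the Baidu Translate example.

```python
import requests

data = {"kw": "hello"}  # hypothetical form field
url = "http://www.baidu.com/"

# prepare() builds the request without sending it
get_req = requests.Request('GET', url, params=data).prepare()
post_req = requests.Request('POST', url, data=data).prepare()

print(get_req.url)    # the data is appended to the URL as a query string
print(get_req.body)   # a GET built this way has no body
print(post_req.url)   # the URL is unchanged
print(post_req.body)  # the data is form-encoded into the request body
print(post_req.headers['Content-Type'])
```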

Example:
Baidu Translate API

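2.2 Using proxies

Reasons:

To make the server believe the requests are not all coming from one client
To keep our real address from being exposed

Proxy workflow:

The client sends the request to the proxy, the proxy forwards it to the target server, and the response comes back the same way.

Forward and reverse proxies:

Generally, if the client does not know the address of the final server, the proxy is a reverse proxy; if the client does know it, the proxy is a forward proxy.

Usage: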

requests.get("http://www.baidu.com", proxies = proxies)

proxies is a dictionary

proxies = { 
    "http": "http://12.34.56.79:9527", 
    "https": "https://12.34.56.79:9527", 
}

Private proxies

If the proxy requires HTTP Basic Auth, use the following format:

proxies = { 
  "http": "http://user:password@10.1.10.1:1234"
}

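Working with proxy IPs:

Prepare a batch of IP addresses as an IP pool and pick one at random for each request
How to pick a proxy IP at random so that less-used addresses are more likely to be chosen
- {"ip": ip, "times": 0}
- [{}, {}, {}, {}, {}]: sort this list of IPs by usage count
- Take the 10 least-used IPs and choose one of them at random
Check that the IP is usable
- Pass a timeout to requests to gauge the quality of the IP address
- Use an online proxy IP quality checker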
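
The selection steps above can be sketched as follows. This is one possible reading of the list, not a reference implementation: the pool addresses are placeholders rather than working proxies, and the helper names pick_proxy, to_proxies and is_usable are invented for this sketch.

```python
import random

import requests

# Hypothetical pool; the addresses are placeholders, not working proxies
ip_pool = [{"ip": "10.0.0.{}:8080".format(n), "times": 0} for n in range(1, 21)]


def pick_proxy(pool, candidates=10):
    """Pick at random among the `candidates` least-used entries and count the use."""
    least_used = sorted(pool, key=lambda item: item["times"])[:candidates]
    chosen = random.choice(least_used)
    chosen["times"] += 1
    return chosen


def to_proxies(entry):
    """Build the proxies dictionary that requests expects."""
    return {"http": "http://" + entry["ip"]}


def is_usable(entry, test_url="http://www.baidu.com", timeout=3):
    """Return True if a request through the proxy completes within `timeout` seconds."""
    try:
        response = requests.get(test_url, proxies=to_proxies(entry), timeout=timeout)
    except requests.RequestException:
        return False
    return response.status_code == 200


# Simulate 40 selections; is_usable is defined but not called here,
# because the placeholder addresses would simply time out
for _ in range(40):
    pick_proxy(ip_pool)

print(sorted(item["times"] for item in ip_pool))
```

Because every pick comes from the least-used entries, the usage counts stay close together instead of drifting apart as they would with a purely random choice.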

Example:

import requests
proxies = {
    "http": "http://119.101.113.180:9999"
}
response = requests.get("http://www.baidu.com", proxies=proxies)
print(response.status_code)

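2.3 Cookie and Session

Cookie data is stored in the client's browser; session data is stored on the server.
Cookies are not very secure: someone can analyze the cookies stored locally and use them for cookie spoofing.
A session is kept on the server for a certain period of time. As traffic grows, sessions take up more of the server's resources.
A single cookie can hold no more than 4 KB of data, and many browsers limit a site to at most 20 cookies.

Three ways to fetch a page that requires login

Instantiate a session, send the login POST request with it, then use it to fetch the post-login page
- Instantiate the session
- Use the session to send the login request; the cookies are stored in the session
- Use the same session to request pages that are only accessible after login; the session automatically sends the cookies it saved at login
Add a Cookie key to headers whose value is the cookie string
Pass a cookies argument to the request method; it takes a dictionary whose keys are the cookie names and whose values are the cookie values
- Send requests with a whole set of cookies by building a cookie pool

Session code example: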

import requests

session = requests.session()
url = 'http://www.renren.com/PLogin.do'
data = {
    "email": "****",
    "password": "*****"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
session.post(url, headers=headers, data=data)
response = session.get('http://www.renren.com/969487434/profile', headers=headers)
html = response.content.decode()

with open('renren1.html', 'w', encoding='utf-8') as f:
    f.write(html)

Example of adding a Cookie to headers:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Cookie": "anonymid=jr2yoyv4-l71jfd; depovince=SD; _r01_=1; JSESSIONID=abc5vZNDY5GXOfh79uKHw; ick_login=8e8d2154-31f7-47d6-afea-cd7da7f60cd7; ick=47b5a827-ecaf-4b44-ab57-c433e8f73b67; first_login_flag=1; ln_uact=13654252805; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; jebe_key=cf5d55ad-7eb1-4b50-848a-25d4c8081154%7C07a531353a345fda40d3ab252602e2f6%7C1547871575690%7C1%7C1547871574048; wp_fold=0; jebecookies=9724d2c7-5e9c-4be9-92d0-bf9b6dffd455|||||; _de=D0539E08F82219B3A527C713E360D2ED; p=7f8736045559e52d93420c14f063d70e4; t=522278d3c40436e9d5e7b3dc2650e55a4; societyguester=522278d3c40436e9d5e7b3dc2650e55a4; id=969487434; ver=7.0; xnsid=9ba2d506; loginfrom=null"
}
response = requests.get('http://www.renren.com/969487434/profile', headers=headers)
html = response.content.decode()

with open('renren2.html', 'w', encoding='utf-8') as f:
    f.write(html)

Example of passing the cookies argument to the request method:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
Cookie = 'anonymid=jr2yoyv4-l71jfd; depovince=SD; _r01_=1; JSESSIONID=abc5vZNDY5GXOfh79uKHw; ick_login=8e8d2154-31f7-47d6-afea-cd7da7f60cd7; ick=47b5a827-ecaf-4b44-ab57-c433e8f73b67; first_login_flag=1; ln_uact=13654252805; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; jebe_key=cf5d55ad-7eb1-4b50-848a-25d4c8081154%7C07a531353a345fda40d3ab252602e2f6%7C1547871575690%7C1%7C1547871574048; wp_fold=0; jebecookies=9724d2c7-5e9c-4be9-92d0-bf9b6dffd455|||||; _de=D0539E08F82219B3A527C713E360D2ED; p=7f8736045559e52d93420c14f063d70e4; t=522278d3c40436e9d5e7b3dc2650e55a4; societyguester=522278d3c40436e9d5e7b3dc2650e55a4; id=969487434; ver=7.0; xnsid=9ba2d506; loginfrom=null'
# Dict comprehension; split on the first "=" only, because a cookie value may itself contain "="
cookies = {i.split("=", 1)[0]: i.split("=", 1)[1] for i in Cookie.split("; ")}
response = requests.get('http://www.renren.com/969487434/profile', headers=headers, cookies=cookies)
html = response.content.decode()

with open('renren3.html', 'w', encoding='utf-8') as f:
    f.write(html)

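2.4 Finding the login POST URL

Inspect the HTML page and look for the URL in the action attribute of the form
- The POST data is a dictionary whose keys are the name attributes of the input tags and whose values are the real username and password; the POST URL is the URL given by action
Capture the traffic and look for the login URL
- Tick the Preserve log checkbox so the request is not lost when the page redirects
- Find the POST data and determine the parameters
  - The parameters do not change: use them directly, for example when the password is not dynamically encrypted
  - The parameters change
    - The parameters are in the current response
    - The parameters are generated by JS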
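
The first approach (reading action and the input names out of the form) can be sketched with the standard library alone. Everything here is an assumption made for illustration: the HTML snippet, the page URL, the token field and the LoginFormParser class are invented and do not describe any real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LoginFormParser(HTMLParser):
    """Collects the action URL and input names of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self._done:
            self._in_form = True
            self.action = attrs.get('action')
        elif tag == 'input' and self._in_form and attrs.get('name'):
            # Keep any preset value, such as a hidden token
            self.fields[attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form' and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical login page, not taken from any real site
page_url = 'http://example.com/login'
html = '''
<form action="/PLogin.do" method="post">
  <input type="text" name="email">
  <input type="password" name="password">
  <input type="hidden" name="token" value="abc123">
</form>
'''

parser = LoginFormParser()
parser.feed(html)

# The POST URL is the form's action, resolved against the page URL
post_url = urljoin(page_url, parser.action)
# The POST data uses the input names as keys; fill in the real credentials
data = dict(parser.fields, email='user@example.com', password='secret')

print(post_url)
print(data)
```

A hidden field such as token is an example of a parameter that changes but is present in the current response, so it can be read from the page and sent back unchanged.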

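2.5 Locating the relevant JS

Select the button that triggers the JS event, open Event Listeners, and find where the JS is
Use Search all files in Chrome to search for keywords from the URL
Set breakpoints to see what the JS does, then reproduce the same operations in Python

2.6 requests tips

requests.utils.dict_from_cookiejar converts a cookie object into a dictionary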

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Cookie": "anonymid=jr2yoyv4-l71jfd; depovince=SD; _r01_=1; JSESSIONID=abc5vZNDY5GXOfh79uKHw; ick_login=8e8d2154-31f7-47d6-afea-cd7da7f60cd7; ick=47b5a827-ecaf-4b44-ab57-c433e8f73b67; first_login_flag=1; ln_uact=13654252805; ln_hurl=http://head.xiaonei.com/photos/0/0/men_main.gif; jebe_key=cf5d55ad-7eb1-4b50-848a-25d4c8081154%7C07a531353a345fda40d3ab252602e2f6%7C1547871575690%7C1%7C1547871574048; wp_fold=0; jebecookies=9724d2c7-5e9c-4be9-92d0-bf9b6dffd455|||||; _de=D0539E08F82219B3A527C713E360D2ED; p=7f8736045559e52d93420c14f063d70e4; t=522278d3c40436e9d5e7b3dc2650e55a4; societyguester=522278d3c40436e9d5e7b3dc2650e55a4; id=969487434; ver=7.0; xnsid=9ba2d506; loginfrom=null"
}
response = requests.get('http://www.renren.com/969487434/profile', headers=headers)
print(response.cookies)
print(requests.utils.dict_from_cookiejar(response.cookies))
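
The conversion also works in the other direction with requests.utils.cookiejar_from_dict, and both can be tried without a network request. A small sketch; the two cookie values are picked from the cookie string above purely as sample data.

```python
import requests

# Sample cookie values for illustration
cookie_dict = {"id": "969487434", "ver": "7.0"}

# dict -> RequestsCookieJar
jar = requests.utils.cookiejar_from_dict(cookie_dict)
print(type(jar).__name__)

# RequestsCookieJar -> dict
restored = requests.utils.dict_from_cookiejar(jar)
print(restored == cookie_dict)
```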

SSL certificate verification (verify=False skips the check)

response = requests.get("https://www.12306.cn/mormhweb/", verify=False)

Setting a timeout

response = requests.get("https://www.baidu.com", timeout=2)

Checking the status code to confirm the request succeeded

assert response.status_code == 200
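
Since assert statements are skipped when Python runs with -O, response.raise_for_status() may be a sturdier way to make the same check. The sketch below builds a bare Response object by hand, which is an assumption made so the example runs offline; in real code the response comes from requests.get() or requests.post().

```python
import requests

# A hand-built Response so that the example needs no network access
response = requests.Response()
response.status_code = 404

try:
    # Raises requests.HTTPError for 4xx and 5xx status codes, does nothing otherwise
    response.raise_for_status()
    succeeded = True
except requests.HTTPError as exc:
    succeeded = False
    print('request failed:', exc)

print(succeeded)
```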

Example code (usable as a utility module):

import requests
from retrying import retry

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}


@retry(stop_max_attempt_number=3)
def _request_url(url, method, data):
    # Tried up to 3 times: any exception, including a failed assert, triggers a retry
    if method == 'POST':
        response = requests.post(url, headers=headers, data=data, timeout=3)
    else:
        response = requests.get(url, headers=headers, params=data, timeout=3)
    assert response.status_code == 200
    return response.content.decode()


def request_url(url, method='GET', data=None):
    try:
        html = _request_url(url, method, data)
    except Exception:
        # Every attempt failed
        html = None
    return html


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    print(request_url(url))
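
If the third-party retrying package is not available, the retry behaviour can be approximated with a small decorator. This is a hypothetical stand-in written for illustration, not the retrying library's implementation, and the flaky function only imitates a server that fails twice before answering.

```python
import functools


def retry(max_attempts=3):
    """Hypothetical stand-in for retrying.retry: re-run the function on any exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    last_error = exc
            # Every attempt failed: re-raise the last error
            raise last_error
        return wrapper
    return decorator


calls = {"count": 0}


@retry(max_attempts=3)
def flaky():
    # Fails twice, then succeeds, to imitate a temporarily unreachable server
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky(), calls["count"])
```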

2.7 Web client authentication

If the site uses web client authentication, add auth=(username, password)

import requests
auth = ('test', '123456')
response = requests.get('http://192.168.199.107', auth=auth)
print(response.text)
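
What the auth tuple does can be checked without a server: requests treats a two-item tuple as HTTP Basic Auth and turns it into an Authorization header. A minimal sketch that prepares the request without sending it, reusing the sample credentials from above.

```python
import base64

import requests

# prepare() builds the request without sending it, so no server is needed
prepared = requests.Request('GET', 'http://192.168.199.107', auth=('test', '123456')).prepare()
auth_header = prepared.headers['Authorization']
print(auth_header)

# The header is "Basic " followed by base64("username:password")
expected = 'Basic ' + base64.b64encode(b'test:123456').decode('ascii')
print(auth_header == expected)
```

Base64 is an encoding, not encryption, so credentials sent this way over plain http are readable by anyone who can see the traffic.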