Web Crawlers: The Essential requests Module

Date: 2019-11-17

I. The requests module

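What is the requests module?
- requests is a third-party Python library for sending HTTP requests. Its main purpose is to simulate a browser issuing requests. It is powerful, concise, and efficient to use, which is why it is one of the most widely used libraries for writing web crawlers.
Why use the requests module?
- Working with the urllib module is inconvenient in several ways:
  - URL encoding must be handled manually
  - POST request parameters must be handled manually
  - Handling cookies and proxies is tedious
  - ......
- With the requests module:
  - URL encoding is handled automatically
  - POST request parameters are handled automatically
  - Cookie and proxy handling is simplified
  - ......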
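To make the URL-encoding difference concrete, here is a minimal sketch that builds the same query string twice: once by hand with `urllib.parse`, and once by handing a dict to requests. The keyword `café` is only an illustrative value, and no request is actually sent, because `requests.Request(...).prepare()` just builds the request object.

```python
from urllib.parse import urlencode

import requests

base_url = 'https://www.sogou.com/web'
params = {'query': 'café', 'ie': 'utf-8'}  # illustrative values

# urllib style: encode the parameters and join them to the URL by hand
manual_url = base_url + '?' + urlencode(params)

# requests style: pass the dict and let the library do the encoding
prepared = requests.Request('GET', base_url, params=params).prepare()

print(manual_url)
print(prepared.url)
```

Both lines print the same percent-encoded URL; with requests the encoding step simply disappears from the calling code.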
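How to use the requests module
- Installation:
  - pip install requests
- Workflow:
  - Specify the URL
  - Send the request with the requests module
  - Extract the data from the response object
  - Persist the data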

II. Case studies

  1. Case 1: Scraping the Sogou results page for a given search term

    A GET request with the requests module

import requests

# Search keyword entered by the user
word = input('enter a word you want to search:')
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
# Target URL
url = 'https://www.sogou.com/web'
# Parameters carried by the GET request
params = {
    'query': word,
    'ie': 'utf-8'
}
# Send the request with the query parameters and the custom headers
response = requests.get(url=url, params=params, headers=headers)

# Get the response body as text
page_text = response.text

# Persist the page to a local file
with open('./sougou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

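  Disguising the identity of the request carrier:

- User-Agent: identifies the carrier of a request. When a request is sent from a browser, the carrier is the browser and the User-Agent is the browser's identifier; when a request is sent by a crawler, the carrier is the crawler program and the User-Agent identifies the crawler. A server can inspect this value to tell whether a request comes from a particular browser or from a crawler.
- Anti-crawling mechanism: some portal sites capture and check the User-Agent of incoming requests, and refuse to serve data if the UA belongs to a crawler.
- Counter-strategy: disguise the crawler's UA as the identifier of a real browser.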
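The effect of UA disguising can be inspected without sending anything. The sketch below, which assumes only the public `requests.utils.default_user_agent()` helper and `Session.prepare_request()`, compares the identifier requests announces by default with the header that goes out once a browser UA is supplied. `BROWSER_UA` is just a name chosen for this example.

```python
import requests

BROWSER_UA = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')

# What requests announces when no User-Agent is supplied
default_ua = requests.utils.default_user_agent()

session = requests.Session()
request = requests.Request('GET', 'https://www.sogou.com/web',
                           headers={'User-Agent': BROWSER_UA})
# prepare_request merges the session's default headers with ours; nothing is sent
prepared = session.prepare_request(request)

print(default_ua)
print(prepared.headers['User-Agent'])
```

The first line printed starts with `python-requests/`, which is exactly the kind of value a UA-based check can recognize as a crawler.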

  2. Case 2: Logging in to Douban Movies and scraping the page returned after a successful login

    A POST request with the requests module

import requests

url = 'https://accounts.douban.com/login'
# Form fields for the login request (replace the credentials with your own)
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "15027900535",
    "form_password": "bobo@15027900535",
    "login": "登陆",
}
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
# Send the POST request with the form data and the custom headers
response = requests.post(url=url, data=data, headers=headers)
page_text = response.text

# Persist the returned page to a local file
with open('./douban111.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

  3. Case 3: Scraping movie details from the Douban Movies category ranking list

    An Ajax GET request with the requests module

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests

if __name__ == "__main__":

    # URL of the Ajax GET request (found by capturing the browser's traffic)
    url = 'https://movie.douban.com/j/chart/top_list?'

    # Custom request headers; they must be wrapped in a dict
    headers = {
        # Customize the User-Agent; other header fields can be customized the same way
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }

    # Parameters carried by the GET request (copied from the packet capture tool)
    param = {
        'type': '5',
        'interval_id': '100:90',
        'action': '',
        'start': '0',
        'limit': '20'
    }
    # Send the GET request and get the response object
    response = requests.get(url=url, headers=headers, params=param)

    # Print the response body, which is a JSON string
    print(response.text)
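Since the response body is a JSON string, the next step is usually to turn it into Python objects, which `response.json()` does in one call. The sketch below shows the equivalent step offline with the standard `json` module. The sample payload is made up, and the field names `title` and `score` are assumptions for illustration, not a statement about what the Douban endpoint returns.

```python
import json

# Made-up sample shaped like a JSON array of movie objects;
# the field names "title" and "score" are assumptions for illustration.
sample_text = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'

# json.loads(text) yields the same objects response.json() would for this body
movies = json.loads(sample_text)
titles = [movie['title'] for movie in movies]
best = max(movies, key=lambda movie: float(movie['score']))

print(titles)
print(best['title'])
```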

  4. Case 4: Scraping KFC restaurant data for a given location from the restaurant finder

    An Ajax POST request with the requests module

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests

if __name__ == "__main__":

    # URL of the Ajax POST request (found by capturing the browser's traffic)
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

    # Custom request headers; they must be wrapped in a dict
    headers = {
        # Customize the User-Agent; other header fields can be customized the same way
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }

    # Form data carried by the POST request (copied from the packet capture tool)
    data = {
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': '1',
        'pageSize': '10'
    }
    # Send the POST request and get the response object
    response = requests.post(url=url, headers=headers, data=data)

    # Print the response body, which is a JSON string
    print(response.text)
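To see what "automatic handling of POST parameters" means here, the following sketch prepares the same request without sending it and inspects the result. It relies only on `requests.Request(...).prepare()`. The dict passed as `data` becomes a form-encoded body, and the matching `Content-Type` header is set for us.

```python
from urllib.parse import parse_qs

import requests

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
data = {'cname': '', 'pid': '', 'keyword': '北京', 'pageIndex': '1', 'pageSize': '10'}

# Build the request object only; no network traffic happens here
prepared = requests.Request('POST', url, data=data).prepare()

content_type = prepared.headers['Content-Type']
# Decode the body again to confirm it round-trips to the original fields
decoded = parse_qs(prepared.body, keep_blank_values=True)

print(content_type)
print(prepared.body)
print(decoded)
```

Note that the query string already present in the URL (`op=keyword`) is left untouched; only the `data` dict goes into the request body.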

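  5. Case 5: Scraping data on cosmetics production licenses of the People's Republic of China from the National Medical Products Administration

    A combined example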

import requests
from fake_useragent import UserAgent

# Pick a random browser User-Agent string
ua = UserAgent().random
headers = {
    'User-Agent': ua
}

url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
# Request pages 3 and 4 of the license list
for page in range(3, 5):
    data = {
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''
    }
    json_text = requests.post(url=url, data=data, headers=headers).json()
    all_id_list = []
    for item in json_text['list']:
        company_id = item['ID']  # used to fetch the second-level (detail) page
        # The following details can also be obtained from the detail page
        # name = item['EPS_NAME']
        # product = item['PRODUCT_SN']
        # man_name = item['QF_MANAGER_NAME']
        # d1 = item['XC_DATE']
        # d2 = item['XK_DATE']
        all_id_list.append(company_id)
    # This URL is also requested with an Ajax POST
    post_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
    for company_id in all_id_list:
        post_data = {
            'id': company_id
        }
        response = requests.post(url=post_url, data=post_data, headers=headers)
        # The server may answer with either text or JSON, so check Content-Type before parsing
        if 'application/json' in response.headers.get('Content-Type', ''):
            # Parse the JSON body
            detail = response.json()
            print(detail['businessPerson'])
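Checking the `Content-Type` header is what decides whether a response can safely be parsed as JSON. Servers differ in spacing, letter case, and charset suffixes, so a small helper that normalizes the header first can make the check less brittle. The function name `is_json_content_type` is hypothetical and not part of requests.

```python
def is_json_content_type(content_type):
    """Return True if a Content-Type header value denotes JSON.

    Only the media type before the first ';' is compared, ignoring
    case and surrounding whitespace.
    """
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'application/json'


print(is_json_content_type('application/json;charset=UTF-8'))
print(is_json_content_type('text/html; charset=utf-8'))
```

In the crawler it could be called as `is_json_content_type(response.headers.get('Content-Type', ''))`.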

III. Cookie handling with the requests module

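  Sometimes a crawler needs to scrape user-specific data (for example, Zhang San's personal profile page on Renren). In that case the plain requests calls used so far often fail to produce the expected result. For example: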

import requests

if __name__ == "__main__":

    # URL of Zhang San's profile page on Renren
    url = 'http://www.renren.com/289676607/profile'

    # Disguise the UA
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    }
    # Send the request and get the response object
    response = requests.get(url=url, headers=headers)
    # Write the response body to a file
    with open('./renren.html', 'w', encoding='utf-8') as fp:
        fp.write(response.text)

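  The file turns out to contain the Renren login page rather than Zhang San's profile page. Why? Let's first review what cookies are and what they do:

    - What a cookie is: when a user first visits a domain through a browser, the web server sends some data to the client in order to maintain state between the server and the client. That data is the cookie.

    - What cookies do: browsing constantly involves exchanging data, for example when you log in to a mailbox or a website. At that point we often tick "remember me for 30 days" or "log in automatically". How is that information recorded? The answer is the cookie. A cookie is set by the HTTP server and stored in the browser. HTTP is a stateless protocol: each exchange is independent, and once it finishes the server retains nothing that links the next request to the previous one. It is like shopping at a supermarket: without a loyalty card, the supermarket keeps no record of what we bought; once we have a loyalty card, it has our purchase history. The cookie is the loyalty card that stores the points, the goods are our information, the supermarket's system is the server backend, and HTTP is the transaction itself.

- With this background, it is clear why the example above returned the login page rather than Zhang San's profile page. So how can the profile page be scraped?

  Approach:

    1. Use the crawler to send the Renren login request and capture the cookie data returned for it.

    2. When requesting the profile page URL, carry the cookie from step 1. Only then can the server identify the user behind the request and return that user's profile page.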

import requests

# URL of the login request (found with a packet capture tool)
post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201873958471'
# Create a session object; it stores cookies from responses and sends them with later requests
session = requests.Session()
# Disguise the UA
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
formdata = {
    'email': '17701256561',
    'icode': '',
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '7b456e6c3eb6615b2e122a2942ef3845da1f91e3de075179079a3b84952508e4',
    'rkey': '44fd96c219c593f3c9612360c80310a3',
    'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Dm7m_NSUp5Ri_ZrK5eNIpn_dMs48UAcvT-N_kmysWgYW%26wd%3D%26eqid%3Dba95daf5000065ce000000035b120219',
}
# Log in through the session so that it keeps the cookies set by this response
session.post(url=post_url, data=formdata, headers=headers)

get_url = 'http://www.renren.com/960481378/profile'
# Request the profile page through the same session; the stored cookies are sent along
response = session.get(url=get_url, headers=headers)
# Set the encoding used to decode the response body
response.encoding = 'utf-8'
# Write the response body to a file
with open('./renren.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)
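Why the second request succeeds can be checked offline. In the sketch below a cookie is placed into a session's jar by hand, standing in for what a login response would set; the cookie name `sessionid` and its value are hypothetical. Preparing (not sending) a request through the session then shows which `Cookie` header would be attached.

```python
import requests

session = requests.Session()
# Pretend a login response set this cookie (name and value are hypothetical)
session.cookies.set('sessionid', 'abc123', domain='www.renren.com', path='/')


def cookie_header_for(url):
    """Return the Cookie header the session would attach to a GET of url."""
    prepared = session.prepare_request(requests.Request('GET', url))
    return prepared.headers.get('Cookie')


same_site = cookie_header_for('http://www.renren.com/960481378/profile')
other_site = cookie_header_for('http://example.com/')

print(same_site)
print(other_site)
```

The cookie is attached only to requests for the domain it belongs to, which is why a plain `requests.get` with no stored cookie gets the login page while the session-based request gets the profile page.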