爬虫基础之requests模块

时间 2019-12-01

标签爬虫基础 requests 模块栏目网络爬虫繁體版

原文原文链接

1. 爬虫简介

1.1 概述

网络爬虫（又被称为网页蜘蛛，网络机器人，在FOAF社区中间，更常常的称为网页追逐者），是一种按照必定的规则，自动地抓取万维网信息的程序或者脚本。html

1.2 爬虫的价值

在互联网的世界里最有价值的即是数据, 谁掌握了某个行业的行业内的第一手数据, 谁就是该行业的主宰. 掌握了爬虫技能，你就成了全部互联网信息公司幕后的老板，python

换言之，它们都在免费为你提供有价值的数据。git

1.3 robots.txt协议

若是本身的门户网站中的指定页面中的数据不想让爬虫程序爬取到的话，那么则能够经过编写一个robots.txt的协议文件来约束爬虫程序的数据爬取。robots协议的编写github

格式能够观察淘宝网的robots（访问www.taobao.com/robots.txt便可）。可是须要注意的是，该协议只是至关于口头的协议，并无使用相关技术进行强制管制，因此正则表达式

该协议是防君子不防小人。redis

1.4 爬虫的基本流程

发送请求: 经过相关模块或者库如浏览器通常向目标站点发送请求, 即一个request, 请求能够携带headers和参数等信息, 而后等待服务器响应
获取响应: 服务器正常响应, 会返回一个response, 即页面内容, 类型多是html, json或者二进制数据(音频视频图片等)
解析数据: 响应的字符串能够经过正则表达式或者BeautifulSoup, xpath等解析器提炼出咱们须要的数据
存储数据: 将解析出来的数据进行持久化保存, 能够存储到文件中, 也能够存储到redis, mondodb等数据库中

2 requests模块

Requests是用python语言基于urllib编写的, 采用的是Apache2 Licensed开源协议的HTTP库, Requests它会比urllib更加方便, 能够节约咱们大量的工做. 一句话,数据库

requests是python实现的最简单易用的HTTP库, 建议爬虫使用requests库. 默认安装好python以后, 是没有安装requests模块的, 须要单独经过pip安装.json

2.1 基本语法

requests模块支持的请求

import requests
requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

get请求

1. 基本请求浏览器

import requests
response=requests.get('https://www.jd.com/',)
 
with open("jd.html","wb") as f:
    f.write(response.content)

2. 含参数请求服务器

import requests
response=requests.get('https://s.taobao.com/search?q=手机')
response=requests.get('https://s.taobao.com/search',params={"q":"美女"}):
    f.write(res.content)

3. 含请求头请求

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}
#反爬机制: UA反爬
res = requests.get("https://www.baidu.com/s",params={"wd":"刘亦菲"}, headers=headers)
with open("jay.html","wb") as f:
    f.write(res.content)

4. 含cookies请求

import uuid
import requests

url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))

res = requests.get(url, cookies=cookies)
print(res.text)

post请求

1 data参数

requests.post()用法与requests.get()彻底一致，特殊的是requests.post()多了一个data参数，用来存放请求体数据

import requests
res = requests.post("http://httpbin.org/post",params={"a":"10"}, data={"name":"Alex"} )
# 没有指定请求头 #默认的请求头: application/x-www-form-urlencoded
print(res.text)

2. 发送json数据

import requests
# 发送json数据
res = requests.post(url="http://httpbin.org/post", json={"age":"38"})
# # 默认的请求头: application/json
print(res.text)
# 同时有data和json参数时 data参数优先

session对象

import requests

# 1 建立session对象 使用方法和requests.get() .post 一致
session = requests.session()
res1 = session.get("https://github.com/login")
#后面的访问都会带着获取的session去访问
res2 = session.post("https://github.com/session")

response对象

1. 常见属性

import requests
respone=requests.get('https://sh.lianjia.com/ershoufang/')
# respone属性
print(respone.text)
print(respone.content)
print(respone.status_code)
print(respone.headers)
print(respone.cookies)
print(respone.cookies.get_dict())
print(respone.cookies.items())
print(respone.url)
print(respone.history)
print(respone.encoding)

2. 编码问题

import requests
res = requests.get("http://www.autohome.com/news")
# res.encoding = "gbk"
# with open("autohome.html","w") as f:
#     f.write(res.text)
#汽车之家网站返回的页面内容为gb2312编码的，而requests的默认编码为ISO-8859-1，若是不设置成gbk则中文乱码
# 或者也能够以"wb" 模式写入文件
with open("autohome.html","wb") as f:
    f.write(res.content)

3. 下载二进制文件(音频,视频,图片)

import requests
res = requests.get("https://video.pearvideo.com/mp4/adshort/20190222/cont-1520612-13609117_adpkg-ad_hd.mp4")
with open("lsp.mp4","wb") as f:
    # f.write(res.content)
    # 好比下载视频时,若是视频100G,用response.content而后一会儿写到文件中是不合理的
    for line in res.iter_content():
        f.write(line)

4. 解析json数据

import requests
import json
 
response=requests.get('http://httpbin.org/get')
res1=json.loads(response.text) #太麻烦
res2=response.json() #直接获取json数据
print(res1==res2

5. Redirection and History

默认状况下, 除了head, requests会自动处理全部重定向. 可使用响应对象的history方法来追踪重定向. Response.history是一个Response对象的列表,

为了完成请求而建立了这些对象. 这个对象列表按照从时间前后顺序进行排序.

import requests

res = requests.get("http://www.jd.com")
print(res.url)
# https://www.jd.com/
print(res.status_code)
# 200
print(res.history)
# [<Response [302]>]

另外, 能够经过allow_redirects 参数禁用重定向处理

import requests
res = requests.get("http://www.jd.com", allow_redirects=False)
print(res.status_code)
# 302
print(res.history)
# []

2.2 requests进阶用法

IP代理

一些网站会有相应的反爬虫措施，例如不少网站会检测某一段时间某个IP的访问次数，若是访问频率太快以致于看起来不像正常访客，它可能就会会禁止

这个IP的访问。因此咱们须要设置一些代理服务器，每隔一段时间换一个代理，就算IP被禁止，依然能够换个IP继续爬取。

res=requests.get('http://httpbin.org/ip', proxies={'http':'112.17.121.88:8060'}).json()
print(res)

免费代理

2.3 简单爬虫案例

爬取github的主页内容

import requests,re

# 1 请求获取token, 以便经过post请求校验
session = requests.session()
res = session.get("https://github.com/login")
token = re.findall('<input type="hidden" name="authenticity_token" value="(.*?)"',res.text)[0]
print(token)

# 2 构建post请求数据
data={
    "commit":"Sign in",
    "Sign in": "✓",
    "authenticity_token": token,
    "login": "aflychen",
    "password":"afly264028"
}
response = session.post("https://github.com/session",data=data)
with open("github.html","wb") as f:
    f.write(response.content)

简单的反爬策略

UA反爬: 在request请求头中有一个User-Agent的参数, 服务器经过判断请求是否携带此参数, 来判断访问端是人仍是机器.

token反爬: 登陆校验是判断请求的formdata中是否含有此数据来判断请求是否合法.