requests is an HTTP library written in Python on top of urllib, released under the Apache2 License. It is more convenient than urllib and saves a great deal of work.
requests is the simplest and most user-friendly HTTP library implemented in Python, and it is the recommended choice for writing crawlers. It is not bundled with a default Python installation and must be installed separately via pip.
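After running `pip install requests`, a minimal stdlib-only check (a sketch, making no assumptions about your environment) confirms whether the module is importable:

```python
import importlib.util

# Look up the requests package on the import path without importing it
spec = importlib.util.find_spec("requests")
print("requests installed:", spec is not None)
```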
import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
1 Basic requests
res = requests.get('https://www.jd.com/')
with open("jd.html", 'wb') as f:
    f.write(res.content)
2 Requests with parameters
The params argument holds the key-value pairs that go after the '?' in the URL.
res = requests.get('https://list.tmall.com/search_product.html')
res = requests.get('https://list.tmall.com/search_product.htm', params={"q": "手机"})
with open('tao_bao.html', 'wb') as f:
    f.write(res.content)
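Under the hood, the params dict is URL-encoded and appended after the '?'. A stdlib sketch of the same encoding (the query value mirrors the example above):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build the query string the way requests does for params={"q": "手机"}
base = 'https://list.tmall.com/search_product.htm'
query = urlencode({'q': '手机'})  # non-ASCII values are percent-encoded as UTF-8
url = base + '?' + query
print(url)

# Round-trip: parsing the query string recovers the original value
assert parse_qs(urlsplit(url).query)['q'] == ['手机']
```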
3 Requests with custom headers
res = requests.get("https://dig.chouti.com/", headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
})
with open('chouti.html', 'wb') as f:
    f.write(res.content)
4 Requests with cookies
import uuid

res = requests.get("http://httpbin.org/cookies",
                   cookies={'sbid': str(uuid.uuid4()), 'a': '1'})
print(res.text)
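str(uuid.uuid4()) is used above to generate a random, effectively unique cookie value on every run. A quick offline illustration of what it produces:

```python
import uuid

# Each call yields a fresh random 128-bit identifier
a = str(uuid.uuid4())
b = str(uuid.uuid4())
print(a)

# UUID4 strings are 36 characters: 32 hex digits plus 4 hyphens
assert len(a) == 36 and a.count('-') == 4
assert a != b  # collisions are astronomically unlikely
```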
5 The session object
session = requests.session()
# A Session persists cookies across requests; the URLs below are placeholders
# (requests needs absolute URLs, so substitute your target site's endpoints)
session.post('https://example.com/login/')  # cookies set by the server are stored
session.get('https://example.com/index/')   # and sent back automatically here
requests.post() is used exactly like requests.get(); the difference is that requests.post() takes an extra data parameter.
1 The data parameter
Holds the request body. The default Content-Type is application/x-www-form-urlencoded, and httpbin echoes such a body under the 'form' key of its response.
res = requests.post('http://httpbin.org/post', params={'a': '10'}, data={'name': 'ethan'})
print(res.text)
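When data= is a dict, requests form-encodes it into the request body. The same encoding with the stdlib (a sketch of what is sent, not a full request):

```python
from urllib.parse import urlencode

# The body that data={'name': 'ethan'} produces
body = urlencode({'name': 'ethan'})
print(body)  # name=ethan
```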
2 Sending JSON data
With json=, the body is serialized as JSON; httpbin echoes the raw body under the 'data' key and the parsed object under the 'json' key.
res1 = requests.post('http://httpbin.org/post', data={'name': 'ethan'})
# with data= and no explicit headers, Content-Type defaults to application/x-www-form-urlencoded
print(res1.json())
res2 = requests.post('http://httpbin.org/post', json={'age': '24'})
# with json=, requests sets Content-Type: application/json automatically
print(res2.json())
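The difference between data= and json= comes down to how the dict is serialized into the body; a stdlib comparison (values mirror the examples above):

```python
import json
from urllib.parse import urlencode

payload = {'age': '24'}
form_body = urlencode(payload)   # what data=payload sends (x-www-form-urlencoded)
json_body = json.dumps(payload)  # what json=payload sends (application/json)
print(form_body)
print(json_body)

assert form_body == 'age=24'
assert json.loads(json_body) == payload
```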
Many sites deploy anti-crawler measures. A common one is counting how often a given IP visits within a time window; if the access rate looks too fast to be a normal visitor, that IP may be banned. The workaround is to route requests through proxy servers and rotate them periodically: even if one IP gets banned, you can switch to another and keep crawling.
res = requests.get('http://httpbin.org/ip',
                   proxies={'http': 'http://110.83.40.27:9999'}).json()
print(res)
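The proxies argument maps URL schemes to proxy addresses, and including the scheme prefix in the proxy URL is recommended. A sketch of the structure (the address is a hypothetical example, not a working proxy):

```python
# Keys select which request scheme goes through which proxy
proxies = {
    'http': 'http://110.83.40.27:9999',   # hypothetical HTTP proxy
    'https': 'http://110.83.40.27:9999',  # HTTPS traffic may share the same proxy
}
print(sorted(proxies))
```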
1 Common attributes
import requests

respone = requests.get('https://sh.lianjia.com/ershoufang/')
# response attributes
print(respone.text)               # body decoded as text
print(respone.content)            # raw bytes
print(respone.status_code)
print(respone.headers)
print(respone.cookies)
print(respone.cookies.get_dict())
print(respone.cookies.items())
print(respone.url)
print(respone.history)            # redirect chain
print(respone.encoding)
2 Encoding issues
When the response headers do not declare a charset, requests falls back to ISO-8859-1 for res.text.
res = requests.get('https://www.autohome.com.cn/beijing/')
# The page is GB2312-encoded; without setting the encoding, the Chinese text in res.text is garbled

# Approach 1: write the raw bytes unchanged
with open('autohome.html', 'wb') as f:
    f.write(res.content)

# Approach 2: decode the body as GBK, then write the text back out in GBK
res.encoding = 'gbk'
with open("autohome.html", 'w', encoding='gbk') as f:
    f.write(res.text)
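Why the wrong codec produces mojibake can be seen offline: the same bytes decode differently under different codecs (a minimal sketch using GBK, as in the example above):

```python
# Simulate a GBK-encoded response body
raw = '中文'.encode('gbk')
print(raw)                    # b'\xd6\xd0\xce\xc4'
print(raw.decode('gbk'))      # correct codec: 中文
print(raw.decode('latin-1'))  # wrong codec: ÖÐÎÄ (mojibake)
```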
3 Downloading binary files (images, video, audio)
res = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1551350578249&di=23ff7cbf4b8b47fe212e67ba3aab3267&imgtype=0&src=http%3A%2F%2Fimg.hx2cars.com%2Fupload%2Fnewimg2%2FM03%2FA9%2F03%2FClo8xFklT1GAU059AAR1t2rZPz4517_small_800_600.jpg')
with open('c180.jpg', 'wb') as f:
    # f.write(res.content)  # or write everything at once
    # iter_content yields the body in chunks (pass chunk_size to control the size;
    # for large files, also pass stream=True to requests.get to avoid loading it all into memory)
    for chunk in res.iter_content(chunk_size=1024):
        f.write(chunk)
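The chunked-write loop can be exercised offline by streaming from an in-memory buffer instead of a live response (a sketch; the read loop plays the role of iter_content here):

```python
import io
import os
import tempfile

# Stand-in for a binary response body
fake_body = io.BytesIO(os.urandom(4096))

path = os.path.join(tempfile.mkdtemp(), 'c180.jpg')
with open(path, 'wb') as f:
    # Read and write in fixed-size chunks, as iter_content(chunk_size=1024) would
    for chunk in iter(lambda: fake_body.read(1024), b''):
        f.write(chunk)

print(os.path.getsize(path))  # 4096
```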
4 Parsing JSON data
res = requests.get('http://httpbin.org/get')
print(res.text)
print(type(res.text))    # str
print(res.json())
print(type(res.json()))  # dict
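res.json() is essentially json.loads applied to the body text; the offline equivalent on an httpbin-shaped sample:

```python
import json

# A trimmed-down example of the body http://httpbin.org/get returns
sample = '{"args": {}, "url": "http://httpbin.org/get"}'
parsed = json.loads(sample)  # what res.json() does with res.text
print(type(parsed))          # <class 'dict'>
print(parsed['url'])
```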
5 Redirection and history
By default, requests follows all redirects automatically, except for requests.head. The redirect chain can be inspected through the response object's history attribute.
response.history is a list of Response objects that were created in the course of fulfilling the request, ordered from the oldest response to the most recent.
res = requests.get('http://www.jd.com/')
print(res.history)      # [<Response [302]>]
print(res.text)
print(res.status_code)  # 200
Redirect handling can be disabled with the allow_redirects parameter:
res = requests.get('http://www.jd.com/', allow_redirects=False)
print(res.history)      # []
print(res.status_code)  # 302
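What the client does at each hop of res.history can be sketched offline: take the Location header of the 3xx response and resolve it against the request URL (the header values below are illustrative assumptions, not captured traffic):

```python
from urllib.parse import urljoin

# Hypothetical 302 response: Location may be absolute or relative
request_url = 'http://www.jd.com/'
location = 'https://www.jd.com/'  # typical http-to-https upgrade redirect
next_url = urljoin(request_url, location)
print(next_url)

# A relative Location resolves against the original URL
assert urljoin(request_url, '/index.html') == 'http://www.jd.com/index.html'
```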