Introduction to Web Crawlers and the requests Module

Date: 2019-12-10


1. Introduction to Web Crawlers

Overview

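In recent years, as web applications have grown in reach and depth, obtaining data from the web efficiently has become a goal for countless companies and individuals. In the era of big data, whoever holds more data stands to gain more, and web crawlers are one of the most commonly used means of collecting data from the web.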

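A web crawler, also called a web spider, is aptly named. If the Internet is pictured as a spider's web, the spider is the program crawling back and forth across it. A web spider finds pages by following link addresses: starting from one page of a site, it reads the page content, finds the other links in that page, and follows those links to the next pages, repeating the loop until every page of the site has been fetched.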
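The link-following loop described above can be sketched as a breadth-first traversal. This is only a minimal illustration, not a production crawler: the names LinkParser, crawl and FAKE_SITE are hypothetical, and the page-fetching function is passed in as a parameter so that the sketch runs offline against an in-memory site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of one site; fetch(url) must return HTML text."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            link = urljoin(url, href)
            # Stay on the starting site and never queue a page twice.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A hypothetical three-page site held in memory (assumption for the demo).
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="http://other.example/">out</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

print(crawl("http://example.com/", lambda url: FAKE_SITE.get(url, "")))
```

For a real site, the fetch argument could be something like lambda url: requests.get(url, timeout=10).text, using the requests module introduced in section 2.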

The Value of Crawlers

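The most valuable thing on the Internet is data: product listings on Tmall, rental listings on Lianjia, securities and investment information on Xueqiu, and so on. Such data represents real money in each industry; it is fair to say that whoever holds the first-hand data of an industry comes to dominate it. If all the data on the Internet is compared to a treasure trove, this crawler course teaches you how to mine it efficiently. Once you have mastered crawling, you are in effect the boss behind every Internet information company: they are all supplying you with valuable data for free.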

The robots.txt Protocol

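If you do not want crawlers to fetch the data on certain pages of your own site, you can write a robots.txt file to restrict what they crawl. For the format, look at Taobao's robots file (www.taobao.com/robots.txt). Note, however, that the protocol is only the equivalent of a verbal agreement; no technical mechanism enforces it, so it stops well-behaved crawlers but not ill-behaved ones. The crawlers we write while learning can set the robots protocol aside for now.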
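For a crawler that does want to honor the protocol, Python's standard library includes urllib.robotparser. The sketch below parses a small robots.txt given as a string; the file content, the user-agent name "my-crawler" and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
print(parser.can_fetch("my-crawler", "http://example.com/index.html"))
print(parser.can_fetch("my-crawler", "http://example.com/private/data.html"))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.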

The Basic Crawler Workflow

2. The requests Module

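requests is an HTTP library written in Python on top of urllib3 and released under the Apache 2.0 open-source license. It is more convenient than urllib and saves a great deal of work. In short, requests is the simplest and most user-friendly HTTP library for Python, and it is the recommended choice for crawlers. A default Python installation does not include the requests module; it has to be installed separately with pip.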

2.1 Basic Syntax

Request methods supported by the requests module

import requests
requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

GET Requests

1. Basic request

import requests
response = requests.get('https://www.jd.com/')
# Save the raw response bytes to a local file
with open("jd.html", "wb") as f:
    f.write(response.content)

2. Request with parameters

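import requests
# Query string written directly into the URL (手机 means "mobile phone")
response = requests.get('https://s.taobao.com/search?q=手机')
# Query passed through the params argument; requests builds the query string
response = requests.get('https://s.taobao.com/search', params={"q": "美女"})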
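To see what a query parameter looks like once it has been encoded into the URL, the standard library's urlencode can be used as a stand-in. This is only an approximation of what requests does internally, shown so it can be run without network access; in a real request the final URL is available as response.url.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://s.taobao.com/search"
params = {"q": "手机"}

# Non-ASCII values are UTF-8 encoded and then percent-escaped.
full_url = base_url + "?" + urlencode(params)
print(full_url)

# Decoding the query string recovers the original value.
print(parse_qs(urlsplit(full_url).query))
```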

3. Request with headers

import requests
response = requests.get(
    'http://dig.chouti.com/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
    },
)

4. Request with cookies

import uuid
import requests
url = 'http://httpbin.org/cookies'
cookies = dict(sbid=str(uuid.uuid4()))
res = requests.get(url,cookies=cookies)
print(res.text)

POST Requests

1. The data parameter

requests.post() is used in exactly the same way as requests.get(); what sets it apart in practice is the data parameter, which carries the request body.

response = requests.post("http://httpbin.org/post",params={"a":"10"},data={"name":"yuan"})

2. Sending JSON data

import requests
res1 = requests.post(url='http://httpbin.org/post', data={'name': 'yuan'})
# No Content-Type specified; with data= the default is application/x-www-form-urlencoded
print(res1.json())
res2 = requests.post(url='http://httpbin.org/post', json={'age': '22'})
# With json= the default Content-Type is application/json
print(res2.json())
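The difference between the two body formats can be previewed offline. The sketch below uses only the standard library to produce a form-encoded body and a JSON body from the same dictionary; it approximates what data= and json= send and does not call requests itself.

```python
import json
from urllib.parse import urlencode

payload = {"name": "yuan"}

# Roughly what data=payload sends (application/x-www-form-urlencoded).
form_body = urlencode(payload)
print(form_body)

# Roughly what json=payload sends (application/json).
json_body = json.dumps(payload)
print(json_body)
```

In the httpbin.org response, the first kind of body shows up under the "form" key and the second under the "json" key.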

The Response Object

(1) Common attributes

import requests
response=requests.get('http://sh.lianjia.com/ershoufang/')
# Commonly used attributes of the response object
print(response.text)
print(response.content)
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)
print(response.encoding)

(2) Encoding issues

import requests
response = requests.get('http://www.autohome.com/news')
# response.text is the body decoded using response.encoding
with open("res.html", "w") as f:
    f.write(response.text)
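The encoding problem arises when the bytes of a page are decoded with the wrong character set. The sketch below reproduces it with plain bytes, assuming a page encoded as GBK that gets decoded as ISO-8859-1 (the fallback requests uses for text content when the server declares no charset). The sample string is made up for the demonstration.

```python
# Bytes as a GBK-encoded page would deliver them (sample text is an assumption).
raw = "汽车之家".encode("gbk")

# Decoding with the wrong charset yields unreadable characters.
wrong = raw.decode("iso-8859-1")
print(repr(wrong))

# Decoding with the right charset recovers the text.
right = raw.decode("gbk")
print(right)

# With requests, the equivalent fix is to set the encoding before reading
# response.text, for example:
#     response.encoding = response.apparent_encoding
```

When writing the decoded text to disk, passing an explicit encoding such as open("res.html", "w", encoding="utf-8") avoids depending on the platform default.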

(3) Downloading binary files (images, video, audio)

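import requests
response = requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg')
# The URL points to a JPEG, so save it with a .jpg extension
with open("res.jpg", "wb") as f:
    # Write the body in chunks rather than one byte at a time
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)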

(4) Parsing JSON data

import requests
import json
response=requests.get('http://httpbin.org/get')
res1=json.loads(response.text)
res2=response.json()
print(res1==res2)

(5) Redirection and history

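By default, requests follows all redirects automatically for every method except HEAD. The history attribute of the response object can be used to trace them: response.history is a list of the Response objects that were created in order to complete the request, sorted from the oldest response to the most recent.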

r = requests.get('http://github.com')
r.url
r.status_code
r.history

Redirect handling can also be disabled with the allow_redirects parameter:

r = requests.get('http://github.com', allow_redirects=False)
r.status_code
r.history
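A small helper can make the redirect chain easier to read. The function name redirect_chain is hypothetical, and the demonstration uses stand-in objects with made-up status codes and URLs instead of a live request; any object with status_code, url and history attributes, such as a requests Response, should work the same way.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for each hop, oldest first, ending with the final response."""
    hops = list(response.history) + [response]
    return [(r.status_code, r.url) for r in hops]


# Stand-in objects with hypothetical values, so the sketch runs offline.
first = SimpleNamespace(status_code=301, url="http://github.com/", history=[])
final = SimpleNamespace(status_code=200, url="https://github.com/", history=[first])

print(redirect_chain(final))
```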

2.2 Advanced requests Usage

Proxies

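Some sites deploy anti-crawler measures. Many, for example, count how often a given IP address visits within a period of time; if the requests come too fast to look like a normal visitor, the site may ban that IP. We therefore set up a number of proxy servers and switch to a different one at intervals, so that even if one IP is banned, crawling can continue from another.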

res = requests.get('http://httpbin.org/ip', proxies={'http': 'http://110.83.40.27:9999'}).json()
print(res)
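Switching proxies at intervals can be sketched with a simple round-robin rotation. The pool below uses made-up addresses from a documentation-only IP range, and proxy_rotation is a hypothetical helper name; a real pool would need working proxies and probably some handling for ones that fail.

```python
from itertools import cycle

# Hypothetical proxy addresses (documentation-only IP range).
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(pool):
    """Yield a proxies mapping for each proxy in turn, looping forever."""
    for proxy in cycle(pool):
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXY_POOL)
for _ in range(4):
    # The fourth mapping wraps around to the first proxy again.
    print(next(rotation))
```

Each mapping has the shape that the proxies argument expects, so a request could be sent as requests.get(url, proxies=next(rotation), timeout=10).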

2.3 A Crawler Example

import requests
import re

# Step 1: request the login page to get the token needed to validate the POST
session = requests.session()
res = session.get("https://github.com/login")
authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', res.text)[0]
print(authenticity_token)

# Step 2: build the POST data
data = {
    "login": "yuanchenqi0316@163.com",
    "password": "yuanchenqi0316",
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": authenticity_token,
}
# The session already carries the cookies set in step 1, so none are passed explicitly
res = session.post("https://github.com/session", data=data)
with open("github.html", "wb") as f:
    f.write(res.content)
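Indexing the result of re.findall with [0] raises an IndexError when the login page contains no matching field. A slightly more defensive extraction might look like the sketch below. The helper name extract_token and the sample HTML are assumptions for illustration, and the attribute order on the real page may differ from the pattern.

```python
import re

# Same pattern as in the example above, compiled once.
TOKEN_PATTERN = re.compile(r'name="authenticity_token" value="(.*?)"')


def extract_token(html):
    """Return the first authenticity_token value, or None if none is found."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical snippet of a login form.
SAMPLE_HTML = '<input type="hidden" name="authenticity_token" value="abc123==" />'

print(extract_token(SAMPLE_HTML))
print(extract_token("<html></html>"))
```

The caller can then stop early with a clear message when the result is None instead of failing on a list index.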