Web Scraping + Data

Date: 2019-11-10


Web Scraping + Data - day01

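Launch: jupyter notebook
Introduction:
Anaconda is an integrated environment (data analysis + machine learning).
It ships with Jupyter, a browser-based interactive tool.
Basic Jupyter usage
Shortcuts:
Insert a cell: a (above), b (below)
Delete a cell: x
Run a cell: shift+enter
Switch cell mode: y (code), m (Markdown)
tab: auto-completion
Open the help docs: shift+tab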

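1. What is a crawler:

The process of writing a program that simulates a browser going online and fetches the needed data from the internet.

2. Types of crawlers:

General-purpose crawler: fetches the source of an entire page. Search engines (whose crawling system is an internally packaged set of crawler programs) mainly use this type.
Focused crawler: fetches only specified parts of a page.
Incremental crawler: monitors a site for data updates and fetches only the newly updated data.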

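3. Crawler safety and legality

Where the risks lie
- The crawler disrupts the normal operation of the visited site;
- The crawler fetches specific types of data or information protected by law.
How to avoid the risks
- Strictly follow the robots protocol set by the site;
- While working around anti-crawling measures, optimize your own code so it does not disrupt the normal operation of the visited site;
- When using or distributing fetched information, review the content; if it contains users' personal information, private data, or other parties' trade secrets, stop immediately and delete it.
Anti-crawling mechanisms: applied on the website side

Counter-anti-crawling strategies: applied in the crawler program

The first anti-crawling mechanism:

robots protocol: a plain-text protocol
- Characteristic: it stops the honest but not the dishonest (compliance is voluntary)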
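
Because the robots protocol is only plain text, honoring it is up to the crawler. The sketch below shows one way to check a URL against robots.txt rules with the standard library's urllib.robotparser. The rules, the crawler name "my-crawler", and the example.com URLs are made up for illustration; a real crawler would first download the site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the target site before requesting other pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
allowed_public = parser.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

A crawler that respects the protocol would skip any URL for which can_fetch returns False.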

4. http & https

What is the http protocol
- A format for exchanging data between a server and a client
https - the secure (encrypted) version of the http protocol

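Header information

I. General headers

The general header fields are those supported by both request and response messages.

Request URL: the URL being requested
Request Method: the request method, get/post/put/...
Status Code: the status code; 200 means the request succeeded
Remote Address: the IP address and port of the remote server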

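II. Request headers

1) Accept: tells the web server which media types the client accepts; */* means any type, type/* means all subtypes of that type.
2) Accept-Charset: the character sets the browser accepts.
   Accept-Encoding: the content encodings the browser accepts, usually compression methods (gzip, deflate).
3) Accept-Language: the languages the browser accepts. Language versus character set: Chinese is a language, and Chinese has several character sets, such as big5, gb2312, and gbk.
4) Authorization: after the client receives a WWW-Authenticate response from the web server, it uses this header to send its credentials back to the server.
5) Connection: whether a persistent connection is wanted. close (tells the web server or proxy to close the connection after responding to this request, without waiting for further requests on it). keep-alive (tells the web server or proxy to keep the connection open after responding and wait for further requests on it).
6) Referer: the URL of the page that issued the request, i.e. the page from which the browser obtained or clicked the URL now being requested.
7) User-Agent: identifies the client (which browser it is).
8) Host: the domain name (and port) of the server being requested.
9) Cache-Control: the caching rules the client asks caches to follow.
       no-cache (do not answer from a cached entity; fetch from the web server now)
       max-age: (accept only objects whose Age is less than max-age and that have not expired)
       max-stale: (accept expired objects, as long as they have been expired for less than max-stale)
       min-fresh: (accept cached objects whose freshness lifetime is greater than their current Age plus min-fresh)
10) Pragma: mainly used as Pragma: no-cache, which is equivalent to Cache-Control: no-cache.
11) Range: tells the web server which part of an object the client wants (used, for example, by Flashget for multi-threaded downloads).
12) From: a request header giving the e-mail address of the human user who controls the user agent.
13) Cookie: one of the most important request headers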

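III. Response headers

1) Age: when a proxy server answers a request from its own cache, this header states how much time has passed since the entity was generated.
2) Accept-Ranges: whether the web server accepts requests for part of an entity (such as part of a file). bytes: accepted; none: not accepted.
3) Cache-Control: the caching rules that caches (browsers and proxies) should follow for this response.
    public (the cached content may be used to answer any user)
    private (the cached content may only be used to answer the user who originally requested it)
    no-cache (may be cached, but must be revalidated with the web server before being returned to a client)
    max-age: (how long, in seconds, the object in this response stays fresh)
    no-store (must not be cached)
4) Connection: whether a persistent connection is used
        close (the connection has been closed).
        keep-alive (the connection is kept open, waiting for further requests on it).
        Keep-Alive: if the browser asked to keep the connection open, this header states how long (in seconds) the web server is willing to keep it open. Example: Keep-Alive: timeout=300
5) Content-Encoding: the compression method (gzip, deflate) the web server applied to the object in the response. Example: Content-Encoding: gzip
6) Content-Language: the language of the object in the response.
7) Content-Length: the length of the object in the response, in bytes. Example: Content-Length: 26012
8) Content-Range: which part of the whole object this response contains. Example: Content-Range: bytes 21010-47021/47022
9) Content-Type: the type of the object in the response. Example: Content-Type: application/xml
10) Expires: when the entity expires; an expired object may only be used to answer client requests after being revalidated with the web server.
11) Last-Modified: the time the web server considers the object to have last been modified, such as a file's last modification time or a dynamic page's last generation time.
12) Location: tells the browser that the requested object has moved and should be fetched from the location given in this header.
13) Proxy-Authenticate: the proxy server's response asking the browser for proxy credentials.
14) Server: the web server's software name and version.
15) Refresh: how long, in seconds, the browser should wait before refreshing the document.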
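
To make a few of these response headers concrete, here is a minimal sketch that parses a raw header block with the standard library's email parser (HTTP headers use the same "Name: value" layout). The header values are hypothetical, and parse_cache_control is an illustrative helper, not part of any library.

```python
from email import message_from_string

# Hypothetical raw response header block built from headers described above.
RAW_HEADERS = (
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Length: 26012\n"
    "Content-Encoding: gzip\n"
    "Cache-Control: public, max-age=3600\n"
    "\n"
)

def parse_cache_control(value):
    """Split a Cache-Control value into a dict of directive -> value (or True)."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.lower()] = arg if arg else True
    return directives

headers = message_from_string(RAW_HEADERS)
content_type = headers.get_content_type()      # media type without parameters
charset = headers.get_content_charset()        # charset parameter, lowercased
length = int(headers["Content-Length"])
cache = parse_cache_control(headers["Cache-Control"])
print(content_type, charset, length, cache)
```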

https encryption methods

Symmetric-key encryption
Asymmetric-key encryption
Certificate-based key encryption

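5. requests module

A Python module for making network requests

Purpose: simulate a browser sending requests, to implement crawlers

Installation: pip install requests

Coding workflow:

Specify the url
Send the request
Get the response data
Persist the data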

1. Fetch the page source of the Sogou home page

import requests
# 1. Specify the url
url = 'https://www.sogou.com/'
# 2. Send the request: get returns a response object
response = requests.get(url=url)
# 3. Get the response data: text returns the response body as a string
page_text = response.text
# 4. Persist the data
with open('./sogou.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

2. A simple web page collector

Dynamic request parameters
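
Passing a params dict means the keyword ends up percent-encoded in the query string of the URL. The sketch below reproduces roughly that step with the standard library so it runs offline; build_search_url is a hypothetical helper and the keyword is made up.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(base_url, keyword):
    """Append the keyword as a percent-encoded 'query' parameter."""
    return base_url + "?" + urlencode({"query": keyword})

search_url = build_search_url("https://www.sogou.com/web", "web scraping")
print(search_url)

# Round-trip check: decoding the query string gives the keyword back.
decoded = parse_qs(urlsplit(search_url).query)
```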

url = 'https://www.sogou.com/web'
# Dynamic request parameter
wd = input('enter a key word:')
params = {
    'query':wd
}
response = requests.get(url=url,params=params)
page_text = response.text
fileName = wd+'.html'
with open(fileName,'w',encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName,'fetched successfully!')

Problems with the code above:

Garbled text
- response.encoding = 'xxx'
Missing data
- Anti-crawling mechanism: UA detection
- Counter-strategy: UA spoofing
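
The garbled-text problem comes from decoding the response bytes with the wrong character set, which is what setting response.encoding corrects. A minimal offline sketch of the effect, using a hypothetical response body and an illustrative decode_body helper:

```python
# Bytes a server might send for a UTF-8 page (hypothetical body).
raw_body = "搜狗搜索".encode("utf-8")

def decode_body(body, encoding):
    """Decode response bytes with the given character set."""
    return body.decode(encoding)

garbled = decode_body(raw_body, "iso-8859-1")   # wrong charset: unreadable text
correct = decode_body(raw_body, "utf-8")        # right charset: original text
print(repr(garbled))
print(correct)
```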

# Fixing the garbled text
url = 'https://www.sogou.com/web'
# Dynamic request parameter
wd = input('enter a key word:')
params = {
    'query':wd
}

response = requests.get(url=url,params=params)

# Manually set the encoding of the response data
response.encoding = 'utf-8'
page_text = response.text
fileName = wd+'.html'
with open(fileName,'w',encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName,'fetched successfully!')

# UA spoofing
url = 'https://www.sogou.com/web'
# Dynamic request parameter
wd = input('enter a key word:')
params = {
    'query':wd
}

# Spoofed User-Agent
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
response = requests.get(url=url,params=params,headers=headers)

# Manually set the encoding of the response data
response.encoding = 'utf-8'
page_text = response.text
fileName = wd+'.html'
with open(fileName,'w',encoding='utf-8') as fp:
    fp.write(page_text)
print(fileName,'fetched successfully!')

3. Dynamically loaded data

Data obtained through a separate network request (ajax)

Fetch the dynamically loaded movie details from Douban Movies:

url = 'https://movie.douban.com/j/chart/top_list'
# Dynamic parameters
params = {
    'type': '17',
    'interval_id': '100:90',
    'action': '',
    'start': '0',
    'limit': '200',
}
response = requests.get(url=url,params=params,headers=headers)
# json() returns the deserialized Python object
movie_list = response.json()
for movie in movie_list:
    print(movie['title'],movie['score'])

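Summary: when scraping an unfamiliar site, the first thing to determine is whether the target data is loaded dynamically.
    Yes: use a packet-capture tool to find the packet that carries the dynamically loaded data, and extract the url and request parameters from it.
    No: send the request directly to the url shown in the browser address bar.
How do you check whether the target data is loaded dynamically?
    Do a local search in the packet-capture tool, i.e. search the response of the address-bar url's packet for the data
        Found: not dynamically loaded
        Not found: dynamically loaded
How do you locate the dynamically loaded data?
    Do a global search across all packets in the packet-capture tool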
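
The local-search check can also be approximated in code: fetch the address-bar url and test whether a piece of the visible data appears in its raw source. The sketch below uses a hypothetical page source so it runs offline; is_dynamically_loaded is an illustrative helper.

```python
def is_dynamically_loaded(page_source, sample_text):
    """Return True when sample_text is absent from the page's own HTML."""
    return sample_text not in page_source

# Hypothetical page source: the movie list container is empty because the
# entries would be filled in later by an ajax request.
page_source = '<html><body><h1>Top list</h1><ul id="movies"></ul></body></html>'

title_is_dynamic = is_dynamically_loaded(page_source, "Top list")
movie_is_dynamic = is_dynamically_loaded(page_source, "Some Movie Title")
print(title_is_dynamic, movie_is_dynamic)
```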

4. Fetch KFC restaurant locations

http://www.kfc.com.cn/kfccda/storelist/index.aspx

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
data = {
    'cname': '',
    'pid': '',
    'keyword': '上海',
    'pageIndex': '1',
    'pageSize': '10',
}
address_dic = requests.post(url=url,data=data,headers=headers).json()
for dic in address_dic['Table1']:
    print(dic['addressDetail'])

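5. Interview question

- Requirements
https://www.fjggfw.gov.cn/Website/JYXXNew.aspx Fujian Public Resource Trading Center
Content to extract:
Bid-winner results / bid-winner candidate information under construction projects (工程建设)
1. The complete html of the bid-winning notice
2. The first-ranked bid-winner candidate
3. The winning bid amount
4. The bid award date
5. Other companies that took part in the bidding

- Approach
    - Confirm that the target data is all loaded dynamically
    - On the home page, capture the packet of the ajax request and extract the request url and parameters from it
    - Send a request to the extracted url and get the response data (json)
    - Extract the id of each notice from the json
    - Combine each id with the url of the bid-winner information, then send a request to fetch the bid-winner data of each notice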
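
The last two steps can be sketched offline as follows. The JSON payload is hypothetical (only the data and M_ID keys come from these notes), build_detail_url is an illustrative helper, and the extra url= query parameter carried by the real detail request is left out for brevity.

```python
import json
from urllib.parse import urlencode

# Hypothetical list response; only the 'data' and 'M_ID' keys come from the notes.
list_response = json.loads('{"data": [{"M_ID": "101"}, {"M_ID": "102"}]}')

DETAIL_ENDPOINT = "https://www.fjggfw.gov.cn/Website/AjaxHandler/BuilderHandler.ashx"

def build_detail_url(notice_id):
    """Combine a notice id with the detail endpoint's query parameters."""
    query = urlencode({"OPtype": "GetGGInfoPC", "ID": notice_id, "GGTYPE": 5})
    return DETAIL_ENDPOINT + "?" + query

notice_ids = [int(item["M_ID"]) for item in list_response["data"]]
detail_urls = [build_detail_url(notice_id) for notice_id in notice_ids]
print(detail_urls)
```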

post_url = 'https://www.fjggfw.gov.cn/Website/AjaxHandler/BuilderHandler.ashx'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Cookie': '_qddac=4-3-1.4euvh3.x4wulp.k1hj8mnw; ASP.NET_SessionId=o4xkycpib3ry5rzkvfcamxzk; Hm_lvt_94bfa5b89a33cebfead2f88d38657023=1570520304; __root_domain_v=.fjggfw.gov.cn; _qddaz=QD.89mfu7.7kgq8w.k1hj8mhg; _qdda=4-1.4euvh3; _qddab=4-x4wulp.k1hj8mnw; _qddamta_2852155767=4-0; _qddagsx_02095bad0b=2882f90558bd014d97adf2d81c54875229141367446ccfed2b0c8913707c606ccf30ec99a338fed545821a5ff0476fd6332b8721c380e9dfb75dcc00600350b31d85d17d284bb5d6713a887ee73fa35c32b7350c9909379a8d9f728ac0c902e470cb5894c901c4176ada8a81e2ae1a7348ae5da6ff97dfb43a23c6c46ec8ec10; Hm_lpvt_94bfa5b89a33cebfead2f88d38657023=1570520973'
}
data = {
    'OPtype': 'GetListNew',
    'pageNo': '1',
    'pageSize': '10',
    'proArea': '-1',
    'category': 'GCJS',
    'announcementType': '-1',
    'ProType': '-1',
    'xmlx': '-1',
    'projectName': '',
    'TopTime': '2019-07-10 00:00:00',
    'EndTime': '2019-10-08 23:59:59',
    'rrr': '0.7293828344656237',
}
post_data = requests.post(url=post_url,headers=headers,data=data).json()
for dic in post_data['data']:
    _id = int(dic['M_ID'])
    detail_url = 'https://www.fjggfw.gov.cn/Website/AjaxHandler/BuilderHandler.ashx?OPtype=GetGGInfoPC&ID={}&GGTYPE=5&url=AjaxHandler%2FBuilderHandler.ashx'.format(_id)
    company_data = requests.get(url=detail_url,headers=headers).json()['data']
    company_str = ''.join(company_data)
    print(company_str)

6. Data parsing

1. How do you fetch image data?

- With requests
- With urllib
- Difference: urlretrieve in urllib takes no headers argument, so it cannot spoof the UA by itself
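
If urllib has to be used and a UA is required, one option is to build a urllib Request object with headers and open it with urlopen instead of calling urlretrieve. This is a sketch under that assumption; build_image_request and the example.com URL are placeholders, and the network call is left commented out.

```python
from urllib import request

UA = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36')

def build_image_request(url):
    """Build a urllib Request that carries a browser-like User-Agent."""
    return request.Request(url, headers={'User-Agent': UA})

req = build_image_request('http://example.com/picture.jpg')  # placeholder URL
# urllib stores header names capitalized as 'User-agent'.
sent_ua = req.get_header('User-agent')
print(sent_ua)

# Sending it would look like this (commented out to avoid network access):
# with request.urlopen(req) as resp, open('./picture.jpg', 'wb') as fp:
#     fp.write(resp.read())
```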

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

# Image download with requests
url = 'http://tva1.sinaimg.cn/mw600/007QUzsKgy1g7qzr59hk7j30cs0gxn82.jpg'
img_data = requests.get(url=url,headers=headers).content # content returns the response data as bytes
with open('./123.jpg','wb') as fp:
    fp.write(img_data)

# Image download with urllib
from urllib import request
url = 'http://tva1.sinaimg.cn/mw600/007QUzsKgy1g7qzr59hk7j30cs0gxn82.jpg'
request.urlretrieve(url,'./456.jpg')

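2. Data parsing

Concept: extracting/parsing specific parts of the data from a whole page
Purpose: to implement focused crawlers
Approaches:
- Regular expressions
- bs4
- xpath
- pyquery
What is the general principle of data parsing?
- Locate the tags
- Extract the data
Where is the relevant string data stored in a page?
- Between tags
- In tag attributes

- Coding workflow of a focused crawler
    - Specify the url
    - Send the request
    - Get the response data
    - Parse the data
    - Persist the data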
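
The "locate the tag, then extract the data" principle can be illustrated with a regular expression on a small page fragment. The markup and the img.example.com addresses below are made up for illustration.

```python
import re

# Hypothetical page fragment; the markup is made up for illustration.
page_text = '''
<div class="text">
    <img src="//img.example.com/a.jpg" alt="first">
</div>
<div class="text">
    <img src="//img.example.com/b.jpg" alt="second">
</div>
'''

# Locate: each div with class "text". Extract: the src attribute and alt text.
pattern = r'<div class="text">.*?<img src="(.*?)" alt="(.*?)">.*?</div>'

# re.S lets '.' match newlines, so one pattern can span several lines.
matches = re.findall(pattern, page_text, re.S)
img_urls = ['http:' + src for src, _alt in matches]
print(img_urls)
```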

Regex parsing

- Download the images from jandan.net (煎蛋网) and store them locally:

import re
import os

dirName = './imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
    
url = 'http://jandan.net/pic/MjAxOTEwMDktNjY=#comments'
page_text = requests.get(url,headers=headers).text
# Parse the data: the src attribute of each img tag
ex = '<div class="text">.*?<img src="(.*?)" referrerPolicy.*?</div>'
img_src_list = re.findall(ex,page_text,re.S)
for src in img_src_list:
    if 'org_src' in src:
        src = re.findall('org_src="(.*?)" onload',src)[0]
    src = 'http:'+src
    imgName = src.split('/')[-1]
    imgPath = dirName+'/'+imgName
    request.urlretrieve(src,imgPath)
    print(imgName,'downloaded successfully!')

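bs4 parsing

- Installation:
  - pip install bs4
  - pip install lxml
- How bs4 parsing works:
  - Instantiate a BeautifulSoup object and load the page source to be parsed into it
  - Call the object's methods and attributes to locate tags and extract data
- Instantiating BeautifulSoup
  - BeautifulSoup(fp,'lxml'): loads the page source of a locally stored html document into the object
  - BeautifulSoup(page_text,'lxml'): loads page source requested from the internet into the object
- Locating tags
  - soup.tagName: locates only the first tagName tag
  - Attribute-based: soup.find('tagName',attrName='value') locates only the first matching tag (for the class attribute, write class_)
    - find_all (older alias: findAll): returns a list of all matching tags
  - Selector-based: soup.select('selector')
    - Selectors: id, class, tag, hierarchy selectors (> means one level, a space means any number of levels)
- Getting text
  - text: all the text inside the tag, including descendants
  - string: only the tag's own direct text (None when the tag has more than one child)
- Getting attributes
  - tag['attrName']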
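
A small self-contained sketch of these lookups, run on an inline document instead of a file. The markup is hypothetical (the class names mirror those used in these notes), and html.parser is used instead of lxml only so the example has one dependency fewer.

```python
from bs4 import BeautifulSoup

# Hypothetical document; the class names mirror those in the notes.
html = '''
<div class="song"><p>first <b>bold</b></p></div>
<div class="tang">
  <ul>
    <li><a href="http://example.com/1" id="feng">one</a></li>
    <li><a href="http://example.com/2">two</a></li>
  </ul>
</div>
'''

# 'html.parser' is the standard-library parser, used so the sketch
# does not depend on lxml being installed.
soup = BeautifulSoup(html, 'html.parser')

first_div_class = soup.div['class']                 # first div only
tang = soup.find('div', class_='tang')              # attribute-based lookup
links = soup.select('.tang > ul > li > a')          # one level per '>'
hrefs = [a['href'] for a in links]
link_text = soup.select('#feng')[0].string          # direct text of one tag
p_tag = soup.find('p')
print(first_div_class, hrefs, link_text, p_tag.text, p_tag.string)
```

Note that p_tag.string is None here because the p tag has two children (a text node and a b tag), while p_tag.text joins all nested text.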

from bs4 import BeautifulSoup
fp = open('./test.html',encoding='utf-8')
soup = BeautifulSoup(fp,'lxml')
# soup.div
# soup.find('div',class_='song')
# soup.findAll('div',class_='song')
# soup.select('#feng')[0]
# soup.select('.tang > ul > li > a')
# soup.select('.tang a')
# tag = soup.b
# tag.string
# div_tag = soup.find('div',class_='tang')
# div_tag.text
a_tag = soup.select('#feng')[0]
a_tag

- Use bs4 to parse the chapter titles and content of the novel Romance of the Three Kingdoms (三国演义) and store them locally:

main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=main_url,headers=headers).text
# Parse the data: chapter titles and detail-page urls
soup = BeautifulSoup(page_text,'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('./sanguo.txt','w',encoding='utf-8')
for a in a_list:
    title = a.string
    detail_url = 'http://www.shicimingju.com'+a['href']
    detail_page_text = requests.get(url=detail_url,headers=headers).text
    # Parse the data: chapter content
    detail_soup = BeautifulSoup(detail_page_text,'lxml')
    div_tag = detail_soup.find('div',class_='chapter_content')
    content = div_tag.text
    
    fp.write(title+':'+content+'\n')
    print(title,'written successfully!')
fp.close()

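xpath parsing

- Installation
  - pip install lxml
- How it works
  - Instantiate an etree object and load the page source to be parsed into it
  - Call the object's xpath method with different forms of xpath expressions to locate tags and extract data
- Instantiating an etree object
  - etree.parse('fileName')  - local document
  - etree.HTML(page_text) - data from a network request
- Locating tags
  - Leading /: the path must start from the root tag
  - Non-leading /: one level
  - Leading //: the tag may be located from any position
  - Non-leading //: any number of levels
  - Attribute-based: //tagName[@attrName="value"]
  - Index-based: //tagName[@attrName="value"]/li[2]; indexes start at 1
  - Logical operators:
    - Find the a tag whose href is empty and whose class is du
    - //a[@href="" and @class="du"]
  - Fuzzy matching:
    - //div[contains(@class, "ng")]
    - //div[starts-with(@class, "ta")]
- Getting text
  - /text(): direct text content
  - //text(): all text content
- Getting attributes
  - /@attrName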
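
The expressions above can be tried on an inline document. The markup below is hypothetical; the class names mirror those used in these notes.

```python
from lxml import etree

# Hypothetical document; class names mirror those in the notes.
html = '''
<html><body>
<div class="song"><p>first <b>bold</b></p></div>
<div class="tang">
  <ul>
    <li><a href="http://example.com/1">one</a></li>
    <li><a href="http://example.com/2">two</a></li>
  </ul>
</div>
</body></html>
'''

tree = etree.HTML(html)

second_link = tree.xpath('//div[@class="tang"]/ul/li[2]/a/text()')   # index starts at 1
hrefs = tree.xpath('//div[@class="tang"]//a/@href')                  # attribute values
direct_text = tree.xpath('//div[@class="song"]/p/text()')            # direct text only
all_text = tree.xpath('//div[@class="song"]//text()')                # all nested text
fuzzy = tree.xpath('//div[starts-with(@class, "ta")]/@class')        # fuzzy match
print(second_link, hrefs, direct_text, all_text, fuzzy)
```

xpath always returns a list, which is why the code in these notes indexes the result with [0].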

from lxml import etree
tree = etree.parse('./test.html')
# tree.xpath('/html//title')
# tree.xpath('//div')
# tree.xpath('//div[@class="tang"]')
# tree.xpath('//div[@class="tang"]/ul/li[2]')
# tree.xpath('//p[1]/text()')
# tree.xpath('//div[@class="song"]//text()')
tree.xpath('//img/@src')[0]

Task: scrape Huya (虎牙) streamer names, popularity, and titles

url = 'https://www.huya.com/g/xingxiu'
page_text = requests.get(url=url,headers=headers).text

# Parse the data
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="box-bd"]/ul/li')
for li in li_list:
    # Parse specific data within a local part of the page (note the leading ./)
    title = li.xpath('./a[2]/text()')[0]
    author = li.xpath('./span/span[1]/i/text()')[0]
    hot = li.xpath('./span/span[2]/i[2]/text()')[0]
    
    print(title,author,hot)

Scrape the images on the first five pages of http://pic.netbian.com/4kmeinv/
- Handling garbled Chinese text
- Scraping multiple pages
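
Both points can be sketched offline: generating the url of each page from a template, and repairing a string that was decoded as ISO-8859-1 although the bytes were GBK. page_url and repair_gbk are illustrative helper names, and the garbled string is simulated rather than fetched.

```python
BASE_URL = 'http://pic.netbian.com/4kmeinv/'
PAGE_TEMPLATE = 'http://pic.netbian.com/4kmeinv/index_%d.html'

def page_url(page):
    """Page 1 is the bare directory URL; later pages use the index_N template."""
    return BASE_URL if page == 1 else PAGE_TEMPLATE % page

def repair_gbk(text):
    """Undo a wrong ISO-8859-1 decode of bytes that were really GBK."""
    return text.encode('iso-8859-1').decode('gbk')

urls = [page_url(page) for page in range(1, 6)]

# Simulate the garbling: GBK bytes wrongly decoded as ISO-8859-1.
garbled_name = '图片'.encode('gbk').decode('iso-8859-1')
fixed_name = repair_gbk(garbled_name)
print(urls[0], urls[1], fixed_name)
```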

# url = 'http://pic.netbian.com/4kmeinv/' # first page
# A general url template: it never changes
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
dirName = './MZLib'
if not os.path.exists(dirName):
    os.mkdir(dirName)
    
for page in range(1,6):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        new_url = url % page
    page_text = requests.get(url=new_url,headers=headers).text
    # Parse the data: image url & image name
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        img_name = li.xpath('./a/img/@alt')[0]
        img_name = img_name.encode('iso-8859-1').decode('gbk')+'.jpg'
        img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        img_data = requests.get(img_src,headers=headers).content # binary image data
        img_path = dirName+'/'+img_name
        with open(img_path,'wb') as fp:
            fp.write(img_data)
    print('Page {} finished!'.format(page))

Scrape the names of all cities nationwide
- https://www.aqistudy.cn/historydata/

url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
# hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
# all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text()')
tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="bottom"]/ul/li/a/text()')

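7. Proxies

A proxy here means a proxy server
What a proxy does:
    Forwards request and response data
How proxies relate to crawlers:
    A proxy lets the crawler change the ip address its requests come from
Proxy sites:
    1. Xici (西祠) https://www.xicidaili.com/nn/
    2. Kuaidaili (快代理)
    3. www.goubanjia.com
    4. Daili Jingling (代理精灵) http://http.zhiliandaili.cn/
Proxy anonymity levels:
    Elite (high anonymity): the target server cannot tell that a proxy is used, nor learn the real ip
    Anonymous: the target server knows a proxy is used but cannot find the real ip
    Transparent: the target server knows a proxy is used and knows the real ip
Types:
    http
    https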
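
The proxy type decides the key used in the proxies mapping passed to requests. A minimal sketch of turning scraped proxy entries into such mappings and picking one at random; the entries are hypothetical and to_proxies is an illustrative helper.

```python
import random

# Hypothetical scraped entries; the addresses are from the documentation-only
# range 203.0.113.0/24 and are not real proxies.
scraped = [
    {'ip': '203.0.113.10', 'port': '8080', 'type': 'HTTP'},
    {'ip': '203.0.113.11', 'port': '3128', 'type': 'HTTPS'},
]

def to_proxies(entry):
    """Turn one scraped entry into a mapping usable as the proxies= argument."""
    scheme = entry['type'].lower()
    return {scheme: '{}:{}'.format(entry['ip'], entry['port'])}

pool = [to_proxies(entry) for entry in scraped]
chosen = random.choice(pool)   # pick a different exit ip per request
print(chosen)
```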

# Send a request through a proxy
import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Connection':'close'
}
url = 'https://www.baidu.com/s?ie=UTF-8&wd=ip'
page_text = requests.get(url,headers=headers,proxies={'https':'125.87.99.237:22007'}).text
with open('./ip.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

Build a free proxy pool (use paid proxy ips to scrape the ips listed on free proxy sites)

# Build a pool of paid proxies
import random
from lxml import etree  # needed below, so import it before first use
ips_pool = []
url = 'http://ip.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=103&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
ip_list = tree.xpath('//body//text()')
for ip in ip_list:
    dic = {'https':ip}
    ips_pool.append(dic)

from lxml import etree
url = 'https://www.xicidaili.com/nn/%d' # general url template (never changes)
all_ips = []
for page in range(1,5):
    new_url = url % page
    page_text = requests.get(new_url,headers=headers,proxies=random.choice(ips_pool)).text
    tree = etree.HTML(page_text)
    # The tbody tag must not appear in the xpath expression
    tr_list = tree.xpath('//*[@id="ip_list"]//tr')[1:]
    for tr in tr_list:
        ip = tr.xpath('./td[2]/text()')[0]
        port = tr.xpath('./td[3]/text()')[0]
        type_ip = tr.xpath('./td[6]/text()')[0]
        dic = {
            'ip':ip,
            'port':port,
            'type':type_ip
        }
        all_ips.append(dic)
                
print(len(all_ips))

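8. Handling cookies

Task: scrape the news data from https://xueqiu.com/
Handling cookies in a crawler
    Manually: write the cookie into headers
    Automatically: use a session object.
Getting a session object: requests.Session()
Purpose:
    Both a session object and the requests module can send requests to a given url. The difference is that when a request sent through a session produces a cookie, the cookie is stored in the session object automatically and sent with its later requests.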
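
To see the automatic handling without touching the network, the sketch below stores a cookie in a session by hand and inspects the request the session would send. The cookie name and value and the example.com url are made up; in real use the cookie would arrive through a response's Set-Cookie header.

```python
import requests

session = requests.Session()

# Normally a response's Set-Cookie header fills the jar; here a cookie with a
# made-up name and value is stored by hand so the sketch runs offline.
session.cookies.set('demo_token', 'abc123')

# prepare_request builds the request the session would send, without sending it.
prepared = session.prepare_request(
    requests.Request('GET', 'https://example.com/news', headers={'User-Agent': 'demo'})
)
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```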

url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20352188&count=15&category=-1'

news_json = requests.get(url,headers=headers).json()
news_json

# Corrected version with cookie handling
session = requests.Session()
url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&max_id=20352188&count=15&category=-1'
# First request: its purpose is to obtain the cookie and store it in the session
session.get('https://xueqiu.com/',headers=headers) 

# This request succeeds only if it carries the corresponding cookie
news_json = session.get(url,headers=headers).json()
news_json

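9. Captcha recognition

Use an online captcha-solving platform for automatic recognition:
    - Yundama (云打码)
    - Chaojiying (超级鹰):
        - Register an account with the "User Center" role
        - Log in
            - Create a software entry
            - Download the sample code from "Development Docs"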
import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: question type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: image ID of the question being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()
    
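chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')  # User Center >> Software ID: generate one and use it in place of 96001
im = open('a.jpg', 'rb').read()                                                 # path of the local image file, in place of a.jpg; on Windows the path sometimes needs //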
print(chaojiying.PostPic(im,1004)['pic_str'])   




# Wrap captcha recognition in a function
def transformCode(imgPath,imgType):
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im,imgType)['pic_str']

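Simulated login

Version 1:

Problem with version 1:

The request needs dynamic parameters

Dynamically changing request parameters are usually hidden in the front-end page source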
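
As an offline illustration of pulling such hidden parameters out of page source, the sketch below uses a regular expression on a hypothetical login form. The field values are made up and hidden_value is an illustrative helper; the same values could equally be read with xpath.

```python
import re

# Hypothetical login page fragment; the field values are made up.
page_text = '''
<form method="post" action="./login.aspx">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123==" />
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C0FFEE" />
</form>
'''

def hidden_value(html, field_id):
    """Return the value attribute of the hidden input with the given id."""
    pattern = r'<input[^>]*id="{}"[^>]*value="([^"]*)"'.format(re.escape(field_id))
    match = re.search(pattern, html)
    return match.group(1) if match else None

data = {
    '__VIEWSTATE': hidden_value(page_text, '__VIEWSTATE'),
    '__VIEWSTATEGENERATOR': hidden_value(page_text, '__VIEWSTATEGENERATOR'),
}
print(data)
```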

from urllib import request

# Captcha recognition: download the captcha image locally, then submit it to the captcha-solving platform
main_url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(main_url,headers=headers).text
tree = etree.HTML(page_text)
code_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
request.urlretrieve(code_src,'./code.jpg')

# Recognize the captcha
code_text = transformCode('./code.jpg',1004)


login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': '8/BKAQBaZHn7+GP+Kl2Gx43fFO1NI32RMyVae0RyrtFQue3IAhzQKvkml41cIT42Y//OcQccA8AqGYkvB+NFkU43uaHqU69Y0Z1WT3ZRrr4vR+CF7JlBG29POXM=',
    '__VIEWSTATEGENERATOR': 'C93BE1AE',
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登陆',
}
print(code_text)
page_text = requests.post(login_url,headers=headers,data=data).text

with open('./login.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

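Version 2:

Problem with version 2:

No cookie is carried, and on this site the cookie is set by the captcha request

# Captcha recognition: download the captcha image locally, then submit it to the captcha-solving platform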
main_url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(main_url,headers=headers).text
tree = etree.HTML(page_text)
code_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]
request.urlretrieve(code_src,'./code.jpg')

# Parse out the dynamically changing request parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# Recognize the captcha
code_text = transformCode('./code.jpg',1004)


login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登陆',
}
print(code_text)
page_text = requests.post(login_url,headers=headers,data=data).text

with open('./login.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

Version 3 (final version):

s = requests.Session()
# Captcha recognition: download the captcha image locally, then submit it to the captcha-solving platform
main_url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = s.get(main_url,headers=headers).text
tree = etree.HTML(page_text)
code_src = 'https://so.gushiwen.org'+tree.xpath('//*[@id="imgCode"]/@src')[0]

# request.urlretrieve(code_src,'./code.jpg')
code_data = s.get(code_src,headers=headers).content
with open('./code.jpg','wb') as fp:
    fp.write(code_data)

# Parse out the dynamically changing request parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# Recognize the captcha
code_text = transformCode('./code.jpg',1004)


login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登陆',
}
print(code_text)
page_text = s.post(login_url,headers=headers,data=data).text

with open('./login.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

- Anti-crawling mechanisms
    - robots
    - UA detection
    - Lazy-loaded images
    - Proxies
    - cookie
    - Captchas
    - Dynamically changing request parameters
    - Dynamically loaded data

10. Using a thread pool to speed up scraping

# Synchronous version

import time

start = time.time()
def request(url):
    print('Requesting', url)
    time.sleep(2)
    print('Finished:', url)
urls = [
    'www.1.com',
    'www.b.com',
    'www.3.com'
]

for url in urls:
    request(url)
print('Total time:', time.time()-start)

# Asynchronous version (thread pool)

import time
from multiprocessing.dummy import Pool

start = time.time()
pool = Pool(3)
def request(url):
    print('Requesting', url)
    time.sleep(2)
    print('Finished:', url)

urls = [
    'www.1.com',
    'www.b.com',
    'www.3.com'
]

pool.map(request,urls)

print('Total time:', time.time()-start)
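
A point worth noting about pool.map is that it returns the results in the same order as the inputs, even though the calls run concurrently. A small sketch with a stand-in for the network request; fake_download and the addresses are placeholders.

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool, same API as multiprocessing

def fake_download(url):
    """Stand-in for a network request: wait briefly, then return a result."""
    time.sleep(0.2)
    return 'content of ' + url

urls = ['www.1.com', 'www.2.com', 'www.3.com']  # placeholder addresses

start = time.time()
# The with-block shuts the pool down once the work is finished.
with Pool(3) as pool:
    results = pool.map(fake_download, urls)
elapsed = time.time() - start

# map returns results in the same order as the inputs.
print(results, round(elapsed, 2))
```

With three worker threads the three waits overlap, so the elapsed time should be close to one wait rather than the sum of all three.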

# Crawler + thread pool
# Server side
from flask import Flask,render_template
from  time import sleep
app = Flask(__name__)

@app.route('/bobo')
def index_bobo():
    sleep(2)
    return render_template('ip.html')

@app.route('/jay')
def index_jay():
    sleep(2)
    return render_template('login.html')
app.run()

# Crawler + thread pool (client side)
import time
from multiprocessing.dummy import Pool
import requests
from lxml import etree
start = time.time()
urls = [
    'http://localhost:5000/jay',
    'http://localhost:5000/bobo'
]

def get_request(url):
    page_text = requests.get(url).text
    return page_text


def parse(page_text):
    tree = etree.HTML(page_text)
    print(tree.xpath('//div[1]//text()'))


pool = Pool(2)
page_text_list = pool.map(get_request,urls)


pool.map(parse,page_text_list)


print(len(page_text_list))


print('Total time:', time.time()-start)