Web crawler basics: usage examples, practical tips, a summary of the key points, and things to watch out for
Crawling:
+ requests + Scrapy
Data analysis + machine learning
+ numpy, pandas, matplotlib
Jupyter:
+ Start it: cd into the folder you want to work in, then run jupyter notebook
Cells come in different modes (Code: for writing code, Markdown: for writing notes)
Jupyter shortcuts:
Add a cell: a, b (a inserts above, b inserts below)
Delete a cell: x
Run: Shift+Enter (run and move the cursor to the next cell), Ctrl+Enter (run and keep the cursor in the current cell)
Tab: auto-complete. Switch a cell's mode: m (to Markdown), y (to Code)
Open the help tooltip: Shift+Tab
1. What is a crawler?
The process of writing a program that simulates a browser and then sends it off to crawl data from the internet.
2. Types of crawlers:
General-purpose crawler: fetches entire pages from the internet.
Focused crawler: extracts only part of the data on a page.
Incremental crawler: monitors a site for updates so that only the newly published data gets crawled.
3. Anti-crawling mechanisms: the measures a site uses to block crawlers (e.g. User-Agent checks, the robots protocol, captchas).
4. Counter-anti-crawling strategies: the techniques a crawler uses to get around those measures (e.g. UA spoofing, proxies, cookie handling); a minimal UA-spoofing sketch follows below.
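A minimal sketch of the most common counter-strategy, UA spoofing with requests, which is worked through in detail later in these notes; the target URL here is only a placeholder.

import requests

# Forged User-Agent: without it, requests identifies itself as python-requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get('https://www.sogou.com/', headers=headers)   # placeholder URL
print(response.status_code)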
5. Is crawling legal?
5.1 Where the risk in crawling shows up:
The crawler interferes with the normal operation of the site being visited;
The crawler collects specific types of data or information that are protected by law.
5.2 Mitigating the risk:
Strictly follow the robots protocol the site has published;
While working around anti-crawling measures, optimize your own code so that it does not disturb the normal operation of the site being visited;
When using or redistributing the crawled content, review it first; if any of it turns out to be users' personal information, private data, or someone else's trade secrets, stop immediately and delete it.
6. The robots protocol:
A plain-text convention: it keeps honest people out but does nothing against the dishonest, i.e. it is advisory rather than technically enforced (a quick check is sketched below).
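A minimal sketch, using only the standard library's urllib.robotparser, of how a well-behaved crawler could check robots.txt before fetching a URL; the site and User-Agent used here are just examples.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')   # example site
rp.read()
# Ask whether a given User-Agent is allowed to fetch a given URL
print(rp.can_fetch('*', 'https://www.sogou.com/web?query=python'))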
What is the requests module? A ready-made Python module for sending network requests.
What is it for? Simulating a browser sending requests.
Installing it: pip install requests
The coding workflow with requests: specify the URL, send the request, get the response data, persist it.
import requests

# 1. Specify the URL
url = 'https://www.sogou.com/'
# 2. Send a GET request: get() returns a response object
response = requests.get(url=url)
# 3. Get the response data (.text is the response body as a string)
page_text = response.text
# 4. Persist it
with open('sogou.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Making the URL parameters dynamic:
import requests

url = 'https://www.sogou.com/web'
# Build the parameters dynamically
wd = input('enter a key:')
params = {
    'query': wd
}
# The dict of request parameters is passed to the params argument of get()
response = requests.get(url=url, params=params)
page_text = response.text
file_name = wd + '.html'
with open(file_name, encoding='utf-8', mode='w') as fp:
    fp.write(page_text)
The same request, this time fixing garbled characters by setting the response encoding:
import requests

url = 'https://www.sogou.com/web'
wd = input('enter a key')
params = {
    'query': wd
}
response = requests.get(url=url, params=params)
response.encoding = 'utf-8'   # set the encoding before reading .text
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
The same request again, now with UA spoofing via the headers argument:
import requests

url = 'https://www.sogou.com/web'
wd = input('enter a key')
params = {
    'query': wd
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Dynamically loaded page data is fetched by a separate request:
import requests

url = 'https://movie.douban.com/j/chart/top_list'
start = input('start of the movie range: ')
end = input('end of the movie range: ')
dic = {
    'type': '13',
    'interval_id': '100:90',
    'action': '',
    'start': start,
    'end': end
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=dic, headers=headers)
page_text = response.json()   # .json() returns the deserialized response (here a list of dicts)
for dic in page_text:
    print(dic['title'] + dic['score'])
POST request with form data (KFC store list, paging through the results):
import requests

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
site = input('Enter a location >>')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
for page in range(1, 5):
    data = {
        'cname': '',
        'pid': '',
        'keyword': site,
        'pageIndex': str(page),   # use the loop variable so each page is actually requested
        'pageSize': '10'
    }
    response = requests.post(url=url, data=data, headers=headers)
    print(response.json())
What data parsing is for: it is what makes a focused crawler possible.
Ways to implement it: regular expressions, bs4, xpath, pyquery.
The general principle of data parsing:
1. The data a crawler wants is stored inside tags and in tag attributes
2. Locate the tag
3. Extract its text or its attributes
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
# 1. Fetching byte-type data (how to download an image)
url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
img_data = requests.get(url=url).content   # use .content for byte data
with open('./img.jpg', mode='wb') as fp:
    fp.write(img_data)

# Alternative: urllib (drawback: no UA spoofing)
from urllib import request
# url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
# request.urlretrieve(url, filename='./qutu.jpg')
import os
import re

# Crawl all the images from pages 1-3 of the qiushibaike image board
# 1. Use a general crawl to fetch the page source of the first 3 pages
# Generic URL template (do not change it)
dirName = './imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = 'https://www.qiushibaike.com/imgrank/page/%d/'
# 2. Download the images
for page in range(1, 4):   # pages 1-3
    new_url = format(url % page)
    page_text = requests.get(url=new_url, headers=headers).text   # page source for each page number
    ex = '<div class="thumb">.*?<img src="(.*?)".*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    for src in img_src_list:
        src = 'https:' + src
        img_name = src.split('/')[-1]
        img_path = dirName + '/' + img_name   # ./imgLibs/xxxx.jpg
        request.urlretrieve(src, filename=img_path)
        print(img_name, 'downloaded')
bs4 parsing; how it works:
Instantiate a BeautifulSoup object and load the page source to be parsed into it
Call the BeautifulSoup object's methods and attributes to locate tags and extract data
Installation:
pip install bs4
pip install lxml
Instantiating BeautifulSoup:
BeautifulSoup(fp,'lxml'): loads the contents of a locally stored HTML file into the BeautifulSoup object
BeautifulSoup(page_text,'lxml'): loads page source fetched from the internet into the BeautifulSoup object
Locating tags:
soup.tagName: locates the first occurrence of tagName
Attribute-based: soup.find('tagName',attrName='value')
Attribute-based: soup.find_all('tagName',attrName='value'), returns a list
Selector-based: soup.select('selector'), returns a list
Hierarchical selectors: > means one level, a space means any number of levels
Getting text:
.string: gets only the direct text content
.text: gets all the text content
Getting attributes:
tagName['attrName']
Locating tags:
from bs4 import BeautifulSoup

fp = open('./test.html', mode='r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
print(soup.div)                             # first div in the document
# find-related
print(soup.find('div', class_='song'))      # only the class keyword needs the trailing underscore
print(soup.find('a', id='feng'))
print(soup.find_all('div', class_='song'))  # returns a list
# select-related
print(soup.select('#feng'))                 # returns a list
print(soup.select('.tang > ul > li'))       # returns a list; > means one level
print(soup.select('.tang li'))              # returns a list; a space means any number of levels

# Getting text
a_tag = soup.select('#feng')[0]
print(a_tag.text)
div = soup.div
print(div.string)                           # only the direct text content
div = soup.find('div', class_='song')
print(div.string)

# Getting attributes
a_tag = soup.select('#feng')[0]
print(a_tag['href'])
Crawling the whole of Romance of the Three Kingdoms (chapter titles + chapter content) from http://www.shicimingju.com/book/sanguoyanyi.html:
fp = open('./sanguo.txt', mode='w', encoding='utf-8')
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=main_url, headers=headers).text
soup1 = BeautifulSoup(page_text, 'lxml')
title_list = soup1.select('.book-mulu > ul > li > a')
for page in title_list:
    title = page.string
    title_url = 'https://www.shicimingju.com' + page['href']
    title_text = requests.get(url=title_url, headers=headers).text
    # Parse the chapter content out of the detail page
    soup = BeautifulSoup(title_text, 'lxml')
    content = soup.find('div', class_='chapter_content').text
    fp.write(title + ':' + content + '\n')
    print(f'{title} downloaded')
fp.close()
How xpath parsing works:
1. Instantiate an etree object and load the page source to be parsed into it
2. Call the etree object's xpath method with different forms of xpath expressions to locate tags and extract data
Installation:
pip install lxml (etree is part of the lxml package)
Instantiating the object:
etree.parse('test.html') # local file
etree.HTML(page_text) # page fetched from the internet
xpath expressions: the xpath method always returns a list
A leading / means the expression must locate the tag level by level starting from the root
A leading // means the expression can locate the tag from anywhere in the document
A non-leading / means one level
A non-leading // means skipping across multiple levels
Attribute-based: //tagName[@attrName="value"]
Index-based: //tagName[index], indices start at 1
Getting text: /text() gets the direct text content, //text() gets all the text content
Getting attributes: /@attrName
from lxml import etree

tree = etree.parse('./test.html')
# Locating tags
print(tree.xpath('/html/head/title'))
print(tree.xpath('//title'))
print(tree.xpath('/html/body//p'))
print(tree.xpath('//p'))
# Attribute- and index-based location
print(tree.xpath('//div[@class="song"]'))
print(tree.xpath('//li[3]'))                        # returns element objects
# Getting text
print(tree.xpath('//a[@id="feng"]/text()')[0])      # xpath returns a list
print(tree.xpath('//div[@class="song"]//text()'))   # returns a list
# Getting attributes
print(tree.xpath('//a[@id="feng"]/@href'))          # returns a list
# Crawl the joke text and author names from the qiushibaike text board
url = 'https://www.qiushibaike.com/text/'
page_text = requests.get(url, headers=headers).text
# Parse the content
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="content-left"]/div')
for div in div_list:
    author = div.xpath('./div[1]/a[2]/h2/text()')[0]   # local parsing relative to the current div
    content = div.xpath('./a[1]/div/span//text()')
    content = ''.join(content)
    print(author, content)
# https://www.aqistudy.cn/historydata/  crawl all the city names
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
print(tree)
city_list1 = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
print(city_list1)
city_list2 = tree.xpath('//ul[@class="unstyled"]//li/a/text()')
print(city_list2)
# Use | to make the xpath expression more general: whichever sub-expression
# matches contributes results, and if both match, both sets are returned
cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //ul[@class="unstyled"]//li/a/text()')
print(cities)
# http://pic.netbian.com/4kmeinv/  handling garbled Chinese characters
dirName = './meinvLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1, 11):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        new_url = format(url % page)
    page_text = requests.get(new_url, headers=headers).text
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    for a in a_list:
        img_src = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        img_name = a.xpath('./b/text()')[0]
        img_name = img_name.encode('iso-8859-1').decode('gbk')   # re-encode and decode the garbled part
        img_data = requests.get(img_src, headers=headers).content
        imgPath = dirName + '/' + img_name + '.jpg'
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded!!!')
HTTPConnectionPool errors: the usual causes are sending high-frequency requests from a single IP (the site blocks it) or exhausting the connection pool; the usual fixes are adding 'Connection': 'close' to the headers and routing requests through proxies.
Proxy: a proxy server that accepts your request and forwards it on your behalf.
Anonymity levels: transparent, anonymous, high-anonymity (elite).
Types: http, https.
Free proxy sources: www.goubanjia.com, Kuaidaili (快代理), Xici (西祠).
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': 'close'
}
# url = 'https://www.baidu.com/s?wd=ip'
url = 'http://ip.chinaz.com/'
page_text = requests.get(url=url, headers=headers, proxies={'http': '123.169.122.111:9999'}).text
with open('./ip.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
import random

# A pool of proxies; pick one at random for each request
proxy_list = [
    {'https': '121.231.94.44:8888'},
    {'https': '131.231.94.44:8888'},
    {'https': '141.231.94.44:8888'}
]
url = 'https://www.baidu.com/s?wd=ip'
page_text = requests.get(url=url, headers=headers, proxies=random.choice(proxy_list)).text
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
from lxml import etree

# Pull a batch of proxy IPs from the provider's API
ip_url = 'http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=4&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(ip_url, headers=headers).text
tree = etree.HTML(page_text)
ip_list = tree.xpath('//body//text()')
print(ip_list)
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': 'close'
}
# url = 'https://www.xicidaili.com/nn/%d'  # Xici proxy (site is gone)
url = 'https://www.kuaidaili.com/free/inha/%d/'
proxy_list_http = []
proxy_list_https = []
for page in range(1, 20):
    new_url = format(url % page)
    ip_port = random.choice(ip_list)
    page_text = requests.get(new_url, headers=headers, proxies={'https': ip_port}).text
    tree = etree.HTML(page_text)
    # tbody must not appear in the xpath expression; xpath indices start at 1
    tr_list = tree.xpath('//*[@id="list"]/table//tr')[1:]
    for tr in tr_list:
        ip = tr.xpath('./td[1]/text()')[0]       # xpath returns a list
        port = tr.xpath('./td[2]/text()')[0]
        t_type = tr.xpath('./td[4]/text()')[0]
        ips = ip + ':' + port
        dic = {t_type: ips}
        if t_type == 'HTTP':
            proxy_list_http.append(dic)
        else:
            proxy_list_https.append(dic)
print(len(proxy_list_http), len(proxy_list_https))
# Check which of the collected proxies actually work
for dic in proxy_list_http:
    ip = list(dic.values())[0]
    response = requests.get('https://www.sogou.com', headers=headers, proxies={'https': ip})
    if response.status_code == 200:
        print('found a usable ip:', ip)
Handling cookies:
Manual handling: put the cookie into the headers yourself.
Automatic handling: the session object. You can create a session object that sends requests just like requests does;
the difference is that any cookie produced while sending requests through the session is stored in the session object automatically.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Cookie': 'device_id=24700f9f1986800ab4fcc880530dd0ed; xq_a_token=db48cfe87b71562f38e03269b22f459d974aa8ae; xqat=db48cfe87b71562f38e03269b22f459d974aa8ae; xq_r_token=500b4e3d30d8b8237cdcf62998edbf723842f73a; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYwNjk2MzA1MCwiY3RtIjoxNjA1NTM1Mjc2NzYxLCJjaWQiOiJkOWQwbjRBWnVwIn0.PhEaPnWolUZRgyuOY-QO04Bn_A_HYU46Hm54_kWBxa8IZ6cFw20trOr7rKp7XztprxEFc7fkMN2_5abfh1TUyyFKqTDn7IfoThXyJ2lJCnH33q1q-K9BclYvLHrLGqt8jQ3YOJi7-nyiSb5ZTNk7TLEhiFfsbXaZK9evNrt7W65MdxoEWyCcGjbhI5znffRxDDLHD9511bd9upY9CUGbf4SHQwwx4PxyQqdy9j5bgqPN6rsuHoCvjcr42DZYRd8B72uQTkFs-Lnru4AFxt4o4gdaxPo_Qd_IqzCrXnwoLtCdX6n4NKV44SryBttE0SKQC6UbqC35PwN-JqPeWCHKpQ; u=201605535281005; Hm_lvt_1db88642e346389874251b5a1eded6e3=1605354060,1605411081,1605535282; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1605535282'
}
params = {
    'status_id': '163425862',
    'page': '1',
    'size': '14'
}
url = 'https://xueqiu.com/statuses/reward/list_by_user.json?status_id=163425862&page=1&size=14'
page_text = requests.get(url=url, headers=headers, params=params).json()
print(page_text)
session = requests.Session()
# Automatic cookie handling: the homepage's cookie is stored in the session and
# reused when crawling the other pages later on
session.get('https://xueqiu.com/', headers=headers)
url = 'https://xueqiu.com/statuses/reward/list_by_user.json?status_id=163425862&page=1&size=14'
page_text = session.get(url=url, headers=headers).json()
print(page_text)
import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        # Username, password and software id of your Chaojiying account
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                          data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the image ID of a wrongly recognised captcha
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php',
                          data=params, headers=self.headers)
        return r.json()
def tranformImgData(imgPath, t_type):
    # imgPath: path of the captcha image; t_type: the captcha's type code
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')   # your registered Chaojiying username, password and software id
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, t_type)['pic_str']


# Fetch the captcha image from gushiwen, save it locally, feed it to Chaojiying
# for recognition and return the result
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]   # xpath returns a list
img_data = requests.get(img_src, headers=headers).content   # .content fetches the image bytes
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)
tranformImgData('./code.jpg', 1004)   # pass in the image path and type code; returns the recognised text
# Use the captcha above for a simulated login
s = requests.Session()
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = s.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# The cookie is produced when the captcha image is requested, so this request
# does two things: 1. produce the cookie, 2. download the image
img_data = s.get(img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# Dynamically capture the changing form parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# Recognise the captcha with Chaojiying (as set up above)
code_text = tranformImgData('./code.jpg', 1004)
print(code_text)   # check whether it looks right

# login_url is the POST target of the login button
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登陆',
}
page_text = s.post(url=login_url, headers=headers, data=data).text
with open('login.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Coroutines:
If a function (a "special" function) is defined with async, calling it returns a coroutine object and the statements inside the function are not executed immediately.
Task objects
A task object is a further wrapper around a coroutine object: task object == higher-level coroutine object == special function.
A task object must be registered with an event loop object.
Binding a callback to a task object: in a crawler, this is where data parsing happens.
Event loop
Think of it as a container; it must hold task objects;
once the event loop object is started, it executes the task objects stored inside it asynchronously.
aiohttp: a module that supports asynchronous network requests.
import asyncio


def callback(task):
    # Callback bound to the task object; task.result() receives the return
    # value of the special function
    print('i am callback and', task.result())


async def test():
    print('i am test()')
    return 'bobo'


c = test()                          # coroutine object
task = asyncio.ensure_future(c)     # wrap it into a task object (a further wrapper around the coroutine)
task.add_done_callback(callback)    # bind the callback to the task object
loop = asyncio.get_event_loop()     # create an event loop object
loop.run_until_complete(task)       # register the task object with the event loop and run it
import asyncio
import time

start = time.time()


# Code from modules that do not support async must not appear inside a special function
async def get_request(url):
    await asyncio.sleep(2)          # time.sleep() would block and break the async flow
    print('downloaded:', url)


urls = [
    'www.1.com',
    'www.2.com'
]
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)   # create a task object
    # callbacks for the multiple tasks could be bound here
    tasks.append(task)

loop = asyncio.get_event_loop()       # create an event loop object
# Note: suspending has to be handled explicitly with asyncio.wait
loop.run_until_complete(asyncio.wait(tasks))   # register the tasks with the event loop and start it
print(time.time() - start)
import requests
import aiohttp
import time
import asyncio

s = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay'
]


# async def get_request(url):
#     page_text = requests.get(url).text
#     return page_text


# Use aiohttp to send the requests: it supports async, requests does not
async def get_request(url):
    async with aiohttp.ClientSession() as session:
        # Send a GET request; detail: put async before every with and await
        # before every blocking operation
        async with await session.get(url=url) as response:
            page_text = await response.text()
            print(page_text)
            return page_text


tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)   # wrap into a task object
    tasks.append(task)

loop = asyncio.get_event_loop()       # create an event loop object
loop.run_until_complete(asyncio.wait(tasks))   # register the tasks with the event loop and start it
print(time.time() - s)
import aiohttp
import asyncio
import time
from lxml import etree

start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom'
]


# The special function: send the request and capture the response data
# Detail: put async before every with and await before every blocking operation
async def get_request(url):
    async with aiohttp.ClientSession() as session:   # requests cannot send async requests, so use aiohttp
        # session.get(url, headers=headers, proxy="http://ip:port", params=params)
        async with await session.get(url) as response:
            page_text = await response.text()        # read() would return byte data instead
            return page_text


# Callback (an ordinary function) used for data parsing
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    parse_data = tree.xpath('//li/text()')
    print(parse_data)


# Multiple tasks
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)   # wrap into a task object
    task.add_done_callback(parse)     # the callback only runs after the task has finished
    tasks.append(task)

# Register the tasks with the event loop and start it (wait suspends the tasks)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
Concept: a module built on browser automation.
How it relates to crawling: it makes it easy to capture dynamically loaded data (whatever you can see, you can get) and to implement simulated logins; its drawback is that it is slow.
Installation: pip install selenium
Basic usage: get the driver program for your browser (the driver version must map to the browser version), then instantiate a browser object.
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.jd.com/')
sleep(1)
# Locate the search box and type into it
search_input = bro.find_element_by_id('key')
search_input.send_keys('mac pro')
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# Execute JavaScript: scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
page_text = bro.page_source
print(page_text)
sleep(2)
bro.quit()
from selenium import webdriver
from time import sleep
from lxml import etree

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('http://scxk.nmpa.gov.cn:81/xk/')
sleep(2)
page_text = bro.page_source
page_text_list = [page_text]
for i in range(3):
    bro.find_element_by_id('pageIto_next').click()   # click "next page"
    sleep(2)
    page_text_list.append(bro.page_source)
for page_text in page_text_list:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@id="gzlist"]/li')
    for li in li_list:
        title = li.xpath('./dl/@title')[0]
        num = li.xpath('./ol/@title')[0]
        print(title, num)
sleep(2)
bro.quit()
Action chains:
A sequence of consecutive actions.
When locating tags: if the tag you want sits inside an iframe, you must first perform a fixed step, bro.switch_to.frame('id').
If another iframe is nested inside, you have to switch into that one as well.
from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-example-draggable')
# The draggable element lives inside an iframe, so switch into it first
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')
print(div_tag)

# Dragging = clicking + holding + moving
action = ActionChains(bro)
action.click_and_hold(div_tag)          # click and hold
for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 5).perform()
    sleep(0.5)
action.release()                        # release the action
sleep(3)
bro.quit()
# Simulated login to 12306
from selenium import webdriver
from time import sleep
from PIL import Image
from selenium.webdriver import ActionChains
from Cjy import Chaojiying_Client

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/login/init')
sleep(5)
bro.save_screenshot('main.png')   # the screenshot must be saved as .png

code_img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
location = code_img_tag.location
size = code_img_tag.size
print(location, type(location))
print(size)
# The region of the screenshot to crop
rangle = (int(location['x']), int(location['y']),
          int(location['x'] + size['width']), int(location['y'] + size['height']))
print(rangle)
# Crop out the captcha image
i = Image.open('./main.png')
frame = i.crop(rangle)
frame.save('code.png')


def get_text(imgPath, imgType):
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, imgType)['pic_str']


# The result looks like '55,70|267,133', i.e. it maps to [[55,70],[267,133]]
result = get_text('./code.png', 9004)
all_list = []
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)

# Click each returned coordinate relative to the captcha image
for a in all_list:
    x = a[0]
    y = a[1]
    ActionChains(bro).move_to_element_with_offset(code_img_tag, x, y).click().perform()
    sleep(1)

bro.find_element_by_id('username').send_keys('123456')
sleep(1)
bro.find_element_by_id('password').send_keys('67890000000')
sleep(1)
bro.find_element_by_id('loginSub').click()
sleep(5)
bro.quit()
Headless browsers: browsers with no visible UI. PhantomJS is no longer maintained,
so Chrome's headless mode is used instead; the second snippet below also shows how to make selenium evade detection.
from selenium import webdriver
from time import sleep
# Copy and paste this import when needed
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# The path that follows is the location of your browser driver; the r prefix
# prevents escape-sequence interpretation
driver = webdriver.Chrome(r'chromedriver.exe', chrome_options=chrome_options)
driver.get('https://www.cnblogs.com/')
print(driver.page_source)

# How to stop selenium from being detected
# To check whether the evasion works, type window.navigator.webdriver in the
# browser console: undefined means the crawler is effective, true means the
# site has detected it
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from time import sleep

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(r'chromedriver.exe', options=option)
driver.get('https://www.taobao.com/')