Web crawler basics: usage examples, practical tips, a summary of the key points, and things to watch out for
Crawling:
+ requests + Scrapy
Data analysis + machine learning
+ numpy, pandas, matplotlib
Jupyter:
+ Start it: cd into the folder you want to work in, then run jupyter notebook
Cells come in different modes (Code: for writing code, Markdown: for writing notes)
Jupyter shortcuts:
Add a cell: a, b (a inserts above, b inserts below)
Delete a cell: x
Run: Shift+Enter (run and move the cursor to the next cell), Ctrl+Enter (run and keep the cursor in the current cell)
Tab: auto-complete. Switch a cell's mode: m (to Markdown), y (to Code)
Open the help tooltip: Shift+Tab
1. What is a crawler?
The process of writing a program that simulates a browser and then sends it off to crawl data from the internet.
2. Types of crawlers:
General-purpose crawler: fetches entire pages from the internet.
Focused crawler: extracts only part of the data on a page.
Incremental crawler: monitors a site for updates so that only the newly published data gets crawled.
3. Anti-crawling mechanisms: the measures a site uses to block crawlers (e.g. User-Agent checks, the robots protocol, captchas).
4. Counter-anti-crawling strategies: the techniques a crawler uses to get around those measures (e.g. UA spoofing, proxies, cookie handling); a minimal UA-spoofing sketch follows below.
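A minimal sketch of the most common counter-strategy, UA spoofing with requests, which is worked through in detail later in these notes; the target URL here is only a placeholder.

import requests

# Forged User-Agent: without it, requests identifies itself as python-requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get('https://www.sogou.com/', headers=headers)   # placeholder URL
print(response.status_code)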
5. Is crawling legal?
5.1 Where the risk in crawling shows up:
The crawler interferes with the normal operation of the site being visited;
The crawler collects specific types of data or information that are protected by law.
5.2 Mitigating the risk:
Strictly follow the robots protocol the site has published;
While working around anti-crawling measures, optimize your own code so that it does not disturb the normal operation of the site being visited;
When using or redistributing the crawled content, review it first; if any of it turns out to be users' personal information, private data, or someone else's trade secrets, stop immediately and delete it.
6. The robots protocol:
A plain-text convention: it keeps honest people out but does nothing against the dishonest, i.e. it is advisory rather than technically enforced (a quick check is sketched below).
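A minimal sketch, using only the standard library's urllib.robotparser, of how a well-behaved crawler could check robots.txt before fetching a URL; the site and User-Agent used here are just examples.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.sogou.com/robots.txt')   # example site
rp.read()
# Ask whether a given User-Agent is allowed to fetch a given URL
print(rp.can_fetch('*', 'https://www.sogou.com/web?query=python'))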
What is the requests module? A ready-made Python module for sending network requests.
What is it for? Simulating a browser sending requests.
Installing it: pip install requests
The coding workflow with requests: specify the URL, send the request, get the response data, persist it.
import requests

# 1. Specify the URL
url = 'https://www.sogou.com/'
# 2. Send a GET request: get() returns a response object
response = requests.get(url=url)
# 3. Get the response data (.text is the response body as a string)
page_text = response.text
# 4. Persist it
with open('sogou.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Making the URL parameters dynamic:
import requests

url = 'https://www.sogou.com/web'
# Build the parameters dynamically
wd = input('enter a key:')
params = {
    'query': wd
}
# The dict of request parameters is passed to the params argument of get()
response = requests.get(url=url, params=params)
page_text = response.text
file_name = wd + '.html'
with open(file_name, encoding='utf-8', mode='w') as fp:
    fp.write(page_text)
The same request, this time fixing garbled characters by setting the response encoding:
import requests

url = 'https://www.sogou.com/web'
wd = input('enter a key')
params = {
    'query': wd
}
response = requests.get(url=url, params=params)
response.encoding = 'utf-8'   # set the encoding before reading .text
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
The same request again, now with UA spoofing via the headers argument:
import requests

url = 'https://www.sogou.com/web'
wd = input('enter a key')
params = {
    'query': wd
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=headers)
response.encoding = 'utf-8'
page_text = response.text
filename = wd + '.html'
with open(filename, mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Dynamically loaded page data is fetched by a separate request:
import requests

url = 'https://movie.douban.com/j/chart/top_list'
start = input('start of the movie range: ')
end = input('end of the movie range: ')
dic = {
    'type': '13',
    'interval_id': '100:90',
    'action': '',
    'start': start,
    'end': end
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
response = requests.get(url=url, params=dic, headers=headers)
page_text = response.json()   # .json() returns the deserialized response (here a list of dicts)
for dic in page_text:
    print(dic['title'] + dic['score'])
POST request with form data (KFC store list, paging through the results):
import requests

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
site = input('Enter a location >>')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
for page in range(1, 5):
    data = {
        'cname': '',
        'pid': '',
        'keyword': site,
        'pageIndex': str(page),   # use the loop variable so each page is actually requested
        'pageSize': '10'
    }
    response = requests.post(url=url, data=data, headers=headers)
    print(response.json())
What data parsing is for: it is what makes a focused crawler possible.
Ways to implement it: regular expressions, bs4, xpath, pyquery.
The general principle of data parsing:
1. The data a crawler wants is stored inside tags and in tag attributes
2. Locate the tag
3. Extract its text or its attributes
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
# 1. Fetching byte-type data (how to download an image)
url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
img_data = requests.get(url=url).content   # use .content for byte data
with open('./img.jpg', mode='wb') as fp:
    fp.write(img_data)

# Alternative: urllib (drawback: no UA spoofing)
from urllib import request
# url = 'https://pic.qiushibaike.com/system/pictures/12223/122231866/medium/IZ3H2HQN8W52V135.jpg'
# request.urlretrieve(url, filename='./qutu.jpg')
import os
import re

# Crawl all the images from pages 1-3 of the qiushibaike image board
# 1. Use a general crawl to fetch the page source of the first 3 pages
# Generic URL template (do not change it)
dirName = './imgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = 'https://www.qiushibaike.com/imgrank/page/%d/'
# 2. Download the images
for page in range(1, 4):   # pages 1-3
    new_url = format(url % page)
    page_text = requests.get(url=new_url, headers=headers).text   # page source for each page number
    ex = '<div class="thumb">.*?<img src="(.*?)".*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    for src in img_src_list:
        src = 'https:' + src
        img_name = src.split('/')[-1]
        img_path = dirName + '/' + img_name   # ./imgLibs/xxxx.jpg
        request.urlretrieve(src, filename=img_path)
        print(img_name, 'downloaded')
bs4 parsing; how it works:
Instantiate a BeautifulSoup object and load the page source to be parsed into it
Call the BeautifulSoup object's methods and attributes to locate tags and extract data
Installation:
pip install bs4
pip install lxml
Instantiating BeautifulSoup:
BeautifulSoup(fp,'lxml'): loads the contents of a locally stored HTML file into the BeautifulSoup object
BeautifulSoup(page_text,'lxml'): loads page source fetched from the internet into the BeautifulSoup object
Locating tags:
soup.tagName: locates the first occurrence of tagName
Attribute-based: soup.find('tagName',attrName='value')
Attribute-based: soup.find_all('tagName',attrName='value'), returns a list
Selector-based: soup.select('selector'), returns a list
Hierarchical selectors: > means one level, a space means any number of levels
Getting text:
.string: gets only the direct text content
.text: gets all the text content
Getting attributes:
tagName['attrName']
Locating tags:
from bs4 import BeautifulSoup

fp = open('./test.html', mode='r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
print(soup.div)                             # first div in the document
# find-related
print(soup.find('div', class_='song'))      # only the class keyword needs the trailing underscore
print(soup.find('a', id='feng'))
print(soup.find_all('div', class_='song'))  # returns a list
# select-related
print(soup.select('#feng'))                 # returns a list
print(soup.select('.tang > ul > li'))       # returns a list; > means one level
print(soup.select('.tang li'))              # returns a list; a space means any number of levels

# Getting text
a_tag = soup.select('#feng')[0]
print(a_tag.text)
div = soup.div
print(div.string)                           # only the direct text content
div = soup.find('div', class_='song')
print(div.string)

# Getting attributes
a_tag = soup.select('#feng')[0]
print(a_tag['href'])
Crawling the whole of Romance of the Three Kingdoms (chapter titles + chapter content) from http://www.shicimingju.com/book/sanguoyanyi.html:
fp = open('./sanguo.txt', mode='w', encoding='utf-8')
main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=main_url, headers=headers).text
soup1 = BeautifulSoup(page_text, 'lxml')
title_list = soup1.select('.book-mulu > ul > li > a')
for page in title_list:
    title = page.string
    title_url = 'https://www.shicimingju.com' + page['href']
    title_text = requests.get(url=title_url, headers=headers).text
    # Parse the chapter content out of the detail page
    soup = BeautifulSoup(title_text, 'lxml')
    content = soup.find('div', class_='chapter_content').text
    fp.write(title + ':' + content + '\n')
    print(f'{title} downloaded')
fp.close()
How xpath parsing works:
1. Instantiate an etree object and load the page source to be parsed into it
2. Call the etree object's xpath method with different forms of xpath expressions to locate tags and extract data
Installation:
pip install lxml (etree is part of the lxml package)
Instantiating the object:
etree.parse('test.html') # local file
etree.HTML(page_text) # page fetched from the internet
xpath expressions: the xpath method always returns a list
A leading / means the expression must locate the tag level by level starting from the root
A leading // means the expression can locate the tag from anywhere in the document
A non-leading / means one level
A non-leading // means skipping across multiple levels
Attribute-based: //tagName[@attrName="value"]
Index-based: //tagName[index], indices start at 1
Getting text: /text() gets the direct text content, //text() gets all the text content
Getting attributes: /@attrName
from lxml import etree

tree = etree.parse('./test.html')
# Locating tags
print(tree.xpath('/html/head/title'))
print(tree.xpath('//title'))
print(tree.xpath('/html/body//p'))
print(tree.xpath('//p'))
# Attribute- and index-based location
print(tree.xpath('//div[@class="song"]'))
print(tree.xpath('//li[3]'))                        # returns element objects
# Getting text
print(tree.xpath('//a[@id="feng"]/text()')[0])      # xpath returns a list
print(tree.xpath('//div[@class="song"]//text()'))   # returns a list
# Getting attributes
print(tree.xpath('//a[@id="feng"]/@href'))          # returns a list
# Crawl the joke text and author names from the qiushibaike text board
url = 'https://www.qiushibaike.com/text/'
page_text = requests.get(url, headers=headers).text
# Parse the content
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="content-left"]/div')
for div in div_list:
    author = div.xpath('./div[1]/a[2]/h2/text()')[0]   # local parsing relative to the current div
    content = div.xpath('./a[1]/div/span//text()')
    content = ''.join(content)
    print(author, content)
# https://www.aqistudy.cn/historydata/  crawl all the city names
url = 'https://www.aqistudy.cn/historydata/'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
print(tree)
city_list1 = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
print(city_list1)
city_list2 = tree.xpath('//ul[@class="unstyled"]//li/a/text()')
print(city_list2)
# Use | to make the xpath expression more general: whichever sub-expression
# matches contributes results, and if both match, both sets are returned
cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text() | //ul[@class="unstyled"]//li/a/text()')
print(cities)
# http://pic.netbian.com/4kmeinv/  handling garbled Chinese characters
dirName = './meinvLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1, 11):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        new_url = format(url % page)
    page_text = requests.get(new_url, headers=headers).text
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    for a in a_list:
        img_src = 'http://pic.netbian.com' + a.xpath('./img/@src')[0]
        img_name = a.xpath('./b/text()')[0]
        img_name = img_name.encode('iso-8859-1').decode('gbk')   # re-encode and decode the garbled part
        img_data = requests.get(img_src, headers=headers).content
        imgPath = dirName + '/' + img_name + '.jpg'
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded!!!')
HTTPConnectionPool errors: the usual causes are sending high-frequency requests from a single IP (the site blocks it) or exhausting the connection pool; the usual fixes are adding 'Connection': 'close' to the headers and routing requests through proxies.
Proxy: a proxy server that accepts your request and forwards it on your behalf.
Anonymity levels: transparent, anonymous, high-anonymity (elite).
Types: http, https.
Free proxy sources: www.goubanjia.com, Kuaidaili (快代理), Xici (西祠).
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': 'close'
}
# url = 'https://www.baidu.com/s?wd=ip'
url = 'http://ip.chinaz.com/'
page_text = requests.get(url=url, headers=headers, proxies={'http': '123.169.122.111:9999'}).text
with open('./ip.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
import random

# A pool of proxies; pick one at random for each request
proxy_list = [
    {'https': '121.231.94.44:8888'},
    {'https': '131.231.94.44:8888'},
    {'https': '141.231.94.44:8888'}
]
url = 'https://www.baidu.com/s?wd=ip'
page_text = requests.get(url=url, headers=headers, proxies=random.choice(proxy_list)).text
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
from lxml import etree

# Pull a batch of proxy IPs from the provider's API
ip_url = 'http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=4&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(ip_url, headers=headers).text
tree = etree.HTML(page_text)
ip_list = tree.xpath('//body//text()')
print(ip_list)
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Connection': 'close'
}
# url = 'https://www.xicidaili.com/nn/%d'  # Xici proxy (site is gone)
url = 'https://www.kuaidaili.com/free/inha/%d/'
proxy_list_http = []
proxy_list_https = []
for page in range(1, 20):
    new_url = format(url % page)
    ip_port = random.choice(ip_list)
    page_text = requests.get(new_url, headers=headers, proxies={'https': ip_port}).text
    tree = etree.HTML(page_text)
    # tbody must not appear in the xpath expression; xpath indices start at 1
    tr_list = tree.xpath('//*[@id="list"]/table//tr')[1:]
    for tr in tr_list:
        ip = tr.xpath('./td[1]/text()')[0]       # xpath returns a list
        port = tr.xpath('./td[2]/text()')[0]
        t_type = tr.xpath('./td[4]/text()')[0]
        ips = ip + ':' + port
        dic = {t_type: ips}
        if t_type == 'HTTP':
            proxy_list_http.append(dic)
        else:
            proxy_list_https.append(dic)
print(len(proxy_list_http), len(proxy_list_https))
# Check which of the collected proxies actually work
for dic in proxy_list_http:
    ip = list(dic.values())[0]
    response = requests.get('https://www.sogou.com', headers=headers, proxies={'https': ip})
    if response.status_code == 200:
        print('found a usable ip:', ip)
Handling cookies:
Manual handling: put the cookie into the headers yourself.
Automatic handling: the session object. You can create a session object that sends requests just like requests does;
the difference is that any cookie produced while sending requests through the session is stored in the session object automatically.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
    'Cookie': 'device_id=24700f9f1986800ab4fcc880530dd0ed; xq_a_token=db48cfe87b71562f38e03269b22f459d974aa8ae; xqat=db48cfe87b71562f38e03269b22f459d974aa8ae; xq_r_token=500b4e3d30d8b8237cdcf62998edbf723842f73a; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYwNjk2MzA1MCwiY3RtIjoxNjA1NTM1Mjc2NzYxLCJjaWQiOiJkOWQwbjRBWnVwIn0.PhEaPnWolUZRgyuOY-QO04Bn_A_HYU46Hm54_kWBxa8IZ6cFw20trOr7rKp7XztprxEFc7fkMN2_5abfh1TUyyFKqTDn7IfoThXyJ2lJCnH33q1q-K9BclYvLHrLGqt8jQ3YOJi7-nyiSb5ZTNk7TLEhiFfsbXaZK9evNrt7W65MdxoEWyCcGjbhI5znffRxDDLHD9511bd9upY9CUGbf4SHQwwx4PxyQqdy9j5bgqPN6rsuHoCvjcr42DZYRd8B72uQTkFs-Lnru4AFxt4o4gdaxPo_Qd_IqzCrXnwoLtCdX6n4NKV44SryBttE0SKQC6UbqC35PwN-JqPeWCHKpQ; u=201605535281005; Hm_lvt_1db88642e346389874251b5a1eded6e3=1605354060,1605411081,1605535282; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1605535282'
}
params = {
    'status_id': '163425862',
    'page': '1',
    'size': '14'
}
url = 'https://xueqiu.com/statuses/reward/list_by_user.json?status_id=163425862&page=1&size=14'
page_text = requests.get(url=url, headers=headers, params=params).json()
print(page_text)
session = requests.Session()
# Automatic cookie handling: the homepage's cookie is stored in the session and
# reused when crawling the other pages later on
session.get('https://xueqiu.com/', headers=headers)
url = 'https://xueqiu.com/statuses/reward/list_by_user.json?status_id=163425862&page=1&size=14'
page_text = session.get(url=url, headers=headers).json()
print(page_text)
import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        # Username, password and software id of your Chaojiying account
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                          data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the image ID of a wrongly recognised captcha
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php',
                          data=params, headers=self.headers)
        return r.json()
def tranformImgData(imgPath, t_type):
    # imgPath: path of the captcha image; t_type: the captcha's type code
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')   # your registered Chaojiying username, password and software id
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, t_type)['pic_str']


# Fetch the captcha image from gushiwen, save it locally, feed it to Chaojiying
# for recognition and return the result
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]   # xpath returns a list
img_data = requests.get(img_src, headers=headers).content   # .content fetches the image bytes
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)
tranformImgData('./code.jpg', 1004)   # pass in the image path and type code; returns the recognised text
# Use the captcha above for a simulated login
s = requests.Session()
url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = s.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]
# The cookie is produced when the captcha image is requested, so this request
# does two things: 1. produce the cookie, 2. download the image
img_data = s.get(img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# Dynamically capture the changing form parameters
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]

# Recognise the captcha with Chaojiying (as set up above)
code_text = tranformImgData('./code.jpg', 1004)
print(code_text)   # check whether it looks right

# login_url is the POST target of the login button
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http%3a%2f%2fso.gushiwen.org%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': __VIEWSTATE,
    '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'www.zhangbowudi@qq.com',
    'pwd': 'bobo328410948',
    'code': code_text,
    'denglu': '登陆',
}
page_text = s.post(url=login_url, headers=headers, data=data).text
with open('login.html', mode='w', encoding='utf-8') as fp:
    fp.write(page_text)
Coroutines:
If a function (a "special" function) is defined with async, calling it returns a coroutine object and the statements inside the function are not executed immediately.
Task objects
A task object is a further wrapper around a coroutine object: task object == higher-level coroutine object == special function.
A task object must be registered with an event loop object.
Binding a callback to a task object: in a crawler, this is where data parsing happens.
Event loop
Think of it as a container; it must hold task objects;
once the event loop object is started, it executes the task objects stored inside it asynchronously.
aiohttp: a module that supports asynchronous network requests.
import asyncio


def callback(task):
    # Callback bound to the task object; task.result() receives the return
    # value of the special function
    print('i am callback and', task.result())


async def test():
    print('i am test()')
    return 'bobo'


c = test()                          # coroutine object
task = asyncio.ensure_future(c)     # wrap it into a task object (a further wrapper around the coroutine)
task.add_done_callback(callback)    # bind the callback to the task object
loop = asyncio.get_event_loop()     # create an event loop object
loop.run_until_complete(task)       # register the task object with the event loop and run it
import asyncio
import time

start = time.time()


# Code from modules that do not support async must not appear inside a special function
async def get_request(url):
    await asyncio.sleep(2)          # time.sleep() would block and break the async flow
    print('downloaded:', url)


urls = [
    'www.1.com',
    'www.2.com'
]
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)   # create a task object
    # callbacks for the multiple tasks could be bound here
    tasks.append(task)

loop = asyncio.get_event_loop()       # create an event loop object
# Note: suspending has to be handled explicitly with asyncio.wait
loop.run_until_complete(asyncio.wait(tasks))   # register the tasks with the event loop and start it
print(time.time() - start)
import requests
import aiohttp
import time
import asyncio

s = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay'
]


# async def get_request(url):
#     page_text = requests.get(url).text
#     return page_text


# Use aiohttp to send the requests: it supports async, requests does not
async def get_request(url):
    async with aiohttp.ClientSession() as session:
        # Send a GET request; detail: put async before every with and await
        # before every blocking operation
        async with await session.get(url=url) as response:
            page_text = await response.text()
            print(page_text)
            return page_text


tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)   # wrap into a task object
    tasks.append(task)

loop = asyncio.get_event_loop()       # create an event loop object
loop.run_until_complete(asyncio.wait(tasks))   # register the tasks with the event loop and start it
print(time.time() - s)
import aiohttp
import asyncio
import time
from lxml import etree

start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom'
]


# The special function: send the request and capture the response data
# Detail: put async before every with and await before every blocking operation
async def get_request(url):
    async with aiohttp.ClientSession() as session:   # requests cannot send async requests, so use aiohttp
        # session.get(url, headers=headers, proxy="http://ip:port", params=params)
        async with await session.get(url) as response:
            page_text = await response.text()        # read() would return byte data instead
            return page_text


# Callback (an ordinary function) used for data parsing
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    parse_data = tree.xpath('//li/text()')
    print(parse_data)


# Multiple tasks
tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)   # wrap into a task object
    task.add_done_callback(parse)     # the callback only runs after the task has finished
    tasks.append(task)

# Register the tasks with the event loop and start it (wait suspends the tasks)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
Concept: a module built on browser automation.
How it relates to crawling: it makes it easy to capture dynamically loaded data (whatever you can see, you can get) and to implement simulated logins; its drawback is that it is slow.
Installation: pip install selenium
Basic usage: get the driver program for your browser (the driver version must map to the browser version), then instantiate a browser object.
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.jd.com/')
sleep(1)
# Locate the search box and type into it
search_input = bro.find_element_by_id('key')
search_input.send_keys('mac pro')
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# Execute JavaScript: scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
page_text = bro.page_source
print(page_text)
sleep(2)
bro.quit()
from selenium import webdriver
from time import sleep
from lxml import etree

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('http://scxk.nmpa.gov.cn:81/xk/')
sleep(2)
page_text = bro.page_source
page_text_list = [page_text]
for i in range(3):
    bro.find_element_by_id('pageIto_next').click()   # click "next page"
    sleep(2)
    page_text_list.append(bro.page_source)
for page_text in page_text_list:
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@id="gzlist"]/li')
    for li in li_list:
        title = li.xpath('./dl/@title')[0]
        num = li.xpath('./ol/@title')[0]
        print(title, num)
sleep(2)
bro.quit()
Action chains:
A sequence of consecutive actions.
When locating tags: if the tag you want sits inside an iframe, you must first perform a fixed step, bro.switch_to.frame('id').
If another iframe is nested inside, you have to switch into that one as well.
from selenium import webdriver
from selenium.webdriver import ActionChains
from time import sleep

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-example-draggable')
# The draggable element lives inside an iframe, so switch into it first
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')
print(div_tag)

# Dragging = clicking + holding + moving
action = ActionChains(bro)
action.click_and_hold(div_tag)          # click and hold
for i in range(5):
    # perform() makes the action chain execute immediately
    action.move_by_offset(17, 5).perform()
    sleep(0.5)
action.release()                        # release the action
sleep(3)
bro.quit()
# Simulated login to 12306
from selenium import webdriver
from time import sleep
from PIL import Image
from selenium.webdriver import ActionChains
from Cjy import Chaojiying_Client

bro = webdriver.Chrome(executable_path='chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/login/init')
sleep(5)
bro.save_screenshot('main.png')   # the screenshot must be saved as .png

code_img_tag = bro.find_element_by_xpath('//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
location = code_img_tag.location
size = code_img_tag.size
print(location, type(location))
print(size)
# The region of the screenshot to crop
rangle = (int(location['x']), int(location['y']),
          int(location['x'] + size['width']), int(location['y'] + size['height']))
print(rangle)
# Crop out the captcha image
i = Image.open('./main.png')
frame = i.crop(rangle)
frame.save('code.png')


def get_text(imgPath, imgType):
    chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, imgType)['pic_str']


# The result looks like '55,70|267,133', i.e. it maps to [[55,70],[267,133]]
result = get_text('./code.png', 9004)
all_list = []
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)

# Click each returned coordinate relative to the captcha image
for a in all_list:
    x = a[0]
    y = a[1]
    ActionChains(bro).move_to_element_with_offset(code_img_tag, x, y).click().perform()
    sleep(1)

bro.find_element_by_id('username').send_keys('123456')
sleep(1)
bro.find_element_by_id('password').send_keys('67890000000')
sleep(1)
bro.find_element_by_id('loginSub').click()
sleep(5)
bro.quit()
Headless browsers: browsers with no visible UI. PhantomJS is no longer maintained,
so Chrome's headless mode is used instead; the second snippet below also shows how to make selenium evade detection.
from selenium import webdriver
from time import sleep
# Copy and paste this import when needed
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# The path that follows is the location of your browser driver; the r prefix
# prevents escape-sequence interpretation
driver = webdriver.Chrome(r'chromedriver.exe', chrome_options=chrome_options)
driver.get('https://www.cnblogs.com/')
print(driver.page_source)

# How to stop selenium from being detected
# To check whether the evasion works, type window.navigator.webdriver in the
# browser console: undefined means the crawler is effective, true means the
# site has detected it
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from time import sleep

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = webdriver.Chrome(r'chromedriver.exe', options=option)
driver.get('https://www.taobao.com/')