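The get method (requests.get in the code below) is a blocking call: the calling thread waits until the response comes back.
Approaches to asynchronous crawling:
- Multithreading / multiprocessing (not recommended)
Pros: each blocking operation can be given its own thread or process, so the operations run asynchronously.
Cons: threads or processes cannot be created without limit.
- Thread pool / process pool (use in moderation)
Pros: reduces how often the system creates and destroys threads or processes, which lowers system overhead.
Cons: the number of threads or processes in the pool has an upper limit (when the number of blocking operations far exceeds the pool size, the efficiency gain is not significant).
Principle: give the pool operations that are both blocking and time-consuming.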
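The first approach in the list above, one thread per blocking operation, might look like the sketch below. It is an illustration rather than part of the original notes: blocking_task is a hypothetical stand-in that sleeps instead of making a real request, and the task names are made up.

```python
import threading
import time


def blocking_task(name, results):
    # Hypothetical stand-in for a blocking call such as requests.get
    time.sleep(0.2)
    results.append(name)


def run_one_thread_per_task(names):
    """Start one thread per task and wait for all of them to finish."""
    results = []
    threads = [
        threading.Thread(target=blocking_task, args=(name, results))
        for name in names
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_one_thread_per_task(['a', 'b', 'c', 'd'])
    print(sorted(results), "finished in %.2f s" % elapsed)
```

Because every task gets its own thread, the number of threads grows with the number of tasks, which is the drawback noted in the list.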
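Basic use of a thread pool:
from multiprocessing.dummy import Pool
import time

start_time = time.time()

def f1(name):
    # Simulate a blocking operation that takes 2 seconds
    print("%s is running" % name)
    time.sleep(2)
    print("%s running done" % name)

name_list = ['a', 'b', 'c', 'd']
# Instantiate a thread pool with 4 worker threads
pool = Pool(4)
# pool.map(func, iterable) calls func once for each item in iterable
pool.map(f1, name_list)
pool.close()
pool.join()
# Total elapsed time: about 2 seconds, not 8, because the four tasks run concurrently
print(time.time() - start_time)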
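To make the pool's upper limit concrete, the sketch below runs more tasks than there are workers and records how many tasks were active at the same time. It is a minimal illustration under assumptions: run_with_pool, the sleep-based task, and the 0.1 second delay are hypothetical and not from the original notes.

```python
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing API
import threading
import time


def run_with_pool(names, size):
    """Run one simulated blocking task per name on a pool of `size` threads.

    Returns the task results and the peak number of tasks seen running at once.
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def blocking_task(name):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.1)  # hypothetical stand-in for a blocking request
        with lock:
            state["active"] -= 1
        return name.upper()

    pool = Pool(size)
    try:
        results = pool.map(blocking_task, names)  # results keep the input order
    finally:
        pool.close()
        pool.join()
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = run_with_pool(list("abcdefgh"), 2)
    print(results, "peak concurrency:", peak)
```

With eight tasks and two workers, the tasks are processed in batches, so the total time grows with the ratio of tasks to pool size.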
Thread pool case study:
- Pear Video (梨视频): download the most popular videos in the Life section
from multiprocessing.dummy import Pool
import requests
import re
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'
}
url = 'https://www.pearvideo.com/category_5'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//li[@class="categoryem"]')
urls = []  # names and URLs of all videos
for li in li_list:
    detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
    name = li.xpath(".//div[@class='vervideo-title']/text()")[0] + ".mp4"
    res = requests.get(url=detail_url, headers=headers).text
    # The video address is dynamically loaded data inside a script tag,
    # which the XPath queries do not match, so use a regular expression instead
    ex = 'srcUrl="(.*?)",vdoUrl'
    video_url = re.findall(ex, res)[0]
    dic = {
        'name': name,
        'url': video_url
    }
    urls.append(dic)

def f1(dic):
    # Blocking and time-consuming: download one video and save it to disk
    print(dic['name'], "downloading")
    video_content = requests.get(url=dic['url'], headers=headers).content
    # Persist the video to a file
    with open(dic['name'], 'wb') as f:
        f.write(video_content)
    print(dic['name'], "downloaded")

pool = Pool(5)
pool.map(f1, urls)
pool.close()
pool.join()
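One fragile spot in the case study is the [0] index on re.findall, which raises IndexError when a detail page contains no match for the pattern. A possible guard is sketched below. The helper name extract_video_url and the sample strings are hypothetical; only the regular expression comes from the code above.

```python
import re

# Same pattern as in the case study, compiled once
SRC_URL_PATTERN = re.compile(r'srcUrl="(.*?)",vdoUrl')


def extract_video_url(page_text):
    """Return the first srcUrl value found in page_text, or None if absent."""
    match = SRC_URL_PATTERN.search(page_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up snippets that only imitate the shape the pattern expects
    sample = 'srcUrl="https://example.com/demo.mp4",vdoUrl=srcUrl'
    print(extract_video_url(sample))           # https://example.com/demo.mp4
    print(extract_video_url('<html></html>'))  # None
```

In the loop, an entry whose URL comes back as None could then be skipped instead of stopping the whole run.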