Python Crawlers | Multithreading, Multiprocessing, and Coroutines

To the operating system, a task is a process (Process). Opening a browser starts a browser process; opening Notepad starts a Notepad process; opening two Notepad windows starts two Notepad processes; opening Word starts a Word process.

Some processes do more than one thing at a time. Word, for example, can handle typing, spell checking, and printing simultaneously. To do several things at once inside a single process, the process must run multiple "subtasks"; these subtasks within a process are called threads (Thread).

 

Differences between processes, threads, and coroutines

The biggest advantage of the multiprocess model is stability: if one child process crashes, it does not affect the master process or the other children. (Of course, if the master process dies, everything dies; but the master only dispatches tasks, so the odds of that are low.) The famous Apache server originally used the multiprocess model.

The drawback of the multiprocess model is that creating a process is expensive. On Unix/Linux a fork call is tolerable, but on Windows process creation carries a heavy cost. The operating system can also only run a limited number of processes at once: within memory and CPU limits, if several thousand processes run simultaneously, the OS struggles even to schedule them.
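The Unix-vs-Windows cost gap comes from the process start method: `fork` clones the running interpreter cheaply, while `spawn` (the only option on Windows) boots a fresh interpreter and re-imports your module. A small standard-library sketch that lists the available methods and explicitly uses the portable but slower `spawn`:

```python
import multiprocessing as mp

def worker(q):
    q.put("hello from child")

if __name__ == "__main__":
    # On Unix, 'fork' clones the parent cheaply; Windows supports only
    # 'spawn', which starts a fresh interpreter and re-imports the main
    # module -- hence the much higher per-process cost noted above.
    print(mp.get_all_start_methods())   # e.g. ['fork', 'spawn', 'forkserver'] on Linux

    ctx = mp.get_context("spawn")       # explicitly pick the slow, portable method
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```

The `if __name__ == '__main__'` guard is mandatory under `spawn`, since the child re-imports the module and would otherwise re-run the process-creation code.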

The multithreaded model is usually a bit faster than the multiprocess one, but not by much, and its fatal flaw is that a crash in any one thread may bring down the entire process, because all threads share the process's memory.

 

Advantages of coroutines:

The biggest advantage is the extremely high execution efficiency of coroutines. Switching between subroutines is not a thread switch but is controlled by the program itself, so there is no thread-switching overhead; compared with multithreading, the more concurrent units there are, the more pronounced the coroutines' performance advantage becomes.

The second big advantage is that coroutines need none of multithreading's locking machinery. With only one thread there are no concurrent-write conflicts on shared variables, so shared state can be managed without locks, just by checking program state, which makes execution much more efficient than multithreading.
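The "no locks needed" point can be seen directly: all coroutines share one thread and only switch at `await` points, so an unguarded read-modify-write is safe as long as it contains no `await`. A minimal sketch:

```python
import asyncio

# 1000 coroutines increment one shared counter with no lock.
# Everything runs in a single thread and control only switches at `await`,
# so the increment below can never be interleaved mid-update.

counter = 0

async def bump():
    global counter
    counter += 1             # no lock needed: no await inside the update
    await asyncio.sleep(0)   # yield to the event loop *after* the update

async def main():
    await asyncio.gather(*[bump() for _ in range(1000)])

asyncio.run(main())
print(counter)               # always 1000
```

The same pattern with OS threads would need a `threading.Lock`, because the scheduler could preempt a thread in the middle of the update.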

 

1. Multiprocessing

Case 01

# Multiprocessing with Pool

from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    p = Pool(5)
    nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # renamed from `list`, which shadows the built-in
    print(p.map(f, nums))  # map applies f to each element

Output: [1, 4, 9, 16, 25, 36, 49, 64, 81]

Case 01-1

# Multiprocessing with Pool

import time
import requests
from multiprocessing import Pool

task_list = [
    'http://bj.maitian.cn/zfall/PG1',
    'http://bj.maitian.cn/zfall/PG2',
    'http://bj.maitian.cn/zfall/PG3',
    'http://bj.maitian.cn/zfall/PG4',
    'http://bj.maitian.cn/zfall/PG5',
    'http://bj.maitian.cn/zfall/PG6',
    'http://bj.maitian.cn/zfall/PG7',
    'http://bj.maitian.cn/zfall/PG8',
    'http://bj.maitian.cn/zfall/PG9',
    'http://bj.maitian.cn/zfall/PG10',
]

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def download(url):
    response = requests.get(url, headers=header, timeout=30)
    return response.status_code

if __name__ == '__main__':
    p = Pool(10)
    time_old = time.time()
    for item in p.map(download, task_list):
        print(item)
    time_new = time.time()
    time_cost = time_new - time_old
    print(time_cost)

 

Case 02

# Multiprocessing with the Process object

from multiprocessing import Process

def f(name):
    print('hello', name)

if __name__ == '__main__':
    p_1 = Process(target=f, args=('bob',))  # note: args is a one-element tuple
    p_1.start()
    p_1.join()

    p_2 = Process(target=f, args=('alice',))
    p_2.start()
    p_2.join()

Output:
hello bob
hello alice

Case 02-1

# Multiprocessing with the Process object

import time
import requests
from multiprocessing import Process

task_list = [
    'http://bj.maitian.cn/zfall/PG1',
    'http://bj.maitian.cn/zfall/PG2',
    'http://bj.maitian.cn/zfall/PG3',
    'http://bj.maitian.cn/zfall/PG4',
    'http://bj.maitian.cn/zfall/PG5',
]

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

def download(url):
    response = requests.get(url, headers=header, timeout=30)
    print(response.status_code)

if __name__ == '__main__':
    processes = []
    for item in task_list:
        p = Process(target=download, args=(item,))
        p.start()  # start every process first ...
        processes.append(p)
    for p in processes:
        p.join()   # ... then join, so the downloads actually run in parallel

2. Multithreading

Case 01

import threading
import time

class myThread(threading.Thread):
    def __init__(self, threadID, name, counter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter

    def run(self):
        print("Starting " + self.name)
        # acquire() returns True once the lock is obtained;
        # without the optional timeout argument it blocks until the
        # lock is available, otherwise it returns False on timeout
        threadLock.acquire()
        print_time(self.name, self.counter, 3)
        # release the lock
        threadLock.release()

def print_time(threadName, delay, counter):
    while counter:
        time.sleep(delay)
        print("%s: %s" % (threadName, time.ctime(time.time())))
        counter -= 1

threadLock = threading.Lock()
threads = []

# create new threads
thread1 = myThread(1, "Thread-1", 1)
thread2 = myThread(2, "Thread-2", 2)

# start the threads
thread1.start()
thread2.start()

# add the threads to the thread list
threads.append(thread1)
threads.append(thread2)

# wait for all threads to finish
for t in threads:
    t.join()
print("Exiting Main Thread")
Case 02

import threadpool
import time

def sayhello(a):
    print("hello: " + a)
    time.sleep(2)

def main():
    seed = ["a", "b", "c"]

    # run the tasks through a thread pool and time it
    start = time.time()
    task_pool = threadpool.ThreadPool(5)
    requests = threadpool.makeRequests(sayhello, seed)
    for req in requests:
        task_pool.putRequest(req)
    task_pool.wait()
    end = time.time()
    time_m = end - start
    print("time: " + str(time_m))

    # run the same tasks sequentially for comparison
    start1 = time.time()
    for each in seed:
        sayhello(each)
    end1 = time.time()
    print("time1: " + str(end1 - start1))

if __name__ == '__main__':
    main()
Case 03

from concurrent.futures import ThreadPoolExecutor
import time

def sayhello(a):
    print("hello: " + a)
    time.sleep(2)

def main():
    seed = ["a", "b", "c"]

    # sequential baseline
    start1 = time.time()
    for each in seed:
        sayhello(each)
    end1 = time.time()
    print("time1: " + str(end1 - start1))

    # thread pool, submitting tasks one by one
    start2 = time.time()
    with ThreadPoolExecutor(3) as executor:
        for each in seed:
            executor.submit(sayhello, each)
    end2 = time.time()
    print("time2: " + str(end2 - start2))

    # thread pool, mapping the function over the whole list
    start3 = time.time()
    with ThreadPoolExecutor(3) as executor1:
        executor1.map(sayhello, seed)
    end3 = time.time()
    print("time3: " + str(end3 - start3))

if __name__ == '__main__':
    main()
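When each task returns a value (a crawler's status codes, for instance), `concurrent.futures.as_completed` collects results as they finish. A sketch with a simulated download; the URLs and the `download` stub are made up for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def download(url):
    # Simulated fetch; a real crawler would call requests.get(url) here.
    time.sleep(0.1)
    return url, 200

urls = ["http://example.com/page%d" % i for i in range(1, 6)]

results = {}
with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() returns a Future; as_completed yields futures as they
    # finish, so results arrive in completion order, not submission order.
    futures = [executor.submit(download, u) for u in urls]
    for fut in as_completed(futures):
        url, status = fut.result()
        results[url] = status

print(results)
```

Unlike `executor.map`, which yields results in input order, this loop lets you start processing the fastest responses immediately.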

In a multithreaded crawler, a problem in one thread can take the whole run down with it, since every thread lives in the same process; plain multithreading is therefore a fragile choice for crawling unless each task handles its own errors.
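One way to keep a single bad task from spoiling the rest is to catch exceptions inside each worker. A sketch with a simulated download; the URLs and the failure condition are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # Simulated fetch: a real crawler would call requests.get(url) here.
    if url.endswith("3"):
        raise ValueError("simulated network error for " + url)
    return url, 200

def safe_download(url):
    # Catch everything inside the worker so one failure cannot
    # propagate out of the pool and abort the other tasks.
    try:
        return download(url)
    except Exception as exc:
        return url, exc

urls = ["http://example.com/page%d" % i for i in range(1, 6)]

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(safe_download, urls))

for url, outcome in results:
    print(url, outcome)
```

Here the simulated failure on one URL is returned as a value instead of raised, so the remaining four downloads still complete.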

 

3. Coroutines

Case 01

A client example: await waits for something to finish before the next step runs.

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()  # note: text() is called with parentheses

async def main():
    # create a session object with aiohttp's ClientSession()
    async with aiohttp.ClientSession() as session:
        # to use the async function fetch's return value you must await it,
        # i.e. wait for it to finish before taking the result
        html = await fetch(session, 'http://www.baidu.com')
        print(html)

loop = asyncio.get_event_loop()  # get the event loop
loop.run_until_complete(main())  # run the coroutine

Case 02

Concurrency via gather; asyncio.sleep pauses the current task for a while and hands the CPU to other tasks. Besides running several tasks at once, gather can also group tasks, so prefer gather. "gather" means "to collect": it collects the coroutines' results, and note that the results are stored in the same order as the input coroutines.

# coding:utf-8
import asyncio

async def a(t):
    print('-->', t)
    await asyncio.sleep(0.5)  # pause 0.5s; the CPU is yielded so other coroutines can run
    print('<--', t)
    return t * 10

def main():
    futs = [a(t) for t in range(6)]  # list comprehension
    print(futs)                      # coroutine objects
    ret = asyncio.gather(*futs)      # remember the *
    print(ret)                       # <_GatheringFuture pending>, a future collecting the results
    loop = asyncio.get_event_loop()
    ret1 = loop.run_until_complete(ret)
    print(ret1)

main()

Case 03

loop.create_task is used even more widely than gather; loop.create_task sets the task running.

# coding:utf-8
import asyncio

async def a(t):
    print('-->', t)
    await asyncio.sleep(0.5)  # sleep 0.5s here
    print('<--', t)
    return t * 10

async def b():
    cnt = 0  # counter
    while 1:  # run forever
        cnt += 1
        cor = a(cnt)  # coroutine object
        resp = loop.create_task(cor)
        # while b() sleeps, a() can run: a(1) starts, sleep 0.1s; a(2) starts,
        # sleep 0.1s; ... by the time a(5) starts, 0.5s have elapsed
        await asyncio.sleep(0.1)
        print(resp)

loop = asyncio.get_event_loop()
loop.run_until_complete(b())
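On Python 3.7+, the get_event_loop/run_until_complete boilerplate can be replaced by asyncio.run, and a Semaphore is a common way to cap how many requests run at once. A standard-library-only sketch: the network call is simulated with asyncio.sleep and the URLs are made up, so it runs without aiohttp; in a real crawler fetch() would use a ClientSession as in Case 01.

```python
import asyncio

async def fetch(sem, url):
    async with sem:                # at most 3 fetches run concurrently
        await asyncio.sleep(0.1)   # stand-in for the actual network call
        return url, 200

async def main():
    sem = asyncio.Semaphore(3)
    urls = ["http://example.com/page%d" % i for i in range(6)]
    # gather preserves input order, even though tasks finish in batches of 3
    return await asyncio.gather(*(fetch(sem, u) for u in urls))

results = asyncio.run(main())      # asyncio.run (3.7+) creates and closes the loop for you
print(results)
```

Without the semaphore, all six coroutines would hit the server at once; raising or lowering `Semaphore(3)` trades speed against politeness to the target site.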

 

References:

Python Threads and Processes

https://www.jianshu.com/p/262594f44549

 

Python's Global Interpreter Lock (GIL)

https://www.jianshu.com/p/9eb586b64bdb

 

Distributed Computing in Python

https://www.jianshu.com/p/a8ec42f6cb4e

 

Deep Dive into Python Asynchronous Programming (Part 1) - Jianshu

https://www.jianshu.com/p/fe146f9781d2

 

Awesome Asyncio (Chinese edition) - Jianshu

https://www.jianshu.com/p/4f667ecae64f

 

Kotlin Coroutines Fully Explained (1): Introduction to Coroutines - Jianshu (coroutine article series)

https://www.jianshu.com/p/2659bbe0df16

 

Aiohttp-related documentation

Welcome to AIOHTTP — aiohttp 3.5.4 documentation

https://aiohttp.readthedocs.io/en/stable/

https://www.cnblogs.com/shijieli/p/10826743.html
