With asyncio and aiohttp in Python, are multithreading/multiprocessing still necessary for IO-bound tasks such as web crawling?

Date: 2019-12-10



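I've recently been learning asynchronous programming in Python. After reading a few blog posts I ran a small test comparing a crawler built on asyncio+aiohttp with one built on asyncio+aiohttp+concurrent.futures (thread pool/process pool). Note: the crawler does almost no computation. To isolate async performance, it only issues network IO requests; that is, the program is done as soon as aiohttp has fetched the pages.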

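The former turned out to be faster than the latter. I asked another blogger (the one who provided the code did not reply), who said that since all my tasks are IO-bound, spreading them across a thread pool/process pool with concurrent.futures only adds switching overhead between threads/processes, which lowers the crawler's efficiency. On reflection, that makes sense.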
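To make the overhead argument concrete, here is a minimal sketch that swaps the network request for `asyncio.sleep`, so it runs without network access. The names `fake_fetch`, `run_single_loop`, `run_with_thread_pool`, and `timed` are hypothetical, and the 0.05 s delay is an arbitrary stand-in for network latency rather than a measured value.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05  # assumed stand-in for network latency, not a measured value


async def fake_fetch(n):
    # Simulated IO: yields to the event loop instead of hitting the network.
    await asyncio.sleep(DELAY)
    return n


async def gather_all(numbers):
    return await asyncio.gather(*(fake_fetch(n) for n in numbers))


def run_single_loop(numbers):
    # All coroutines share one event loop in the calling thread.
    return list(asyncio.run(gather_all(numbers)))


def run_with_thread_pool(numbers, workers=3):
    # Split the work into chunks; each worker thread runs its own event loop.
    numbers = list(numbers)
    chunks = [numbers[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda chunk: asyncio.run(gather_all(chunk)), chunks)
        return sorted(n for part in parts for n in part)


def timed(func, *args):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

Calling `timed(run_single_loop, range(12))` and `timed(run_with_thread_pool, range(12))` gives two durations to compare. Because every task here only waits, the thread pool has no blocking work to run in parallel, so any extra time it takes is thread and loop setup cost; actual numbers will vary by machine.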

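So my question is: for the page-fetching step alone, i.e. the requests.get part, multithreading is surely no longer necessary because of the GIL. A process pool may do somewhat better, but it still performs worse than an async crawler and wastes more resources. If so, can we from now on replace them entirely with the emerging asyncio+aiohttp for the fetching stage of a crawler (and for other IO tasks such as database access and file reads/writes)?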

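Of course the data-processing stage still calls for multiprocessing, but I feel multithreading has become completely useless: its advantage over multiprocessing used to be IO-bound tasks, and that advantage now appears to have been taken over entirely by async. (This assumes Python 2.x compatibility is not a concern.)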
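As a rough illustration of combining an event loop with a pool for the processing stage, the sketch below hands a placeholder computation to an executor through `loop.run_in_executor`. The names `count_words`, `process_pages`, and `demo`, as well as the sample data, are hypothetical. A `ThreadPoolExecutor` is used only so the sketch runs anywhere; for genuinely CPU-bound work the assumption is that a `ProcessPoolExecutor` would be passed in instead.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def count_words(page):
    # Placeholder for CPU-bound parsing; a real crawler would parse HTML here.
    return len(page.split())


async def process_pages(executor, pages):
    # run_in_executor hands each call to the pool and returns an awaitable,
    # so the event loop stays free while the workers do the processing.
    loop = asyncio.get_running_loop()
    jobs = [loop.run_in_executor(executor, count_words, page) for page in pages]
    return await asyncio.gather(*jobs)


def demo():
    pages = ["a b c", "d e", "f"]  # made-up sample data
    # ThreadPoolExecutor keeps the sketch portable; for real CPU-bound work,
    # swap in concurrent.futures.ProcessPoolExecutor under a __main__ guard.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return asyncio.run(process_pages(executor, pages))
```

`process_pages` does not care which kind of executor it receives, so the choice between threads and processes stays a one-line decision at the call site.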

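Note: one extra question. Some blogs say the requests library does not support asynchronous programming, and that aiohttp should be used to take full advantage of async. What does that mean? I haven't read the requests source code, but some results do show aiohttp performing better. Could anyone explain?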
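One common reading of "requests does not support async" is that `requests.get` is a blocking call: inside a coroutine it never yields to the event loop, so the other tasks cannot make progress while it waits. The sketch below imitates this without any network access, using `time.sleep` as a stand-in for a blocking request and `asyncio.sleep` as a stand-in for an awaitable one. All function names and the 0.2 s delay are assumptions made for illustration.

```python
import asyncio
import time

DELAY = 0.2  # assumed latency per simulated request


async def blocking_fetch(n):
    # time.sleep stands in for a blocking call such as requests.get:
    # it never yields, so the event loop cannot run other tasks meanwhile.
    time.sleep(DELAY)
    return n


async def non_blocking_fetch(n):
    # asyncio.sleep stands in for an awaitable request such as aiohttp's:
    # awaiting it hands control back to the event loop.
    await asyncio.sleep(DELAY)
    return n


async def run_all(fetch, count):
    start = time.perf_counter()
    results = await asyncio.gather(*(fetch(n) for n in range(count)))
    return results, time.perf_counter() - start


def compare(count=3):
    # Returns (elapsed with blocking calls, elapsed with awaitable calls).
    _, blocking_time = asyncio.run(run_all(blocking_fetch, count))
    _, non_blocking_time = asyncio.run(run_all(non_blocking_fetch, count))
    return blocking_time, non_blocking_time
```

With three tasks, the blocking variant should take roughly three delays in sequence, while the awaitable variant should take roughly one, since its waits overlap.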

Code

asyncio+aiohttp

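import asyncio
import time

import aiohttp

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'


async def fetch_async(a):
    # aiohttp.request issues a single request without an explicit session
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
    return data['args']['a']


async def main():
    # Run all requests concurrently on one event loop
    return await asyncio.gather(*(fetch_async(num) for num in NUMBERS))


start = time.time()
results = asyncio.run(main())

for num, result in zip(NUMBERS, results):
    print('fetch({}) = {}'.format(num, result))

print('Use asyncio+aiohttp cost: {}'.format(time.time() - start))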

asyncio+aiohttp+thread pool: about 1 second slower than the above

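import asyncio
import math
import time
from concurrent.futures import ThreadPoolExecutor

import aiohttp

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'


async def fetch_async(a):
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
    return a, data['args']['a']


def sub_loop(numbers):
    # Each worker thread creates and runs its own event loop
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        tasks = [fetch_async(num) for num in numbers]
        results = loop.run_until_complete(asyncio.gather(*tasks))
    finally:
        loop.close()
    for num, result in results:
        print('fetch({}) = {}'.format(num, result))


async def run(executor, numbers):
    await asyncio.get_running_loop().run_in_executor(executor, sub_loop, numbers)


def chunks(l, size):
    # Split l into at most `size` chunks of roughly equal length
    n = math.ceil(len(l) / size)
    for i in range(0, len(l), n):
        yield l[i:i + n]


async def main(executor):
    await asyncio.gather(*(run(executor, chunked) for chunked in chunks(NUMBERS, 3)))


start = time.time()
# max_workers=3 is an assumption (the original omits the executor setup);
# it matches the number of chunks
with ThreadPoolExecutor(max_workers=3) as executor:
    asyncio.run(main(executor))

print('Use asyncio+aiohttp+ThreadPoolExecutor cost: {}'.format(time.time() - start))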

Traditional requests + ThreadPoolExecutor: about 3 times slower than the above

import time
import requests
from concurrent.futures import ThreadPoolExecutor

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

def fetch(a):
    r = requests.get(URL.format(a))
    return r.json()['args']['a']

start = time.time()
with ThreadPoolExecutor(max_workers=3) as executor:
    for num, result in zip(NUMBERS, executor.map(fetch, NUMBERS)):
        print('fetch({}) = {}'.format(num, result))

print('Use requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))

Addendum

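The questions above assume CPython. Answers along the lines of "I like multithreading and dislike the coroutine style" are clearly outside the scope of this question. What I mainly want to ask is:
If Python cannot get rid of the GIL, I think the ideal future model is multiprocessing + coroutines (asyncio+aiohttp). uvloop, sanic, and a crawler project in 500lines have already started doing this. Setting compatibility aside, is this view correct, and in what scenarios can coroutines not replace multithreading?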
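One scenario often raised for the last question is a blocking library that has no async counterpart. A possible workaround, sketched below, is to keep the coroutine style and push each blocking call onto a worker thread with `asyncio.to_thread` (available from Python 3.9). The function `blocking_lookup` is a hypothetical stand-in for such a library call, and the 0.1 s delay is arbitrary.

```python
import asyncio
import time


def blocking_lookup(key):
    # Hypothetical stand-in for a blocking library call with no async version.
    time.sleep(0.1)
    return key.upper()


async def lookup_all(keys):
    # asyncio.to_thread runs each blocking call in a worker thread and
    # returns an awaitable, so the calls can overlap under gather.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_lookup, key) for key in keys)
    )


def demo():
    return asyncio.run(lookup_all(["a", "b", "c"]))
```

In this arrangement the threads do not go away; they sit underneath the coroutines, which suggests the two models can complement each other rather than one strictly replacing the other.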

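There are many approaches to async; twisted, tornado, and others each have their own solution. This question is specifically about coroutine-based async with asyncio+aiohttp.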


This answer explains it quite clearly:
http://www.goodpm.net/postreply/node.js/1010000007987098/Python有了asyncio和aiohttp在爬虫这类型IO任务中多线程多进程还有存在的必要吗.html