Hands-on | Building a high-performance crawler with aiohttp and uvloop

Date: 2019-11-17

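asyncio was added to the standard library in Python 3.4, bringing support for asynchronous I/O. It is built around an event loop and makes asynchronous I/O operations easy to implement. Next, we will use libraries built on asyncio to implement a high-performance crawler.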

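Preparation

Earth View from Google Earth is a Chrome extension that automatically loads a background image from Google Earth whenever a new tab is opened.

Watching the extension's network requests in Chrome DevTools, we find that it requests a JSON file at an address such as www.gstatic.com/prettyearth…, and that the file contains the image as Base64-encoded content. The image IDs appear to fall roughly between 1000 and 8000. Our crawler will download these beautiful background images.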

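Implementing the main logic

Because the crawl target is a JSON file, the crawler's main logic becomes: fetch the JSON --> extract the image --> save the image.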
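
The extract step can be tried offline before any network code is written. The sketch below assumes the JSON carries a `dataUri` field whose value is a `data:image/jpeg;base64,` header followed by the payload, which is what the crawler code in this article relies on. The `extract_image` helper and the sample payload are hypothetical and exist only for illustration.

```python
import base64
import json

DATA_URI_PREFIX = 'data:image/jpeg;base64,'


def extract_image(json_text):
    """Return the raw image bytes stored in the JSON document's dataUri field."""
    json_obj = json.loads(json_text)
    data_uri = json_obj['dataUri']
    # Drop the data-URI header so only the Base64 payload remains
    if data_uri.startswith(DATA_URI_PREFIX):
        data_uri = data_uri[len(DATA_URI_PREFIX):]
    return base64.b64decode(data_uri)


# Hypothetical sample: a few fake bytes standing in for a real JPEG
fake_image = b'\xff\xd8\xff\xe0 not a real image'
sample_json = json.dumps(
    {'dataUri': DATA_URI_PREFIX + base64.b64encode(fake_image).decode('ascii')})

image_bytes = extract_image(sample_json)
```

Keeping the decoding in its own function makes it possible to check the parsing logic without touching the network.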

requests is a commonly used HTTP library, but its requests are all synchronous, so we use aiohttp, an asynchronous HTTP library, instead.

import base64
import json
import os

import aiohttp


async def fetch_image_by_id(item_id):
    url = f'https://www.gstatic.com/prettyearth/assets/data/v2/{item_id}.json'
    # The URL is https, so we choose to skip SSL certificate verification
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False)) as session:
        async with session.get(url) as response:
            # The response body is a JSON string; parse it into an object
            try:
                json_obj = json.loads(await response.text())
            except json.decoder.JSONDecodeError:
                print(f'Download failed - {item_id}.jpg')
                return
            # Take the image field from the JSON and Base64-decode it into bytes
            image_str = json_obj['dataUri'].replace('data:image/jpeg;base64,', '')
            image_data = base64.b64decode(image_str)
            save_folder = os.path.join(
                os.path.dirname(os.path.realpath(__file__)), 'google_earth')
            # Create the output folder if it does not exist yet
            os.makedirs(save_folder, exist_ok=True)
            with open(os.path.join(save_folder, f'{item_id}.jpg'), 'wb') as f:
                f.write(image_data)
            print(f'Download complete - {item_id}.jpg')

aiohttp is built on asyncio, so calling it requires the async/await syntax. As you can see, because aiohttp provides ClientSession as a context manager, the code uses the async with syntax.
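
To make the `async with` mechanics concrete, here is a minimal sketch of an asynchronous context manager. `FakeSession` is a hypothetical stand-in, not aiohttp's class; it only records the order in which its methods run.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a session object; it only records events."""

    def __init__(self):
        self.events = []

    async def __aenter__(self):
        # Runs when the `async with` block is entered; it may await I/O
        self.events.append('open')
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Runs when the block exits, even if the body raised an exception
        self.events.append('close')
        return False

    async def get(self, url):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        self.events.append(f'get {url}')
        return 'ok'


async def demo():
    session = FakeSession()
    async with session as s:
        result = await s.get('example')
    return result, session.events


result, events = asyncio.run(demo())
```

The `close` event is recorded after the body finishes, which is the cleanup guarantee that `async with` provides for a real session.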

Adding concurrency

The code above fetches a single image. To fetch images in bulk, we wrap it in one more function:

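import asyncio


async def fetch_all_images():
    # Use a Semaphore to cap the number of concurrent downloads
    sem = asyncio.Semaphore(10)

    async def fetch_with_limit(item_id):
        async with sem:
            await fetch_image_by_id(item_id)

    # Schedule every download together so that up to 10 run at the same time
    await asyncio.gather(*(fetch_with_limit(i) for i in range(1000, 8000)))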
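
One way to check that a Semaphore really caps concurrency is to count how many jobs are inside the guarded section at once. The sketch below uses `asyncio.sleep` as a simulated download; `run_limited` and its parameters are hypothetical names chosen for illustration. Note that the jobs have to be scheduled together (here with `asyncio.gather`) for the limit to matter at all.

```python
import asyncio


async def run_limited(job_count, limit):
    """Run job_count simulated downloads, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def job():
        nonlocal running, peak
        async with sem:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # simulated network wait
            running -= 1

    await asyncio.gather(*(job() for _ in range(job_count)))
    return peak


peak = asyncio.run(run_limited(job_count=50, limit=10))
print(f'peak concurrency: {peak}')
```

The reported peak should never exceed the limit passed to the Semaphore.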

Next, add this coroutine to asyncio's event loop.

event_loop = asyncio.get_event_loop()
future = asyncio.ensure_future(fetch_all_images())
results = event_loop.run_until_complete(future)
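
On Python 3.7 and later, `asyncio.run` is a shorter way to drive a coroutine to completion. The sketch below uses a hypothetical stub in place of `fetch_all_images()` so that it runs without network access.

```python
import asyncio


async def fetch_all_images_stub():
    # Hypothetical stand-in for fetch_all_images(); performs no network access
    await asyncio.sleep(0)
    return 'done'


# asyncio.run creates a new event loop, runs the coroutine to completion,
# and closes the loop afterwards
outcome = asyncio.run(fetch_all_images_stub())
```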

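Speeding up with uvloop

uvloop is built on libuv, a high-performance asynchronous I/O library written in C. uvloop replaces asyncio's default event loop and can further speed up asynchronous I/O.

Using uvloop is very simple: before obtaining the event loop, call the following to set asyncio's event loop policy to uvloop's.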

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
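
uvloop is not available on every platform, so a script may want to fall back to the default loop when the import fails. This is a minimal sketch under that assumption; `install_fast_loop` and `ping` are hypothetical helper names.

```python
import asyncio


def install_fast_loop():
    """Use uvloop when it is installed; otherwise keep asyncio's default loop."""
    try:
        import uvloop
    except ImportError:
        return False
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    return True


async def ping():
    await asyncio.sleep(0)
    return 'pong'


# The policy must be set before the event loop is created
using_uvloop = install_fast_loop()
reply = asyncio.run(ping())
print(f'uvloop active: {using_uvloop}, reply: {reply}')
```

The rest of the program stays the same either way, since uvloop is a drop-in replacement for the event loop.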

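With the code above, we can quickly download roughly 1,500 images.

Performance comparison

To verify the performance of aiohttp and uvloop, I implemented a multi-process version of the crawler with requests and the concurrent library, and had each version fetch 20 IDs. The time taken is shown in the figure.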
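
A comparison of this kind can be tried in miniature with a small timing harness. The sketch below is not the author's benchmark: it times simulated I/O (`asyncio.sleep`) run one request at a time versus all at once, and every name in it is hypothetical. The absolute numbers depend on the machine.

```python
import asyncio
import time


async def fake_fetch(delay=0.05):
    await asyncio.sleep(delay)  # stands in for one network round trip


async def run_serial(n):
    # Wait for each simulated request before starting the next
    for _ in range(n):
        await fake_fetch()


async def run_concurrent(n):
    # Start all simulated requests together and wait for them as a group
    await asyncio.gather(*(fake_fetch() for _ in range(n)))


def timed(coro_func, n):
    """Return the wall-clock seconds taken to run coro_func(n)."""
    start = time.perf_counter()
    asyncio.run(coro_func(n))
    return time.perf_counter() - start


serial_seconds = timed(run_serial, 20)
concurrent_seconds = timed(run_concurrent, 20)
print(f'serial: {serial_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s')
```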

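As you can see, the elapsed time differs by roughly a factor of 7. In an I/O-bound scenario such as crawling, the aiohttp + uvloop combination has an overwhelming advantage. I believe that before long, libraries built on asyncio will rescue countless crawler engineers from overtime.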