This article was first published on Zhihu.
We previously covered an asynchronous crawler built on asyncio, but that code was fairly complex. In this article we implement an asynchronous crawler with the gevent module.
Since gevent is very easy to use, let's go straight to the code.
import gevent
from gevent import monkey
import requests
from bs4 import BeautifulSoup

monkey.patch_all()  # patch all blocking I/O operations; always include this line

def get_title(i):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i*25)
    text = requests.get(url).content
    soup = BeautifulSoup(text, 'html.parser')
    lis = soup.find('ol', class_='grid_view').find_all('li')
    for li in lis:
        title = li.find('span', class_="title").text
        print(title)

gevent.joinall([gevent.spawn(get_title, i) for i in range(10)])
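For readers who want to try the same fan-out pattern without installing gevent, it can be sketched with the standard library's concurrent.futures. Here fetch_page is a stub standing in for the real requests.get call, so no network access is needed:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(i):
    # stub standing in for requests.get(...); returns a fake page label
    return 'page-{}'.format(i)

# fan out over 10 "pages" concurrently; map returns results in input order
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_page, range(10)))

print(results)  # ['page-0', 'page-1', ..., 'page-9']
```

Unlike gevent's greenlets these workers are real OS threads, but the spawn-then-collect structure is analogous to gevent.joinall over spawned greenlets.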
Under the hood, gevent spins up multiple micro-threads (greenlets). Let's verify this with the threading module:
import gevent
from gevent import monkey
import requests
from bs4 import BeautifulSoup
import threading

monkey.patch_all()

def get_title(i):
    print(threading.current_thread().name)  # print the current thread's name
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i*25)
    text = requests.get(url).content
    soup = BeautifulSoup(text, 'html.parser')
    lis = soup.find('ol', class_='grid_view').find_all('li')
    for li in lis:
        title = li.find('span', class_="title").text
        print(title)

gevent.joinall([gevent.spawn(get_title, i) for i in range(10)])
The run first prints the following output:
DummyThread-1
DummyThread-2
DummyThread-3
DummyThread-4
DummyThread-5
DummyThread-6
DummyThread-7
DummyThread-8
DummyThread-9
DummyThread-10
This shows that ten micro-threads are indeed running concurrently.
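The DummyThread-N names appear because greenlets are not real OS threads: when threading.current_thread() is called from an execution context the threading module did not create, it wraps that context in a dummy thread object. For contrast, this stdlib-only sketch prints the names of real threads (the exact name format varies across Python versions, so treat the names as illustrative):

```python
import threading

names = []

def worker():
    # record the real OS thread's name
    names.append(threading.current_thread().name)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(names)  # real threads are named Thread-1, Thread-2, ... (not DummyThread-N)
```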
We can also force everything to run in a single real thread; just change

monkey.patch_all()

to

monkey.patch_all(thread=False)
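Monkey patching itself is not magic: it simply rebinds attributes on a module or class at runtime, which is how gevent swaps the blocking functions in socket, time, and friends for cooperative versions. A toy illustration (the lib module and fetch function here are made up for the demo):

```python
import types

# a stand-in "library" module with a blocking-style function
lib = types.ModuleType('lib')
lib.fetch = lambda url: 'blocking: ' + url

def cooperative_fetch(url):
    # the replacement a patcher would install
    return 'cooperative: ' + url

# the "monkey patch": rebind the module attribute at runtime
lib.fetch = cooperative_fetch

print(lib.fetch('https://example.com'))  # cooperative: https://example.com
```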
The author of the requests library fused requests and gevent into the grequests module, built specifically for asynchronous network requests. It is used like this:
import grequests
from bs4 import BeautifulSoup

def get_title(rep):
    soup = BeautifulSoup(rep.text, 'html.parser')
    lis = soup.find('ol', class_='grid_view').find_all('li')
    for li in lis:
        title = li.find('span', class_="title").text
        print(title)

reps = (grequests.get('https://movie.douban.com/top250?start={}&filter='.format(i*25)) for i in range(10))
for rep in grequests.map(reps):
    get_title(rep)
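Note that the parsing step is independent of how the pages are fetched. To test the extraction logic offline, the same idea can be exercised on an inline HTML snippet using only the standard library's html.parser (a sketch, not the BeautifulSoup code above; SAMPLE is a made-up fragment mimicking the Douban markup):

```python
from html.parser import HTMLParser

SAMPLE = '<ol class="grid_view"><li><span class="title">肖申克的救赎</span></li></ol>'

class TitleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'span' and ('class', 'title') in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

p = TitleExtractor()
p.feed(SAMPLE)
print(p.titles)  # ['肖申克的救赎']
```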