Advanced Python Web Crawling, Part 6: Using Multiple Processes

Date: 2019-11-10

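Preface

The previous section introduced the thread library for multithreading. Multithreading in Python is not truly parallel, however: because of the global interpreter lock (GIL) in CPython, threads cannot make full use of multi-core CPU resources.

To make full use of multiple cores, in most cases you need multiple processes, and the package for that in Python is multiprocessing.

With it, you can easily move from a single process to concurrent execution. multiprocessing supports spawning child processes, communicating and sharing data between them, and several forms of synchronization, and it provides components such as Process, Queue, Pipe, and Lock.

This section covers: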

Process
Lock
Semaphore
Queue
Pipe
Pool

Process

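Basic usage

In multiprocessing, each process is represented by a Process object. First, look at its API:

Process([group [, target [, name [, args [, kwargs]]]]])

target is the callable to run; pass the function itself (its name, without parentheses)
args is the tuple of positional arguments for the callable; for example, if target is a function a that takes two parameters m and n, pass args=(m, n)
kwargs is a dictionary of keyword arguments for the callable
name is an alias, i.e. a name you give the process
group is not actually used

Let's start with an example: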

import multiprocessing

def process(num):
    print('Process:', num)

if __name__ == '__main__':
    for i in range(5):
        # args must be a tuple, hence the trailing comma
        p = multiprocessing.Process(target=process, args=(i,))
        p.start()

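The above is the simplest way to create a Process: target takes the function, and args takes the function's arguments as a tuple. If there is only one argument, it is a tuple of length 1.

Calling start() then launches each process.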
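The name and kwargs parameters from the API list are not exercised by the example above. The following is a minimal sketch of how they might be used together with join() and exitcode; the describe and worker functions and the name 'demo-worker' are illustrative choices, not part of the original article.

```python
import multiprocessing


def describe(label, repeat=1):
    # Build the message from one positional and one keyword argument.
    return ' '.join([label] * repeat)


def worker(label, repeat=1):
    # Runs in the child process; current_process() exposes the name we set.
    name = multiprocessing.current_process().name
    print(name + ': ' + describe(label, repeat))


if __name__ == '__main__':
    p = multiprocessing.Process(
        target=worker,
        name='demo-worker',          # illustrative name
        args=('ping',),              # positional arguments as a tuple
        kwargs={'repeat': 3},        # keyword arguments as a dict
    )
    p.start()
    p.join()                         # wait for the child to exit
    print('exit code:', p.exitcode)  # 0 means the child exited normally
```

Giving processes explicit names can make log output easier to follow when many children are running.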

You can also use cpu_count() to get the number of CPU cores on the current machine, and active_children() to get all child processes that are currently running.

Here is an example:

import multiprocessing
import time

def process(num):
    time.sleep(num)
    print('Process:', num)

if __name__ == '__main__':
    for i in range(5):
        p = multiprocessing.Process(target=process, args=(i,))
        p.start()

    print('CPU number:' + str(multiprocessing.cpu_count()))
    # active_children() lists the child processes that are still alive
    for p in multiprocessing.active_children():
        print('Child process name: ' + p.name + ' id: ' + str(p.pid))

    print('Process Ended')

Output:

Process: 0

CPU number:8

Child process name: Process-2 id: 9641

Child process name: Process-4 id: 9643

Child process name: Process-5 id: 9644

Child process name: Process-3 id: 9642

Process Ended

Process: 1

Process: 2

Process: 3

Process: 4

Custom classes

You can also subclass Process to define your own process class; you only need to implement the run method.

Here is an example:

from multiprocessing import Process
import time

class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print('Pid: ' + str(self.pid) + ' LoopCount: ' + str(count))

if __name__ == '__main__':
    for i in range(2, 5):
        p = MyProcess(i)
        p.start()

In the example above, we subclass Process and implement the run method, which prints the process ID and the loop counter.

Output:

Pid: 28116 LoopCount: 0

Pid: 28117 LoopCount: 0

Pid: 28118 LoopCount: 0

Pid: 28116 LoopCount: 1

Pid: 28117 LoopCount: 1

Pid: 28118 LoopCount: 1

Pid: 28117 LoopCount: 2

Pid: 28118 LoopCount: 2

Pid: 28118 LoopCount: 3

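As you can see, the three processes printed 2, 3, and 4 lines respectively.

We can encapsulate related methods in a class of their own and, when needed, simply instantiate the class and run it.

daemon

This subsection introduces an attribute called daemon. Each process can set it individually; if it is set to True, the child process is terminated automatically when the parent process exits.

Here is the same example as before, with the daemon attribute added: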

from multiprocessing import Process
import time

class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print('Pid: ' + str(self.pid) + ' LoopCount: ' + str(count))

if __name__ == '__main__':
    for i in range(2, 5):
        p = MyProcess(i)
        p.daemon = True   # must be set before start()
        p.start()

    print('Main process Ended!')

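Here the daemon attribute is set before each process is started, and the main (parent) process prints one line at the end.

Output:

Main process Ended!

The result is simple: the main process does nothing except print one line and exit, and at that point the child processes are terminated as well.

This effectively prevents child processes from being spawned without control. Written this way, you do not need to worry about whether the child processes were shut down when you close the main program.

But this is not the effect we want. Can we let all the child processes finish before exiting? Certainly: just add a call to join().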

from multiprocessing import Process
import time

class MyProcess(Process):
    def __init__(self, loop):
        Process.__init__(self)
        self.loop = loop

    def run(self):
        for count in range(self.loop):
            time.sleep(1)
            print('Pid: ' + str(self.pid) + ' LoopCount: ' + str(count))

if __name__ == '__main__':
    for i in range(2, 5):
        p = MyProcess(i)
        p.daemon = True
        p.start()
        # join() inside the loop waits for this child before starting the next
        p.join()

    print('Main process Ended!')

Here join() is called on each child process, so the parent (main) process waits for the child to finish.

Output:

Pid: 29902 LoopCount: 0

Pid: 29902 LoopCount: 1

Pid: 29905 LoopCount: 0

Pid: 29905 LoopCount: 1

Pid: 29905 LoopCount: 2

Pid: 29912 LoopCount: 0

Pid: 29912 LoopCount: 1

Pid: 29912 LoopCount: 2

Pid: 29912 LoopCount: 3

Main process Ended!

Only after all the child processes have finished does the parent process print its closing message.
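In the output above, each child finishes before the next one starts, because join() is called right after each start(). If the children should run at the same time and the parent should still wait for all of them, one common pattern is to start every process first and join them afterwards. The sketch below assumes a placeholder work function and a run_concurrently helper, both invented for illustration.

```python
from multiprocessing import Process
import time


def work(seconds):
    # Stand-in for real work: sleeps for the given number of seconds.
    time.sleep(seconds)
    return seconds


def run_concurrently(durations):
    # Start every child first, then wait for all of them.
    processes = [Process(target=work, args=(d,)) for d in durations]
    started = time.time()
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return time.time() - started


if __name__ == '__main__':
    elapsed = run_concurrently([0.5, 0.5, 0.5])
    # Typically closer to the longest single duration than to the 1.5 s sum.
    print('Elapsed: %.2f s' % elapsed)
```

Keeping the Process objects in a list is what makes the second loop possible; with a single reused variable, only the last child could be joined.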

Lock

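In small examples like the ones above, you may run into output in which lines from different processes run together.

What is going on? Some of the output is misaligned. This is caused by parallelism: two processes write at the same time, and the second process prints its result before the first process's newline has been written, which breaks the layout.

Ultimately, this happens because several processes access a shared resource (the output) at the same time.

How can this be avoided? Naturally, only one process may write at any given moment while the others wait; once that process has finished writing, another one takes its turn. This is called "mutual exclusion".

We can implement it with Lock: a process acquires the lock before writing, and the other processes wait. When it is done, it releases the lock and another process can write.

Let's first look at an example: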

from multiprocessing import Process, Lock
import time

class MyProcess(Process):
    def __init__(self, loop, lock):
        Process.__init__(self)
        self.loop = loop
        self.lock = lock

    def run(self):
        for count in range(self.loop):
            time.sleep(0.1)
            # self.lock.acquire()
            print('Pid: ' + str(self.pid) + ' LoopCount: ' + str(count))
            # self.lock.release()

if __name__ == '__main__':
    lock = Lock()
    for i in range(10, 15):
        p = MyProcess(i, lock)
        p.start()

First, the output without the lock:

Pid: 45755 LoopCount: 0

Pid: 45756 LoopCount: 0

Pid: 45757 LoopCount: 0

Pid: 45758 LoopCount: 0

Pid: 45759 LoopCount: 0

Pid: 45755 LoopCount: 1

Pid: 45756 LoopCount: 1

Pid: 45757 LoopCount: 1

Pid: 45758 LoopCount: 1

Pid: 45759 LoopCount: 1

Pid: 45755 LoopCount: 2Pid: 45756 LoopCount: 2

Pid: 45757 LoopCount: 2

Pid: 45758 LoopCount: 2

Pid: 45759 LoopCount: 2

Pid: 45756 LoopCount: 3

Pid: 45755 LoopCount: 3

Pid: 45757 LoopCount: 3

Pid: 45758 LoopCount: 3

Pid: 45759 LoopCount: 3

Pid: 45755 LoopCount: 4

Pid: 45756 LoopCount: 4

Pid: 45757 LoopCount: 4

Pid: 45759 LoopCount: 4

Pid: 45758 LoopCount: 4

Pid: 45756 LoopCount: 5

Pid: 45755 LoopCount: 5

Pid: 45757 LoopCount: 5

Pid: 45759 LoopCount: 5

Pid: 45758 LoopCount: 5

Pid: 45756 LoopCount: 6Pid: 45755 LoopCount: 6

Pid: 45757 LoopCount: 6

Pid: 45759 LoopCount: 6

Pid: 45758 LoopCount: 6

Pid: 45755 LoopCount: 7Pid: 45756 LoopCount: 7

Pid: 45757 LoopCount: 7

Pid: 45758 LoopCount: 7

Pid: 45759 LoopCount: 7

Pid: 45756 LoopCount: 8Pid: 45755 LoopCount: 8

Pid: 45757 LoopCount: 8

Pid: 45758 LoopCount: 8Pid: 45759 LoopCount: 8

Pid: 45755 LoopCount: 9

Pid: 45756 LoopCount: 9

Pid: 45757 LoopCount: 9

Pid: 45758 LoopCount: 9

Pid: 45759 LoopCount: 9

Pid: 45756 LoopCount: 10

Pid: 45757 LoopCount: 10

Pid: 45758 LoopCount: 10

Pid: 45759 LoopCount: 10

Pid: 45757 LoopCount: 11

Pid: 45758 LoopCount: 11

Pid: 45759 LoopCount: 11

Pid: 45758 LoopCount: 12

Pid: 45759 LoopCount: 12

Pid: 45759 LoopCount: 13

You can see that some lines of output have run together.

Now let's add the lock:

from multiprocessing import Process, Lock
import time

class MyProcess(Process):
    def __init__(self, loop, lock):
        Process.__init__(self)
        self.loop = loop
        self.lock = lock

    def run(self):
        for count in range(self.loop):
            time.sleep(0.1)
            self.lock.acquire()
            print('Pid: ' + str(self.pid) + ' LoopCount: ' + str(count))
            self.lock.release()

if __name__ == '__main__':
    lock = Lock()
    for i in range(10, 15):
        p = MyProcess(i, lock)
        p.start()

We acquire the lock before the print call and release it afterwards, which ensures that only one print runs at any given time.

Output:

Pid: 45889 LoopCount: 0

Pid: 45890 LoopCount: 0

Pid: 45891 LoopCount: 0

Pid: 45892 LoopCount: 0

Pid: 45893 LoopCount: 0

Pid: 45889 LoopCount: 1

Pid: 45890 LoopCount: 1

Pid: 45891 LoopCount: 1

Pid: 45892 LoopCount: 1

Pid: 45893 LoopCount: 1

Pid: 45889 LoopCount: 2

Pid: 45890 LoopCount: 2

Pid: 45891 LoopCount: 2

Pid: 45892 LoopCount: 2

Pid: 45893 LoopCount: 2

Pid: 45889 LoopCount: 3

Pid: 45890 LoopCount: 3

Pid: 45891 LoopCount: 3

Pid: 45892 LoopCount: 3

Pid: 45893 LoopCount: 3

Pid: 45889 LoopCount: 4

Pid: 45890 LoopCount: 4

Pid: 45891 LoopCount: 4

Pid: 45892 LoopCount: 4

Pid: 45893 LoopCount: 4

Pid: 45889 LoopCount: 5

Pid: 45890 LoopCount: 5

Pid: 45891 LoopCount: 5

Pid: 45892 LoopCount: 5

Pid: 45893 LoopCount: 5

Pid: 45889 LoopCount: 6

Pid: 45890 LoopCount: 6

Pid: 45891 LoopCount: 6

Pid: 45893 LoopCount: 6

Pid: 45892 LoopCount: 6

Pid: 45889 LoopCount: 7

Pid: 45890 LoopCount: 7

Pid: 45891 LoopCount: 7

Pid: 45892 LoopCount: 7

Pid: 45893 LoopCount: 7

Pid: 45889 LoopCount: 8

Pid: 45890 LoopCount: 8

Pid: 45891 LoopCount: 8

Pid: 45892 LoopCount: 8

Pid: 45893 LoopCount: 8

Pid: 45889 LoopCount: 9

Pid: 45890 LoopCount: 9

Pid: 45891 LoopCount: 9

Pid: 45892 LoopCount: 9

Pid: 45893 LoopCount: 9

Pid: 45890 LoopCount: 10

Pid: 45891 LoopCount: 10

Pid: 45892 LoopCount: 10

Pid: 45893 LoopCount: 10

Pid: 45891 LoopCount: 11

Pid: 45892 LoopCount: 11

Pid: 45893 LoopCount: 11

Pid: 45893 LoopCount: 12

Pid: 45892 LoopCount: 12

Pid: 45893 LoopCount: 13

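Now everything is fine.

So when accessing a critical resource, using Lock avoids the problems caused by several processes using the resource at the same time.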
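A Lock from multiprocessing can also be used as a context manager, which releases the lock even if the protected code raises an exception. The following is a minimal sketch of that variation using plain functions instead of a Process subclass; the report and run helpers are illustrative names only.

```python
from multiprocessing import Process, Lock
import time


def report(lock, label, count):
    # 'with' acquires the lock and releases it again, even if print raises.
    with lock:
        print('Worker: ' + str(label) + ' LoopCount: ' + str(count))


def run(lock, label, loops):
    for count in range(loops):
        time.sleep(0.01)
        report(lock, label, count)


if __name__ == '__main__':
    lock = Lock()
    workers = [Process(target=run, args=(lock, i, 3)) for i in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

The behavior matches explicit acquire() and release() calls; the with form simply makes it harder to forget the release.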

Semaphore

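Semaphores play an important role in process synchronization. A semaphore controls how many processes may use a critical resource at once, and ensures mutual exclusion and synchronization between processes.

If you have studied operating systems, you will already be familiar with them; if you do not yet know what a semaphore is, the article "信号量解析" (Semaphores Explained) describes what it does.

Next, an example shows how processes use Semaphore for synchronization and mutual exclusion, and to limit the number of critical resources.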

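from multiprocessing import Process, Semaphore, Lock, Queue
import time

class Consumer(Process):
    def __init__(self, buffer, empty, full, lock):
        Process.__init__(self)
        # The shared objects are passed in explicitly so that both processes
        # use the same ones under every start method (fork, spawn, forkserver)
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.full.acquire()      # wait for an occupied slot
            self.lock.acquire()
            self.buffer.get()
            print('Consumer pop an element')
            time.sleep(1)
            self.lock.release()
            self.empty.release()     # one more free slot

class Producer(Process):
    def __init__(self, buffer, empty, full, lock):
        Process.__init__(self)
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.empty.acquire()     # wait for a free slot
            self.lock.acquire()
            self.buffer.put(1)
            print('Producer append an element')
            time.sleep(1)
            self.lock.release()
            self.full.release()      # one more occupied slot

if __name__ == '__main__':
    buffer = Queue(10)
    empty = Semaphore(2)   # number of free buffer slots
    full = Semaphore(0)    # number of occupied buffer slots
    lock = Lock()
    p = Producer(buffer, empty, full, lock)
    c = Consumer(buffer, empty, full, lock)
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    # Not reached: both loops run forever (stop the program with Ctrl+C)
    print('Ended!')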

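The code above implements the well-known producer-consumer problem. It defines two process classes, a consumer and a producer.

It defines a shared queue using the Queue data structure, and two semaphores: one represents the number of free buffer slots, the other the number of occupied slots.

The producer calls empty.acquire() to claim a buffer slot, which decreases the number of free slots by 1. It then acquires the lock and operates on the buffer. Afterwards it releases the lock and increases the number of occupied slots by 1 with full.release(). The consumer does the reverse.

The output is as follows: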

Producer append an element

Consumer pop an element

Producer append an element

Consumer pop an element

Producer append an element

Consumer pop an element

Producer append an element

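You can see that the two processes alternate: the producer puts an item into the buffer, the consumer takes it out, and the cycle repeats indefinitely.

The example above gives a feel for how semaphores are used.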
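Apart from the producer-consumer pattern, a semaphore is often used simply to cap how many processes may be inside a section at once. The sketch below is one possible illustration under assumed names: the download function stands in for any limited resource, and the limit of 2 is arbitrary.

```python
from multiprocessing import Process, Semaphore
import time


def download(slots, index):
    # No more workers than the semaphore's initial value are inside at once.
    with slots:
        print('Worker', index, 'entered')
        time.sleep(0.2)
        print('Worker', index, 'left')


if __name__ == '__main__':
    slots = Semaphore(2)   # two slots: an arbitrary value for illustration
    workers = [Process(target=download, args=(slots, i)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

With four workers and two slots, the third and fourth workers should only print "entered" after an earlier worker has left.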

Queue

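The example above used Queue, which can serve as a shared queue for communication between processes.

If you replaced the Queue in the program above with an ordinary list, it would not work at all. Even if one process modified the list, another process could not see the change.

So for communication between processes, the queue must be a Queue, and here that means multiprocessing.Queue.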
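The difference between a plain list and a multiprocessing.Queue can be seen in a small experiment. In this sketch, the fill function is a hypothetical helper that writes the same items to both containers from a child process.

```python
from multiprocessing import Process, Queue


def fill(plain_list, queue):
    # Runs in the child: both containers receive the same items.
    for item in ('a', 'b', 'c'):
        plain_list.append(item)   # changes only the child's copy of the list
        queue.put(item)           # sent back to the parent through the queue


if __name__ == '__main__':
    plain_list = []
    queue = Queue()
    child = Process(target=fill, args=(plain_list, queue))
    child.start()
    received = [queue.get() for _ in range(3)]   # read before join()
    child.join()
    print('list in parent: ', plain_list)        # still empty
    print('queue in parent:', received)
```

Each process has its own memory, so the child only ever appends to its own copy of the list, while the queue actually transfers the items to the parent.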

Using the same example again, one process puts data into the queue and another process takes it out.

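from multiprocessing import Process, Semaphore, Lock, Queue
import time
from random import random

class Consumer(Process):
    def __init__(self, buffer, empty, full, lock):
        Process.__init__(self)
        # The shared objects are passed in explicitly so that both processes
        # use the same ones under every start method (fork, spawn, forkserver)
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.full.acquire()
            self.lock.acquire()
            print('Consumer get', self.buffer.get())
            time.sleep(1)
            self.lock.release()
            self.empty.release()

class Producer(Process):
    def __init__(self, buffer, empty, full, lock):
        Process.__init__(self)
        self.buffer, self.empty, self.full, self.lock = buffer, empty, full, lock

    def run(self):
        while True:
            self.empty.acquire()
            self.lock.acquire()
            num = random()
            print('Producer put', num)
            self.buffer.put(num)
            time.sleep(1)
            self.lock.release()
            self.full.release()

if __name__ == '__main__':
    buffer = Queue(10)
    empty = Semaphore(2)   # number of free buffer slots
    full = Semaphore(0)    # number of occupied buffer slots
    lock = Lock()
    p = Producer(buffer, empty, full, lock)
    c = Consumer(buffer, empty, full, lock)
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    # Not reached: both loops run forever (stop the program with Ctrl+C)
    print('Ended!')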

Output:

Producer put 0.719213647437

Producer put 0.44287326683

Consumer get 0.719213647437

Consumer get 0.44287326683

Producer put 0.722859424381

Producer put 0.525321338921

Consumer get 0.722859424381

Consumer get 0.525321338921

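You can see that the producer puts data into the queue and the consumer takes it out.

The get method takes two parameters, block and timeout. block defaults to True, i.e. get is blocking.

Calling get on an empty queue blocks by default. To avoid waiting indefinitely, either set block to False, which is what get_nowait() does and which raises the Empty exception immediately if no item is available, or keep block as True and set a timeout, in which case Empty is raised if no item arrives within that time. The exception is queue.Empty in Python 3 (Queue.Empty in Python 2).

Likewise, calling put on a full queue blocks by default. Either set block to False, which is what put_nowait() does and which raises the Full exception immediately if there is no free slot, or keep block as True and set a timeout, in which case Full is raised if no slot becomes free within that time. The exception is queue.Full in Python 3 (Queue.Full in Python 2).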
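The blocking and non-blocking variants of get and put can be tried out in a single process. This is a minimal sketch for Python 3; the take and offer helpers are illustrative wrappers, and the capacity of 1 is chosen only so that the queue fills up quickly.

```python
from multiprocessing import Queue
import queue   # Python 3 module that defines the Empty and Full exceptions


def take(q, wait=None):
    # wait=None: do not block at all; otherwise block for up to `wait` seconds.
    try:
        if wait is None:
            return q.get(block=False)         # same as q.get_nowait()
        return q.get(block=True, timeout=wait)
    except queue.Empty:
        return None


def offer(q, item):
    # Non-blocking put: report failure instead of waiting for a free slot.
    try:
        q.put(item, block=False)              # same as q.put_nowait(item)
        return True
    except queue.Full:
        return False


if __name__ == '__main__':
    q = Queue(1)                  # capacity 1
    print(offer(q, 'first'))      # True
    print(offer(q, 'second'))     # False: the queue is full
    print(take(q, wait=1))        # first
    print(take(q))                # None: the queue is empty
    print(take(q, wait=0.1))      # None, after waiting about 0.1 s
```

Returning None for an empty queue is a simplification made here; it would be ambiguous if None were a legitimate queue item.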

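Other commonly used queue methods:

Queue.qsize() returns the approximate size of the queue, but it does not work on Mac OS, where it raises NotImplementedError.

The reason:

def qsize(self):
    # Raises NotImplementedError on Mac OSX because of broken sem_getvalue()
    return self._maxsize - self._sem._semlock._get_value()

Queue.empty() returns True if the queue is empty, otherwise False

Queue.full() returns True if the queue is full, otherwise False

Queue.get([block[, timeout]]) removes and returns an item from the queue; timeout is the maximum wait time

Queue.get_nowait() is equivalent to Queue.get(False)

Queue.put(item[, block[, timeout]]) puts an item into the queue, blocking by default; timeout is the maximum wait time

Queue.put_nowait(item) is equivalent to Queue.put(item, False)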

Pipe

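A pipe, as the name suggests, has one end that sends and one end that receives.

A Pipe can be one-way (half-duplex) or two-way (duplex). multiprocessing.Pipe(duplex=False) creates a one-way pipe (the default is two-way). One process sends an object into one end of the pipe, and the process at the other end receives it. A one-way pipe allows sending from only one end, while a two-way pipe allows sending from both ends.

Here is an example: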

from multiprocessing import Process, Pipe

class Consumer(Process):
    def __init__(self, pipe):
        Process.__init__(self)
        self.pipe = pipe

    def run(self):
        self.pipe.send('Consumer Words')
        print('Consumer Received:', self.pipe.recv())

class Producer(Process):
    def __init__(self, pipe):
        Process.__init__(self)
        self.pipe = pipe

    def run(self):
        print('Producer Received:', self.pipe.recv())
        self.pipe.send('Producer Words')

if __name__ == '__main__':
    # Pipe() returns the two connection objects, one for each end
    pipe = Pipe()
    p = Producer(pipe[0])
    c = Consumer(pipe[1])
    p.daemon = c.daemon = True
    p.start()
    c.start()
    p.join()
    c.join()
    print('Ended!')

Here a pipe is created, two-way by default, and its two ends are passed to the two processes, which send to and receive from each other. The result:

Producer Received: Consumer Words

Consumer Received: Producer Words

Ended!
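The example above uses the default two-way pipe. For the one-way case, Pipe(duplex=False) returns a connection that can only receive followed by one that can only send. The sketch below assumes two illustrative helpers, produce and consume, and uses None as an end-of-stream marker, which is a convention chosen here rather than something Pipe requires.

```python
from multiprocessing import Process, Pipe


def produce(send_end, items):
    # Send each item, then None to signal that nothing more will follow.
    for item in items:
        send_end.send(item)
    send_end.send(None)
    send_end.close()


def consume(recv_end):
    # Read until the end-of-stream marker arrives.
    received = []
    while True:
        item = recv_end.recv()
        if item is None:
            return received
        received.append(item)


if __name__ == '__main__':
    # With duplex=False the first connection only receives, the second only sends.
    recv_end, send_end = Pipe(duplex=False)
    producer = Process(target=produce, args=(send_end, ['a', 'b', 'c']))
    producer.start()
    print(consume(recv_end))
    producer.join()
```

Calling send() on the receive-only end, or recv() on the send-only end, raises an error, which can help catch wiring mistakes early.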

That is a brief introduction to Pipe.

Pool

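When using Python for system administration, especially when operating on many files and directories at once or controlling many remote hosts, running operations in parallel can save a great deal of time. When the number of targets is small, you can create processes directly with Process from multiprocessing. A dozen or so is fine, but with hundreds or thousands of targets, limiting the number of processes by hand becomes tedious, and this is where a process pool comes in.

A Pool provides a specified number of worker processes. When a new request is submitted to the pool, it runs right away if a worker is free; if all the workers are busy, the request waits until one of them finishes its current task.

You need to understand the concepts of blocking and non-blocking here.

Blocking and non-blocking describe the state of a program while it waits for the result of a call (a message or a return value).

Blocking means waiting until the result of the call is available; until then, the current process is suspended.

A Pool can be used in a blocking or a non-blocking way. Non-blocking means that after submitting a task you do not have to wait for it to finish before submitting the next one; blocking is the opposite.

First, an example of the non-blocking usage: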

from multiprocessing import Pool
import time

def function(index):
    print('Start process:', index)
    time.sleep(3)
    print('End process', index)

if __name__ == '__main__':
    pool = Pool(processes=3)
    for i in range(4):
        # Returns immediately; the task runs in a worker process
        pool.apply_async(function, (i,))

    print("Started processes")
    pool.close()
    pool.join()
    print("Subprocess done.")

Here the apply_async method is used, which is non-blocking.

Output:

Started processes

Start process: Start process: 0

Start process: 2

End processEnd process 0

Start process: 3

End process 2

End process 3

Subprocess done.

You can see that the tasks start running as soon as they are added; there is no need to wait for one to finish before adding the next.

Now the blocking usage:

from multiprocessing import Pool
import time

def function(index):
    print('Start process:', index)
    time.sleep(3)
    print('End process', index)

if __name__ == '__main__':
    pool = Pool(processes=3)
    for i in range(4):
        # Blocks until the task has finished
        pool.apply(function, (i,))

    print("Started processes")
    pool.close()
    pool.join()
    print("Subprocess done.")

The only change needed is to replace apply_async with apply.

The output is as follows:

Start process: 0

End process 0

Start process: 1

End process 1

Start process: 2

End process 2

Start process: 3

End process 3

Started processes

Subprocess done.

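That makes it easier to understand, doesn't it?

The functions are explained below:

apply_async(func[, args[, kwds[, callback]]]) is non-blocking; apply(func[, args[, kwds]]) is blocking.

close() closes the pool so that it no longer accepts new tasks.

terminate() stops the worker processes without processing the remaining tasks.

join() blocks the main process until the worker processes exit; it must be called after close() or terminate().

Each task can also return a result from its function. apply and apply_async let you obtain that result and process it further.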

from multiprocessing import Pool
import time

def function(index):
    print('Start process:', index)
    time.sleep(3)
    print('End process', index)
    return index

if __name__ == '__main__':
    pool = Pool(processes=3)
    for i in range(4):
        result = pool.apply_async(function, (i,))
        # get() blocks until this task has finished and returns its result
        print(result.get())

    print("Started processes")
    pool.close()
    pool.join()
    print("Subprocess done.")

Output:

Start process: 0

End process 0

Start process: 1

End process 1

Start process: 2

End process 2

Start process: 3

End process 3

Started processes

Subprocess done.
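In the output above the tasks run one after another, because get() is called inside the loop and waits for each task before the next one is submitted. One possible variation submits all tasks first and collects the results afterwards. The slow_square function and the collect helper are illustrative names, not part of the original article.

```python
from multiprocessing import Pool
import time


def slow_square(index):
    # Stand-in for a task that takes a while and returns a value.
    time.sleep(0.2)
    return index * index


def collect(pool, indexes):
    # Submit every task first so that they can run in parallel...
    pending = [pool.apply_async(slow_square, (i,)) for i in indexes]
    # ...then wait for the results in submission order.
    return [result.get() for result in pending]


if __name__ == '__main__':
    pool = Pool(processes=3)
    print(collect(pool, range(4)))   # [0, 1, 4, 9]
    pool.close()
    pool.join()
```

Each apply_async call returns a result object right away, so keeping those objects in a list lets the pool work on several tasks while the parent waits.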

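There is also a very handy map method.

If you have a batch of data to process and every item needs to go through the same function, map is a very good fit.

For example, suppose you have a list containing all the URLs and a function that fetches and parses the content of each URL. You can pass the function as the first argument to map and the list of URLs as the second.

Here is an example: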

from multiprocessing import Pool
import requests
from requests.exceptions import ConnectionError

def scrape(url):
    try:
        print(requests.get(url))
    except ConnectionError:
        print('Error Occured', url)
    finally:
        print('URL', url, 'Scraped')

if __name__ == '__main__':
    pool = Pool(processes=3)
    urls = [
        'https://www.baidu.com',
        'http://www.meituan.com/',
        'http://blog.csdn.net/',
        'http://xxxyxxx.net'
    ]
    # Calls scrape(url) for every URL, spread across the worker processes
    pool.map(scrape, urls)

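Here a Pool is initialized with 3 processes. If the number is not specified, it is chosen automatically from the number of CPU cores.

Then, given a list of links, map goes through the URLs and runs scrape on each of them.

Output: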

URL http://blog.csdn.net/ Scraped

URL https://www.baidu.com Scraped

Error Occured http://xxxyxxx.net

URL http://xxxyxxx.net Scraped

URL http://www.meituan.com/ Scraped

You can see how easily the whole list is processed.
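map also collects the return values of the function into a list, in the same order as the inputs. The sketch below avoids network access: parse is a hypothetical stand-in for a real scraping function, and scrape_all is an illustrative helper that uses the pool as a context manager (available in Python 3).

```python
from multiprocessing import Pool


def parse(url):
    # Stand-in for a real scrape: returns the URL's scheme and its length.
    scheme = url.split('://', 1)[0]
    return scheme, len(url)


def scrape_all(urls, processes=3):
    # map() returns only after every result is ready; leaving the with-block
    # then shuts the pool down.
    with Pool(processes=processes) as pool:
        return pool.map(parse, urls)


if __name__ == '__main__':
    urls = [
        'https://www.baidu.com',
        'http://www.meituan.com/',
        'http://blog.csdn.net/',
    ]
    for url, result in zip(urls, scrape_all(urls)):
        print(url, result)
```

Returning values instead of printing inside the worker keeps the output in one place and sidesteps interleaved prints from different processes.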

Conclusion

Compared with multithreading, multiprocessing is considerably more powerful and more widely applicable. I hope this article is helpful!

References

https://docs.python.org/2/library/multiprocessing.html

http://www.cnblogs.com/vamei/archive/2012/10/12/2721484.html

http://www.cnblogs.com/kaituorensheng/p/4445418.html

https://my.oschina.net/yangyanxing/blog/296052

Reposted from: 静觅 » Python爬虫进阶六之多进程的用法