I used to write crawlers in PHP, using Snoopy together with simple_html_dom. The combination worked well enough; at least it got the job done.
PHP has never had a decent multi-threading mechanism. You can fake parallelism with various tricks (leaning on an Apache or Nginx server, forking child processes, or dynamically generating several PHP scripts and running them as separate processes), but none of these feel comfortable, either in terms of code structure or ease of use. I have also heard of pthreads, a PHP extension that provides real multi-threading; its description on GitHub reads: "Absolutely, this is not a hack, we don't use forking or any other such nonsense, what you create are honest to goodness posix threads that are completely compatible with PHP and safe ... this is true multi-threading :)"
But I digress; PHP is not the subject of this article. Having decided to try scraping with Python, I also wanted to learn about Python's threading along the way. I had long heard various experts praise how pleasant Python is to use, but without actually trying it myself I had no clear sense of where its strengths lie or which problems it is suited to.
Enough preamble; on to the main topic.
search_config.py
#!/usr/bin/env python
# coding=utf-8

class config:
    keyword = '青岛'
    search_type = 'shop'
    url = 'http://s.taobao.com/search?q=' + keyword + '&commend=all&search_type=' + search_type + '&sourceId=tb.index&initiative_id=tbindexz_20131207&app=shopsearch&s='
single_scrapy.py
#!/usr/bin/env python
# coding=utf-8
import requests
from search_config import config

class Scrapy():
    def __init__(self, threadname, start_num):
        self.threadname = threadname
        self.start_num = start_num
        print threadname + 'start.....'

    def run(self):
        # fetch one result page; the call blocks until the request finishes
        url = config.url + self.start_num
        response = requests.get(url)
        print self.threadname + 'end......'

def main():
    # sequential version: the pages are fetched one after another
    for i in range(0, 13, 6):
        scrapy = Scrapy('scrapy', str(i))
        scrapy.run()

if __name__ == '__main__':
    main()
This is the simplest, most conventional way to scrape: loop over the URLs in order and fetch each page in turn. As the screenshot shows, this approach is actually very inefficient: while one request is doing network I/O, the others have to wait for it to finish before they can even start. In other words, each request blocks all the ones after it.
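To quantify the difference between the versions that follow, a minimal timing helper can be used (the name timed is mine, not part of the original scripts):

#!/usr/bin/env python
# coding=utf-8
import time

def timed(label, func):
    # run func() once and report how long it took
    start = time.time()
    func()
    print label + ' took %.2fs' % (time.time() - start)

# usage, assuming the main() functions in this article are importable:
# import single_scrapy
# timed('single-threaded scrape', single_scrapy.main)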
#!/usr/bin/env python
# coding=utf-8
import requests
from search_config import config
import threading

class Scrapy(threading.Thread):
    def __init__(self, threadname, start_num):
        threading.Thread.__init__(self, name = threadname)
        self.threadname = threadname
        self.start_num = start_num
        print threadname + 'start.....'

    # override the run method of Thread
    def run(self):
        url = config.url + self.start_num
        response = requests.get(url)
        print self.threadname + 'end......'

def main():
    for i in range(0, 13, 6):
        scrapy = Scrapy('scrapy', str(i))
        scrapy.start()

if __name__ == '__main__':
    main()
The screenshot shows that scraping the same number of pages takes far less time once multiple threads are used, although CPU utilization is higher.
Once we have the HTML of a page, we need to parse it and extract the pieces of information we care about, which in this case are the shop title, its rating (rank), and its link.
With the BeautifulSoup library you can extract elements directly by HTML attributes such as class or id, which is far more intuitive and much less painful than hand-written regular expressions; the price, of course, is a correspondingly large hit in execution speed.
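As a quick illustration of attribute-based extraction (a minimal sketch; the HTML fragment and shop names below are made up and only mimic the structure the parser expects):

#!/usr/bin/env python
# coding=utf-8
from BeautifulSoup import BeautifulSoup

# a made-up HTML fragment, just to show extraction by id/class/attr
html = '''
<ul id="list-container">
  <li class="list-item"><a trace="shop" title="Shop A" href="http://example.com/a">Shop A</a></li>
  <li class="list-item"><a trace="shop" title="Shop B" href="http://example.com/b">Shop B</a></li>
</ul>
'''

soup = BeautifulSoup(html)
ul = soup.find('ul', {'id': 'list-container'})
for li in ul.findAll('li', {'class': 'list-item'}):
    attrs = dict(li.find('a', {'trace': 'shop'}).attrs)
    print attrs['title'], attrs['href']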
Here the Queue module is used to implement a producer-consumer pattern: the scraping threads produce raw responses, and the parsing threads consume them.
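Before the full crawler, a minimal, self-contained sketch of the producer-consumer pattern with Queue (the queue and function names here are mine, not from the original scripts):

#!/usr/bin/env python
# coding=utf-8
from Queue import Queue
import threading

tasks = Queue()    # work items produced by the main thread
results = Queue()  # values produced by the worker, consumed at the end

def worker():
    while True:
        item = tasks.get()
        if item is None:       # sentinel: no more work
            break
        results.put(item * 2)  # stand-in for real "processing"
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

for i in range(5):
    tasks.put(i)
tasks.put(None)
t.join()

while not results.empty():
    print results.get()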
#!/usr/bin/env python
# coding=utf-8
import requests
from BeautifulSoup import BeautifulSoup
from search_config import config
from Queue import Queue
import threading

class Scrapy(threading.Thread):
    def __init__(self, threadname, queue, out_queue):
        threading.Thread.__init__(self, name = threadname)
        self.sharedata = queue
        self.out_queue = out_queue
        self.threadname = threadname
        print threadname + 'start.....'

    def run(self):
        url = config.url + self.sharedata.get()
        response = requests.get(url)
        self.out_queue.put(response)
        print self.threadname + 'end......'

class Parse(threading.Thread):
    def __init__(self, threadname, queue, out_queue):
        threading.Thread.__init__(self, name = threadname)
        self.sharedata = queue
        self.out_queue = out_queue
        self.threadname = threadname
        print threadname + 'start.....'

    def run(self):
        response = self.sharedata.get()
        body = response.content.decode('gbk').encode('utf-8')
        soup = BeautifulSoup(body)
        ul_html = soup.find('ul', {'id': 'list-container'})
        lists = ul_html.findAll('li', {'class': 'list-item'})
        stores = []
        for list in lists:
            store = {}
            try:
                infos = list.findAll('a', {'trace': 'shop'})
                for info in infos:
                    attrs = dict(info.attrs)
                    if attrs.has_key('class'):
                        if 'rank' in attrs['class']:
                            rank_string = attrs['class']
                            rank_num = rank_string[-2:]
                            if (rank_num[0] == '-'):
                                store['rank'] = rank_num[-1]
                            else:
                                store['rank'] = rank_num
                    if attrs.has_key('title'):
                        store['title'] = attrs['title']
                        store['href'] = attrs['href']
            except AttributeError:
                pass
            if store:
                stores.append(store)
        for store in stores:
            print store['title'] + ' ' + store['rank']
        print self.threadname + 'end......'

def main():
    queue = Queue()
    targets = Queue()
    stores = Queue()
    scrapy = []
    for i in range(0, 13, 6):
        # queue:   the original requests (page offsets)
        # targets: responses waiting to be parsed
        # stores:  parsed results; to keep things simple the results are
        #          printed directly in the thread, so this queue is unused
        queue.put(str(i))
        scrapy = Scrapy('scrapy', queue, targets)
        scrapy.start()
        parse = Parse('parse', targets, stores)
        parse.start()

if __name__ == '__main__':
    main()
Looking at this run, the scrapy stage finishes very quickly and parse starts early as well, yet the program then sits on the parse stage for a long time before producing output, and each parsed result takes another 3 to 5 seconds to appear. Granted, I am running this on an old, beat-up IBM laptop, but with multiple threads it really should not be this slow.
Let's look at a single-threaded run over the same data. For convenience I added Redis to the previous multi_scrapy script and used it to store the raw crawled pages, so that single_parse.py can process them on its own, which is simpler.
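The modified multi_scrapy is not shown; presumably the change amounts to pushing each raw page body onto a Redis list named 'targets', which is the list single_parse.py pops from below. A sketch of what that push might look like (my guess, not the original code):

# inside the crawler's run() method, after the page has been fetched
import redis

r = redis.StrictRedis(host='localhost', port=6379)

def save_page(response):
    # store the raw page body so single_parse.py can lpop it from 'targets' later
    r.rpush('targets', response.content)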
single_parse.py

#!/usr/bin/env python
# coding=utf-8
from BeautifulSoup import BeautifulSoup
import redis

class Parse():
    def __init__(self, threadname, content):
        self.threadname = threadname
        self.content = content
        print threadname + 'start.....'

    def run(self):
        response = self.content
        if response:
            body = response.decode('gbk').encode('utf-8')
            soup = BeautifulSoup(body)
            ul_html = soup.find('ul', {'id': 'list-container'})
            lists = ul_html.findAll('li', {'class': 'list-item'})
            stores = []
            for list in lists:
                store = {}
                try:
                    infos = list.findAll('a', {'trace': 'shop'})
                    for info in infos:
                        attrs = dict(info.attrs)
                        if attrs.has_key('class'):
                            if 'rank' in attrs['class']:
                                rank_string = attrs['class']
                                rank_num = rank_string[-2:]
                                if (rank_num[0] == '-'):
                                    store['rank'] = rank_num[-1]
                                else:
                                    store['rank'] = rank_num
                        if attrs.has_key('title'):
                            store['title'] = attrs['title']
                            store['href'] = attrs['href']
                except AttributeError:
                    pass
                if store:
                    stores.append(store)
            for store in stores:
                try:
                    print store['title'] + ' ' + store['rank']
                except KeyError:
                    pass
            print self.threadname + 'end......'
        else:
            pass

def main():
    r = redis.StrictRedis(host='localhost', port=6379)
    while True:
        content = r.lpop('targets')
        if (content):
            parse = Parse('parse', content)
            parse.run()
        else:
            break

if __name__ == '__main__':
    main()
The result shows that the single-threaded version takes about the same amount of time as the multi-threaded one. The multi-threaded version above does include the time spent fetching the pages, but as the first example already demonstrated, with threads the three pages can easily be fetched within a second. In other words, multi-threading the parsing stage brings no real gain in efficiency.
Since the GIL belongs to a single Python interpreter process, running several interpreter processes sidesteps it, and it also stops all of them from being crammed onto one core. So the best solution is to combine multi-threading with multi-processing: use threads for the I/O-bound work and an appropriate number of processes for the CPU-bound work, and you can actually keep the CPU fully loaded :)
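A minimal sketch of that combination, under my own assumptions about the split (threads fetch, a process pool parses; fetch_page and parse_page are hypothetical stand-ins for the requests and BeautifulSoup code above):

#!/usr/bin/env python
# coding=utf-8
from Queue import Queue
from multiprocessing import Pool
import threading

def fetch_page(offset, out_queue):
    # I/O-bound: in the real crawler this would be requests.get(config.url + offset)
    out_queue.put('<html>page %s</html>' % offset)

def parse_page(body):
    # CPU-bound: in the real crawler this would be the BeautifulSoup parsing
    return len(body)

def main():
    pages = Queue()

    # threads for the I/O-bound fetching
    threads = [threading.Thread(target=fetch_page, args=(str(i), pages))
               for i in range(0, 13, 6)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    bodies = []
    while not pages.empty():
        bodies.append(pages.get())

    # a small process pool for the CPU-bound parsing (not limited by the GIL)
    pool = Pool(processes=2)
    print pool.map(parse_page, bodies)

if __name__ == '__main__':
    main()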