Think back: the experiments and examples in the previous posts all used a single spider. In real-world development, though, a project almost certainly needs more than one. That raises two questions: first, how do you create multiple spiders within the same project? Second, once there are several spiders, how do you run them all?
Note: this article builds on the previous posts and experiments. If you missed them, or if anything here is unclear, you can catch up with them here:
The pitfalls I hit installing the Python crawler framework Scrapy, and some thoughts beyond programming
Scrapy crawler diary: creating a project, extracting data, and saving it as JSON
Scrapy crawler diary: writing the scraped content into a MySQL database
How to keep your Scrapy crawler from being banned
1. Creating spiders
1) Create additional spiders with scrapy genspider <spidername> <domain>
scrapy genspider CnblogsHomeSpider cnblogs.com
The command above creates a spider whose name is CnblogsHomeSpider and whose start_urls is http://www.cnblogs.com/.
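The exact template depends on your Scrapy version, but the generated file under spiders/ looks roughly like this (the class name below is only an approximation of what the generator produces):

# -*- coding: utf-8 -*-
import scrapy


class CnblogshomespiderSpider(scrapy.Spider):
    name = "CnblogsHomeSpider"          # this is what `scrapy list` and `scrapy crawl` use
    allowed_domains = ["cnblogs.com"]
    start_urls = (
        'http://www.cnblogs.com/',
    )

    def parse(self, response):
        pass

All that matters for the rest of this post is the name attribute, since that is how Scrapy refers to the spider.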
2) See which spiders the project contains with scrapy list
[root@bogon cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider
This shows that my project contains two spiders, one named CnblogsHomeSpider and the other named CnblogsSpider.
For more on Scrapy commands, see: http://doc.scrapy.org/en/latest/topics/commands.html
2. Running several spiders at the same time
Our project now has two spiders, so how do we get them both running? You might suggest writing a shell script that calls them one after another, or a Python script that runs them in turn (a minimal sketch of that approach follows below), and indeed quite a few people on stackoverflow.com implement it exactly that way. The official documentation, however, describes the approaches shown after that sketch.
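As a baseline, here is a minimal sketch of the one-by-one approach. It uses only the two spider names reported by scrapy list and simply shells out to scrapy crawl, so each spider runs in its own process:

import subprocess

# Launch each spider in its own `scrapy crawl` subprocess, one after another.
for name in ("CnblogsHomeSpider", "CnblogsSpider"):
    subprocess.check_call(["scrapy", "crawl", name])

The approaches from the official documentation below avoid spawning extra processes and run everything from a single script: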
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
The key here is scrapy.crawler.CrawlerProcess, which lets you run a spider from a script. More examples can be found at: https://github.com/scrapinghub/testspiders
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
These are the approaches the official documentation gives for running spiders from a script.
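Applied to this project, a minimal sketch might look like the following. The import paths are assumptions (adjust them to wherever your spider classes actually live), and get_project_settings() is used so that settings.py, including the MySQL pipeline from the earlier post, still takes effect:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import paths -- adjust to the actual modules that define your spiders.
from cnblogs.spiders.CnblogsHomeSpider import CnblogsHomeSpider
from cnblogs.spiders.CnblogsSpider import CnblogsSpider

# get_project_settings() loads the project's settings.py, so pipelines and
# middleware behave just as they do under `scrapy crawl`.
process = CrawlerProcess(get_project_settings())
process.crawl(CnblogsHomeSpider)
process.crawl(CnblogsSpider)
process.start()  # blocks until both spiders have finished

The next section takes a different route and wires the same idea into a custom Scrapy command.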
3. Running spiders via a custom Scrapy command
1) Create a commands directory
mkdir commands
Note: the commands directory sits at the same level as the spiders directory.
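Assuming the standard layout that scrapy startproject generates, the relevant part of the tree ends up looking roughly like this:

cnblogs/
    scrapy.cfg
    cnblogs/
        __init__.py
        items.py
        pipelines.py
        settings.py
        commands/
        spiders/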
2) Add a file named crawlall.py under commands
The idea is to adapt Scrapy's built-in crawl command so that it runs all of the spiders at once. The source of crawl can be found here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py
from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print("*********crawlall spidername************" + spidername)
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()
The key points are self.crawler_process.spider_loader.list(), which returns every spider in the project, and self.crawler_process.crawl(), which schedules each of them to run.
3) Add an __init__.py file under the commands directory
touch __init__.py
Note: do not skip this step. It cost me a whole day of head-scratching, which I can only blame on my self-taught background.
If you leave it out, you get an exception like this:
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands
At first I could not work out the cause at all. It ate up an entire day until I got help from other users on http://stackoverflow.com/. Thanks again to the almighty internet; how much nicer things would be without that wall. But I digress; back to the topic.
4) Create a setup.py in the directory where settings.py lives (skipping this step makes no difference here; I am not sure what the official documentation intends by including it).
from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'crawlall=cnblogs.commands:crawlall',
          ],
      },
      )
This file registers a crawlall command: cnblogs.commands is the package that holds the command module, and crawlall is the command's name.
5) Add the following setting to settings.py:
COMMANDS_MODULE = 'cnblogs.commands'
6) Run the command: scrapy crawlall
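For example (the -a option is passed through to every spider exactly as the code above hands it to crawl(); category=python is just a hypothetical spider argument, which the spiders in this project do not necessarily use):

scrapy crawlall
scrapy crawlall -a category=python

Every spider returned by scrapy list gets scheduled, and the command blocks until all of them have finished.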