Think back: the experiments and examples in the previous posts all used a single spider. In real-world development, though, a project almost certainly needs more than one. That raises two questions: first, how do you create multiple spiders within the same project? Second, once there are several spiders, how do you run them all?
Note: this article builds on the previous posts and experiments. If you missed them, or if anything here is unclear, you can catch up with them here:
The pitfalls I hit installing the Python crawler framework Scrapy, and some thoughts beyond programming
Scrapy crawler diary: creating a project, extracting data, and saving it as JSON
Scrapy crawler diary: writing the scraped content into a MySQL database
How to keep your Scrapy crawler from being banned
1. Creating spiders
1) Create additional spiders with scrapy genspider <spidername> <domain>
scrapy genspider CnblogsHomeSpider cnblogs.com
The command above creates a spider whose name is CnblogsHomeSpider and whose start_urls is http://www.cnblogs.com/.
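The exact template depends on your Scrapy version, but the generated file under spiders/ looks roughly like this (the class name below is only an approximation of what the generator produces):

# -*- coding: utf-8 -*-
import scrapy


class CnblogshomespiderSpider(scrapy.Spider):
    name = "CnblogsHomeSpider"          # this is what `scrapy list` and `scrapy crawl` use
    allowed_domains = ["cnblogs.com"]
    start_urls = (
        'http://www.cnblogs.com/',
    )

    def parse(self, response):
        pass

All that matters for the rest of this post is the name attribute, since that is how Scrapy refers to the spider.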
2) See which spiders the project contains with scrapy list
[root@bogon cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider
This shows that my project contains two spiders, one named CnblogsHomeSpider and the other named CnblogsSpider.
For more on Scrapy commands, see: http://doc.scrapy.org/en/latest/topics/commands.html
2. Running several spiders at the same time
Our project now has two spiders, so how do we get them both running? You might suggest writing a shell script that calls them one after another, or a Python script that runs them in turn (a minimal sketch of that approach follows below), and indeed quite a few people on stackoverflow.com implement it exactly that way. The official documentation, however, describes the approaches shown after that sketch.
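As a baseline, here is a minimal sketch of the one-by-one approach. It uses only the two spider names reported by scrapy list and simply shells out to scrapy crawl, so each spider runs in its own process:

import subprocess

# Launch each spider in its own `scrapy crawl` subprocess, one after another.
for name in ("CnblogsHomeSpider", "CnblogsSpider"):
    subprocess.check_call(["scrapy", "crawl", name])

The approaches from the official documentation below avoid spawning extra processes and run everything from a single script: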
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
The key here is scrapy.crawler.CrawlerProcess, which lets you run a spider from a script. More examples can be found at: https://github.com/scrapinghub/testspiders
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
These are the approaches the official documentation gives for running spiders from a script.
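Applied to this project, a minimal sketch might look like the following. The import paths are assumptions (adjust them to wherever your spider classes actually live), and get_project_settings() is used so that settings.py, including the MySQL pipeline from the earlier post, still takes effect:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import paths -- adjust to the actual modules that define your spiders.
from cnblogs.spiders.CnblogsHomeSpider import CnblogsHomeSpider
from cnblogs.spiders.CnblogsSpider import CnblogsSpider

# get_project_settings() loads the project's settings.py, so pipelines and
# middleware behave just as they do under `scrapy crawl`.
process = CrawlerProcess(get_project_settings())
process.crawl(CnblogsHomeSpider)
process.crawl(CnblogsSpider)
process.start()  # blocks until both spiders have finished

The next section takes a different route and wires the same idea into a custom Scrapy command.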
3. Running spiders via a custom Scrapy command
1) Create a commands directory
mkdir commands
Note: the commands directory sits at the same level as the spiders directory.
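Assuming the standard layout that scrapy startproject generates, the relevant part of the tree ends up looking roughly like this:

cnblogs/
    scrapy.cfg
    cnblogs/
        __init__.py
        items.py
        pipelines.py
        settings.py
        commands/
        spiders/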
2) Add a file named crawlall.py under commands
The idea is to adapt Scrapy's built-in crawl command so that it runs all of the spiders at once. The source of crawl can be found here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py
from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print("*********crawlall spidername************" + spidername)
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()
The key points are self.crawler_process.spider_loader.list(), which returns every spider in the project, and self.crawler_process.crawl(), which schedules each of them to run.
3) Add an __init__.py file under the commands directory
touch __init__.py
Note: do not skip this step. It cost me a whole day of head-scratching, which I can only blame on my self-taught background.
If you leave it out, you get an exception like this:
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands
At first I could not work out the cause at all. It ate up an entire day until I got help from other users on http://stackoverflow.com/. Thanks again to the almighty internet; how much nicer things would be without that wall. But I digress; back to the topic.
4) Create a setup.py in the directory where settings.py lives (skipping this step makes no difference here; I am not sure what the official documentation intends by including it).
from setuptools import setup, find_packages

setup(name='scrapy-mymodule',
      entry_points={
          'scrapy.commands': [
              'crawlall=cnblogs.commands:crawlall',
          ],
      },
      )
This file registers a crawlall command: cnblogs.commands is the package that holds the command module, and crawlall is the command's name.
5) Add the following setting to settings.py:
COMMANDS_MODULE = 'cnblogs.commands'
6) Run the command: scrapy crawlall
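For example (the -a option is passed through to every spider exactly as the code above hands it to crawl(); category=python is just a hypothetical spider argument, which the spiders in this project do not necessarily use):

scrapy crawlall
scrapy crawlall -a category=python

Every spider returned by scrapy list gets scheduled, and the command blocks until all of them have finished.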