Python.Scrapy.11-scrapy-source-code-analysis-part-1

Scrapy Source Code Analysis Series - 1: spider, spidermanager, crawler, cmdline, command

The source code version analyzed is 0.24.6, url: https://github.com/DiamondStudio/scrapy/blob/0.24.6

As the Scrapy source tree on GitHub shows, it contains the following sub-packages:

commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib

and the following modules:

_monkeypatches.py, cmdline.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py, extension.py, interface.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py, middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py, spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py

Let's start the analysis with the most important modules.

0. Third-party libraries/frameworks that Scrapy depends on

twisted

1. Modules: spider, spidermanager, crawler, cmdline, command

1.1 spider.py spidermanager.py crawler.py

spider.py defines the spider base class: BaseSpider. Each spider instance can be bound to only one crawler attribute. So what exactly does a crawler provide?
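Before answering that, here is a minimal sketch of how the base spider in 0.24.x binds to its crawler. This is simplified from memory rather than quoted verbatim, but it shows the single-binding assertion and the read-only property behind the "only one crawler" rule:

class Spider(object):
    """Simplified sketch of the spider/crawler binding in Scrapy 0.24.x."""

    name = None

    def set_crawler(self, crawler):
        # A spider may be bound to exactly one crawler; binding twice is an error.
        assert not hasattr(self, '_crawler'), "Spider already bounded to %s" % crawler
        self._crawler = crawler

    @property
    def crawler(self):
        # Reading the attribute before binding is also treated as an error.
        assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
        return self._crawler

    @property
    def settings(self):
        # Convenience accessor: the spider sees its crawler's settings.
        return self.crawler.settings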

crawler.py defines the classes Crawler and CrawlerProcess.

The Crawler class depends on SignalManager, ExtensionManager, ExecutionEngine, and on the settings STATS_CLASS, SPIDER_MANAGER_CLASS, and LOG_FORMATTER.

The CrawlerProcess class runs multiple Crawlers sequentially within a single process and is responsible for starting the crawl; it depends on twisted.internet.reactor and twisted.internet.defer. This class shows up again in section 1.2 when we look at cmdline.py. A sketch of how Crawler wires its dependencies together follows.
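To make the dependency list above concrete, here is a simplified sketch (reconstructed from the 0.24.x crawler.py, not verbatim; error handling and several details are omitted) of how Crawler instantiates its collaborators from the settings:

from scrapy.core.engine import ExecutionEngine
from scrapy.extension import ExtensionManager
from scrapy.signalmanager import SignalManager
from scrapy.utils.misc import load_object


class Crawler(object):
    """Simplified sketch of scrapy.crawler.Crawler (0.24.x)."""

    def __init__(self, settings):
        self.configured = False
        self.settings = settings
        self.signals = SignalManager(self)                        # signal hub
        self.stats = load_object(settings['STATS_CLASS'])(self)   # stats collector
        spman_cls = load_object(settings['SPIDER_MANAGER_CLASS'])
        self.spiders = spman_cls.from_crawler(self)                # spider manager

    def configure(self):
        if self.configured:
            return
        self.configured = True
        lf_cls = load_object(self.settings['LOG_FORMATTER'])
        self.logformatter = lf_cls.from_crawler(self)
        self.extensions = ExtensionManager.from_crawler(self)      # enabled extensions
        self.engine = ExecutionEngine(self, self._spider_closed)   # _spider_closed callback not shown here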

 

spidermanager.py defines the SpiderManager class, which creates and manages all the website-specific spiders.

from zope.interface import implements

from scrapy.interfaces import ISpiderManager
from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes


class SpiderManager(object):

    implements(ISpiderManager)

    def __init__(self, spider_modules):
        self.spider_modules = spider_modules
        self._spiders = {}
        # Walk every module listed in SPIDER_MODULES and register the
        # spider classes found there, keyed by their `name` attribute.
        for name in self.spider_modules:
            for module in walk_modules(name):
                self._load_spiders(module)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._spiders[spcls.name] = spcls
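As a hedged usage sketch: the project module 'myproject.spiders' and the spider name 'example' below are hypothetical, and the list()/create() helpers are recalled from the 0.24.x manager rather than quoted verbatim:

# Hypothetical usage sketch; SpiderManager only needs the modules to walk
# (normally the SPIDER_MODULES setting of a project).
manager = SpiderManager(['myproject.spiders'])

print(manager.list())               # names of all spiders that were discovered
spider = manager.create('example')  # instantiate the spider registered as 'example'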

 

 

 

1.2 cmdline.py command.py

cmdline.py defines the public function execute(argv=None, settings=None).

The execute function is the entry point of the scrapy command-line tool, as shown below:

XiaoKL$ cat `which scrapy`
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())

 

So this script is a good entry point for digging into the Scrapy source. Here is the execute() function:

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)

 

The execute() function mainly does the following: loads the Scrapy command modules and resolves the command to run, parses the command-line arguments, obtains the settings, and creates the CrawlerProcess object.

The CrawlerProcess object, the settings, and the parsed command-line options are all handed to the ScrapyCommand (or subclass) instance.

Naturally, the next module to look at is the one that defines ScrapyCommand: command.py.

The concrete subclasses of ScrapyCommand live in the scrapy.commands sub-package; a sketch of how execute() discovers them is given below.
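The command lookup that execute() performs via _get_commands_dict() works roughly as follows. This is a simplified sketch reconstructed from the 0.24.x cmdline.py, not verbatim source: it walks scrapy.commands (plus the optional COMMANDS_MODULE setting), collects every ScrapyCommand subclass defined there, and names each command after the module it is defined in.

import inspect

from scrapy.command import ScrapyCommand
from scrapy.utils.misc import walk_modules


def _iter_command_classes(module_name):
    # Yield every ScrapyCommand subclass defined directly in the given package.
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and issubclass(obj, ScrapyCommand) \
                    and obj.__module__ == module.__name__:
                yield obj


def _get_commands_from_module(module, inproject):
    d = {}
    for cmd in _iter_command_classes(module):
        # Commands that require a project are hidden when run outside of one.
        if inproject or not cmd.requires_project:
            cmdname = cmd.__module__.split('.')[-1]   # e.g. scrapy.commands.crawl -> "crawl"
            d[cmdname] = cmd()
    return d


def _get_commands_dict(settings, inproject):
    # Built-in commands first, then any project-specific COMMANDS_MODULE additions.
    cmds = _get_commands_from_module('scrapy.commands', inproject)
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds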

 

The _run_print_help() function is the wrapper through which cmd.run() ultimately gets called to execute the command:

def _run_print_help(parser, func, *a, **kw):
    try:
        func(*a, **kw)
    except UsageError as e:
        if str(e):
            parser.error(str(e))
        if e.print_help:
            parser.print_help()
        sys.exit(2)

 

Here func is _run_command, whose implementation essentially just calls cmd.run():

def _run_command(cmd, args, opts):
    if opts.profile or opts.lsprof:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)

 

This decoupling between cmdline and the individual commands is a design worth borrowing in our own tools.

 

command.py defines the ScrapyCommand class, which serves as the base class for all Scrapy commands. Let's take a quick look at the interface/methods that ScrapyCommand provides:

class ScrapyCommand(object):

    requires_project = False
    crawler_process = None

    # default settings to be used for this command instead of global defaults
    default_settings = {}

    exitcode = 0

    def __init__(self):
        self.settings = None  # set in scrapy.cmdline

    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    @property
    def crawler(self):
        warnings.warn("Command's default `crawler` is deprecated and will be removed. "
            "Use `create_crawler` method to instatiate crawlers.",
            ScrapyDeprecationWarning)

        if not hasattr(self, '_crawler'):
            crawler = self.crawler_process.create_crawler()

            old_start = crawler.start
            self.crawler_process.started = False

            def wrapped_start():
                if self.crawler_process.started:
                    old_start()
                else:
                    self.crawler_process.started = True
                    self.crawler_process.start()

            crawler.start = wrapped_start

            self.set_crawler(crawler)

        return self._crawler

    # The remaining methods are listed with their bodies omitted in this excerpt:

    def syntax(self):
        pass  # body omitted: command syntax string shown in help output

    def short_desc(self):
        pass  # body omitted: one-line description of the command

    def long_desc(self):
        pass  # body omitted: long description (falls back to short_desc)

    def help(self):
        pass  # body omitted: full help text shown by the `help` command

    def add_options(self, parser):
        pass  # body omitted: populate the optparse parser with options

    def process_options(self, args, opts):
        pass  # body omitted: apply parsed options (may raise UsageError)

    def run(self, args, opts):
        pass  # body omitted: the actual command logic; overridden by subclasses

 

Class attributes of ScrapyCommand:

requires_project: whether the command must be run inside a Scrapy project
crawler_process: the CrawlerProcess object; it is assigned in the execute() function of cmdline.py

Methods of ScrapyCommand to focus on:

def crawler(self): lazily creates the Crawler object (deprecated in 0.24 in favour of create_crawler, per the warning above)
def run(self, args, opts): must be overridden by each subclass; a minimal custom command is sketched below
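As a hedged illustration of the pieces described above, here is a minimal custom command. The command name hello and the module path myproject/commands/hello.py are hypothetical; in a real project it would live in a module referenced by the COMMANDS_MODULE setting so that _get_commands_dict() can discover it:

# hypothetical file: myproject/commands/hello.py  (with COMMANDS_MODULE = 'myproject.commands')
from scrapy.command import ScrapyCommand
from scrapy.exceptions import UsageError


class Command(ScrapyCommand):

    requires_project = False                   # allow running outside a project
    default_settings = {'LOG_ENABLED': False}  # per-command settings override

    def syntax(self):
        return "<name>"

    def short_desc(self):
        return "Print a greeting (toy example)"

    def run(self, args, opts):
        if len(args) != 1:
            raise UsageError("exactly one <name> argument is required")
        print("Hello, %s" % args[0])
        self.exitcode = 0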
 

A concrete ScrapyCommand subclass is examined in detail later (see Python.Scrapy.14-scrapy-source-code-analysis-part-4).

 

To Be Continued

Next up: the modules signals.py, signalmanager.py, project.py, and conf.py, in Python.Scrapy.12-scrapy-source-code-analysis-part-2.
