Extensions provide a mechanism for plugging custom functionality into Scrapy. For example, the core-stats extension collects statistics about a spider's run.
Scrapy creates a single instance of each extension at startup. Extensions are configured in the settings.py file:
# The dict below lists the extensions to load: each key is the import path
# of an extension, and the value controls the loading order.
EXTENSIONS = {
    'scrapy.contrib.corestats.CoreStats': 500,
    'scrapy.webservice.WebService': 500,
    'scrapy.telnet.TelnetConsole': 500,
}

# Enable the custom extension shown below
MYEXT_ENABLED = True
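Conversely, a built-in extension can be switched off by mapping its path to None in EXTENSIONS; a minimal sketch:

# Mapping the order value to None disables the extension
EXTENSIONS = {
    'scrapy.telnet.TelnetConsole': None,
}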
from_crawler is the class method the framework calls when it creates a component instance (downloader middleware, extension, and so on). It is the place to validate and pass along configuration: the crawler.settings object exposes the settings defined in settings.py.
Calling crawler.signals.connect registers an event callback, which the framework invokes when the corresponding signal fires.
from scrapy import signals
from scrapy.exceptions import NotConfigured


class SpiderOpenCloseLogging(object):

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Check whether the extension is enabled; raise NotConfigured if not
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # Read the item threshold from settings.py, defaulting to 1000
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # Create the extension instance
        ext = cls(item_count)

        # Listen for the spider-opened event
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        # Listen for the spider-closed event
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # Listen for the item-scraped event (fires once per scraped item)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # Return the extension object
        return ext

    def spider_opened(self, spider):
        spider.log("opened spider %s" % spider.name)

    def spider_closed(self, spider):
        spider.log("closed spider %s" % spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped == self.item_count:
            spider.log("scraped %d items, resetting counter" % self.items_scraped)
            # Reset the running count so the message repeats every item_count items
            self.items_scraped = 0
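To activate this extension, settings.py also needs an entry pointing at the class, alongside the settings it reads; a short sketch, assuming the class lives in a hypothetical myproject/extensions.py module:

# settings.py -- the module path myproject.extensions is an assumption
MYEXT_ENABLED = True      # without this, from_crawler raises NotConfigured
MYEXT_ITEMCOUNT = 1000    # optional; from_crawler falls back to 1000
EXTENSIONS = {
    'myproject.extensions.SpiderOpenCloseLogging': 500,
}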