The crawler itself is written; to deploy and run it, there are still some engineering problems to deal with.
CONCURRENT_ITEMS # maximum number of items processed in parallel (per response)
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).
CONCURRENT_REQUESTS
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
How this works internally:
In next_request there is a while loop that calls needs_backout; if len(self.active) exceeds the limit, the loop keeps idling, otherwise it proceeds to the next step;
The core is still engine.py, which asks the downloader, the slot, etc. whether it needs to back off and wait.
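A simplified sketch of the idea (paraphrased for illustration, not the actual Scrapy source): the downloader refuses new work once its in-flight request count reaches the configured concurrency.

class DownloaderSketch:
    def __init__(self, total_concurrency=16):   # CONCURRENT_REQUESTS
        self.active = set()                     # requests currently being downloaded
        self.total_concurrency = total_concurrency

    def needs_backout(self):
        # True -> the engine's next_request loop idles instead of scheduling more requests
        return len(self.active) >= self.total_concurrency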
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.
CONCURRENT_REQUESTS_PER_IP
Default: 0
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.
This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
DOWNLOAD_DELAY # interval between downloads
DEPTH_LIMIT
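Putting the settings above together, a typical tuning block in settings.py might look like this (the values are only illustrative and should be adjusted per site):

# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 32            # global cap on simultaneous downloads
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 0      # 0 = disabled; if > 0 it replaces the per-domain cap
CONCURRENT_ITEMS = 100              # items processed in parallel per response
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same domain/IP
DEPTH_LIMIT = 3                     # 0 = no depth limit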
First of all, Scrapy ships with close-spider extensions; automatic shutdown conditions can be configured in the settings.
python3.6.4\Lib\site-packages\scrapy\extensions
The core is just these four lines:
crawler.signals.connect(self.error_count, signal=signals.spider_error)
crawler.signals.connect(self.page_count, signal=signals.response_received)
crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)
The signals are connected to methods: each time the corresponding event occurs, the signal fires and the connected function is called to do the actual work.
Example handler:
def error_count(self, failure, response, spider):
    self.counter['errorcount'] += 1
    if self.counter['errorcount'] == self.close_on['errorcount']:
        self.crawler.engine.close_spider(spider, 'closespider_errorcount')
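The thresholds used in these handlers come from the CLOSESPIDER_* settings; an illustrative configuration (values are made up):

# settings.py -- conditions under which the CloseSpider extension stops the spider
CLOSESPIDER_TIMEOUT = 3600      # seconds since spider_opened
CLOSESPIDER_PAGECOUNT = 5000    # responses received
CLOSESPIDER_ITEMCOUNT = 10000   # items scraped
CLOSESPIDER_ERRORCOUNT = 10     # spider errors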
Of course, you can also do the check yourself and then raise the exception or call the existing close method:
In a spider: raise CloseSpider('bandwidth_exceeded')
In other components (middlewares, pipelines, etc.):
crawler.engine.close_spider(self, 'log message')
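For example, a minimal sketch of raising CloseSpider from a spider callback (the spider name, selector and threshold are made up for illustration):

import scrapy
from scrapy.exceptions import CloseSpider

class DemoSpider(scrapy.Spider):
    name = 'demo'                    # hypothetical spider
    start_urls = ['http://example.com/']
    empty_pages = 0

    def parse(self, response):
        if not response.css('div.item'):   # hypothetical "page has no data" check
            self.empty_pages += 1
            if self.empty_pages > 20:
                raise CloseSpider('too_many_empty_pages')
        for item in response.css('div.item'):
            yield {'text': item.css('::text').get()}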
The main resource to watch is memory.
MEMDEBUG_ENABLED
Default: False
Whether to enable memory debugging.
MEMDEBUG_NOTIFY
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log.
Example:
MEMDEBUG_NOTIFY = ['user@example.com']
MEMUSAGE_LIMIT_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED is True). If zero, no check will be performed.
MEMUSAGE_CHECK_INTERVAL_SECONDS
New in version 1.1.
Default: 60.0
Scope: scrapy.extensions.memusage
The interval, in seconds, at which the memory usage extension checks the current memory usage against MEMUSAGE_LIMIT_MB and MEMUSAGE_WARNING_MB.
MEMUSAGE_NOTIFY_MAIL
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']
MEMUSAGE_WARNING_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero, no warning will be produced.
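A combined example of the memory watchdog settings (limits and address are placeholders; note that the memusage extension relies on the resource module, so it does not work on Windows):

# settings.py -- memory watchdog, placeholder values
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048                     # hard shutdown above 2 GB
MEMUSAGE_WARNING_MB = 1536                   # warning mail above 1.5 GB
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']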
This has to be discussed together with how crawl work is usually divided up in a distributed setup:
most crawler setups of any real scale use one master crawler plus several child crawlers; the master crawls the listing pages, extracts the detail-page URLs from them, and builds a second-level request queue, while the child crawlers read requests from that queue and do the actual scraping.
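A minimal sketch of that split, assuming a Redis list as the shared second-level queue (Redis is just one possible backend; spider names and selectors are made up):

import redis
import scrapy

r = redis.Redis(host='localhost', port=6379)

class ListSpider(scrapy.Spider):      # the "master" crawler
    name = 'list_spider'
    start_urls = ['http://example.com/list?page=1']

    def parse(self, response):
        # push detail-page URLs into the second-level request queue
        for href in response.css('a.detail::attr(href)').getall():
            r.lpush('detail_queue', response.urljoin(href))

class DetailSpider(scrapy.Spider):    # a "child" crawler; run several of these
    name = 'detail_spider'

    def start_requests(self):
        while True:
            url = r.rpop('detail_queue')
            if url is None:
                break
            yield scrapy.Request(url.decode(), callback=self.parse)

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}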
Crawler monitoring information serves three main purposes:
progress monitoring: judging how far the current crawl has gotten;
process monitoring: if the crawler process has died, it has to be restarted;
resource monitoring: if the crawler is using more resources than allowed, it has to be restarted to free them;
what needs manual intervention is mainly abnormal stops or getting banned;
Overall, log-based monitoring is the most convenient approach: write the logs into a database or use a SocketHandler,
then build a separate module that handles the monitoring, display and follow-up actions. This keeps the crawler itself simple, and it also avoids the headache of coordinating log writes across multiple processes.
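A minimal sketch of the SocketHandler variant, assuming a separate monitoring process is listening on the given host/port (both are placeholders) and knows how to unpickle logging records:

import logging
import logging.handlers

# Ship every log record to a central monitoring service instead of a local file.
# Scrapy logs through the standard logging module, so attaching a handler to the
# root logger (e.g. from a small custom extension) is enough.
socket_handler = logging.handlers.SocketHandler('127.0.0.1', 9020)
logging.getLogger().addHandler(socket_handler)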
from scrapy.mail import MailSender
mailer = MailSender(
    smtphost="smtp.163.com",         # SMTP server used to send the mail
    mailfrom="***********@163.com",  # sender address
    smtpuser="***********@163.com",  # SMTP user name
    smtppass="***********",          # the SMTP authorization code, not the login password!
    smtpport=25                      # SMTP port
)
body = u"""mail body"""
subject = u'mail subject'
# If the content is too bare, some providers' anti-spam filters may treat it as junk and refuse to deliver it.
mailer.send(to=["****@qq.com", "****@qq.com"], subject=subject, body=body)
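A hedged sketch of turning this into an automatic alert by sending the mail from a spider_closed handler (the extension class, its settings entry and the recipient address are made up):

from scrapy import signals
from scrapy.mail import MailSender

class CloseReportMailer:
    # enable with EXTENSIONS = {'myproject.extensions.CloseReportMailer': 500} (path is hypothetical)
    def __init__(self, crawler):
        self.crawler = crawler
        self.mailer = MailSender.from_settings(crawler.settings)   # reads the MAIL_* settings
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_closed(self, spider, reason):
        stats = self.crawler.stats.get_stats()
        body = 'spider %s closed (%s)\nitems scraped: %s' % (
            spider.name, reason, stats.get('item_scraped_count', 0))
        return self.mailer.send(to=['ops@example.com'], subject='crawl finished', body=body)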
Performance problems can have several causes:
in a crawler, the place where the CPU most easily becomes the bottleneck is parsing;
scaling up means making parsing more efficient: use a faster library, or go straight to regular expressions if the pages allow it; also, Scrapy parses in a single thread, so gevent is worth considering;
scaling out means multiple processes, i.e. running several crawlers on one server at the same time;
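For the multi-process route, a minimal sketch that launches several independent scrapy crawl processes from one script (spider names are placeholders):

import subprocess

# each spider gets its own process, so parsing is no longer limited to one CPU core
spiders = ['list_spider', 'detail_spider', 'detail_spider']
procs = [subprocess.Popen(['scrapy', 'crawl', name]) for name in spiders]
for p in procs:
    p.wait()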