The crawler itself is written; to deploy and run it, there are still some engineering problems to deal with.
CONCURRENT_ITEMS # maximum number of items processed in parallel (per response)
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).
CONCURRENT_REQUESTS
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
How this works internally:
In next_request there is a while loop that calls needs_backout; if len(self.active) exceeds the limit, the loop keeps idling, otherwise it proceeds to the next step;
The core is still engine.py, which asks the downloader, the slot, etc. whether it needs to back off and wait.
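A simplified sketch of the idea (paraphrased for illustration, not the actual Scrapy source): the downloader refuses new work once its in-flight request count reaches the configured concurrency.

class DownloaderSketch:
    def __init__(self, total_concurrency=16):   # CONCURRENT_REQUESTS
        self.active = set()                     # requests currently being downloaded
        self.total_concurrency = total_concurrency

    def needs_backout(self):
        # True -> the engine's next_request loop idles instead of scheduling more requests
        return len(self.active) >= self.total_concurrency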
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.
CONCURRENT_REQUESTS_PER_IP
Default: 0
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.
This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
DOWNLOAD_DELAY # interval between downloads
DEPTH_LIMIT
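Putting the settings above together, a typical tuning block in settings.py might look like this (the values are only illustrative and should be adjusted per site):

# settings.py -- illustrative values only
CONCURRENT_REQUESTS = 32            # global cap on simultaneous downloads
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap
CONCURRENT_REQUESTS_PER_IP = 0      # 0 = disabled; if > 0 it replaces the per-domain cap
CONCURRENT_ITEMS = 100              # items processed in parallel per response
DOWNLOAD_DELAY = 0.5                # seconds between requests to the same domain/IP
DEPTH_LIMIT = 3                     # 0 = no depth limit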
First of all, Scrapy ships with close-spider extensions; automatic shutdown conditions can be configured in the settings.
python3.6.4\Lib\site-packages\scrapy\extensions
The core is just these four lines:
crawler.signals.connect(self.error_count, signal=signals.spider_error)
crawler.signals.connect(self.page_count, signal=signals.response_received)
crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)
The signals are connected to methods: each time the corresponding event occurs, the signal fires and the connected function is called to do the actual work.
Example handler:
def error_count(self, failure, response, spider):
    self.counter['errorcount'] += 1
    if self.counter['errorcount'] == self.close_on['errorcount']:
        self.crawler.engine.close_spider(spider, 'closespider_errorcount')
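The thresholds used in these handlers come from the CLOSESPIDER_* settings; an illustrative configuration (values are made up):

# settings.py -- conditions under which the CloseSpider extension stops the spider
CLOSESPIDER_TIMEOUT = 3600      # seconds since spider_opened
CLOSESPIDER_PAGECOUNT = 5000    # responses received
CLOSESPIDER_ITEMCOUNT = 10000   # items scraped
CLOSESPIDER_ERRORCOUNT = 10     # spider errors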
Of course, you can also do the check yourself and then raise the exception or call the existing close method:
In a spider: raise CloseSpider('bandwidth_exceeded')
In other components (middlewares, pipelines, etc.):
crawler.engine.close_spider(self, 'log message')
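For example, a minimal sketch of raising CloseSpider from a spider callback (the spider name, selector and threshold are made up for illustration):

import scrapy
from scrapy.exceptions import CloseSpider

class DemoSpider(scrapy.Spider):
    name = 'demo'                    # hypothetical spider
    start_urls = ['http://example.com/']
    empty_pages = 0

    def parse(self, response):
        if not response.css('div.item'):   # hypothetical "page has no data" check
            self.empty_pages += 1
            if self.empty_pages > 20:
                raise CloseSpider('too_many_empty_pages')
        for item in response.css('div.item'):
            yield {'text': item.css('::text').get()}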
The main resource to watch is memory.
MEMDEBUG_ENABLED
Default: False
Whether to enable memory debugging.
MEMDEBUG_NOTIFY
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log.
Example:
MEMDEBUG_NOTIFY = ['user@example.com']
MEMUSAGE_LIMIT_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED is True). If zero, no check will be performed.
MEMUSAGE_CHECK_INTERVAL_SECONDS
New in version 1.1.
Default: 60.0
Scope: scrapy.extensions.memusage
The interval, in seconds, at which the memory usage extension checks the current memory usage against MEMUSAGE_LIMIT_MB and MEMUSAGE_WARNING_MB.
MEMUSAGE_NOTIFY_MAIL
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']
MEMUSAGE_WARNING_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero, no warning will be produced.
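A combined example of the memory watchdog settings (limits and address are placeholders; note that the memusage extension relies on the resource module, so it does not work on Windows):

# settings.py -- memory watchdog, placeholder values
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048                     # hard shutdown above 2 GB
MEMUSAGE_WARNING_MB = 1536                   # warning mail above 1.5 GB
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']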
This has to be discussed together with how crawl work is usually divided up in a distributed setup:
most crawler setups of any real scale use one master crawler plus several child crawlers; the master crawls the listing pages, extracts the detail-page URLs from them, and builds a second-level request queue, while the child crawlers read requests from that queue and do the actual scraping.
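A minimal sketch of that split, assuming a Redis list as the shared second-level queue (Redis is just one possible backend; spider names and selectors are made up):

import redis
import scrapy

r = redis.Redis(host='localhost', port=6379)

class ListSpider(scrapy.Spider):      # the "master" crawler
    name = 'list_spider'
    start_urls = ['http://example.com/list?page=1']

    def parse(self, response):
        # push detail-page URLs into the second-level request queue
        for href in response.css('a.detail::attr(href)').getall():
            r.lpush('detail_queue', response.urljoin(href))

class DetailSpider(scrapy.Spider):    # a "child" crawler; run several of these
    name = 'detail_spider'

    def start_requests(self):
        while True:
            url = r.rpop('detail_queue')
            if url is None:
                break
            yield scrapy.Request(url.decode(), callback=self.parse)

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}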
Crawler monitoring information serves three main purposes:
progress monitoring: judging how far the current crawl has gotten;
process monitoring: if the crawler process has died, it has to be restarted;
resource monitoring: if the crawler is using more resources than allowed, it has to be restarted to free them;
what needs manual intervention is mainly abnormal stops or getting banned;
Overall, log-based monitoring is the most convenient approach: write the logs into a database or use a SocketHandler,
then build a separate module that handles the monitoring, display and follow-up actions. This keeps the crawler itself simple, and it also avoids the headache of coordinating log writes across multiple processes.
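A minimal sketch of the SocketHandler variant, assuming a separate monitoring process is listening on the given host/port (both are placeholders) and knows how to unpickle logging records:

import logging
import logging.handlers

# Ship every log record to a central monitoring service instead of a local file.
# Scrapy logs through the standard logging module, so attaching a handler to the
# root logger (e.g. from a small custom extension) is enough.
socket_handler = logging.handlers.SocketHandler('127.0.0.1', 9020)
logging.getLogger().addHandler(socket_handler)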
from scrapy.mail import MailSender
mailer = MailSender(
    smtphost="smtp.163.com",         # SMTP server used to send the mail
    mailfrom="***********@163.com",  # sender address
    smtpuser="***********@163.com",  # SMTP user name
    smtppass="***********",          # the SMTP authorization code, not the login password!
    smtpport=25                      # SMTP port
)
body = u"""mail body"""
subject = u'mail subject'
# If the content is too bare, some providers' anti-spam filters may treat it as junk and refuse to deliver it.
mailer.send(to=["****@qq.com", "****@qq.com"], subject=subject, body=body)
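A hedged sketch of turning this into an automatic alert by sending the mail from a spider_closed handler (the extension class, its settings entry and the recipient address are made up):

from scrapy import signals
from scrapy.mail import MailSender

class CloseReportMailer:
    # enable with EXTENSIONS = {'myproject.extensions.CloseReportMailer': 500} (path is hypothetical)
    def __init__(self, crawler):
        self.crawler = crawler
        self.mailer = MailSender.from_settings(crawler.settings)   # reads the MAIL_* settings
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_closed(self, spider, reason):
        stats = self.crawler.stats.get_stats()
        body = 'spider %s closed (%s)\nitems scraped: %s' % (
            spider.name, reason, stats.get('item_scraped_count', 0))
        return self.mailer.send(to=['ops@example.com'], subject='crawl finished', body=body)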
Performance problems can have several causes:
in a crawler, the place where the CPU most easily becomes the bottleneck is parsing;
scaling up means making parsing more efficient: use a faster library, or go straight to regular expressions if the pages allow it; also, Scrapy parses in a single thread, so gevent is worth considering;
scaling out means multiple processes, i.e. running several crawlers on one server at the same time;
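For the multi-process route, a minimal sketch that launches several independent scrapy crawl processes from one script (spider names are placeholders):

import subprocess

# each spider gets its own process, so parsing is no longer limited to one CPU core
spiders = ['list_spider', 'detail_spider', 'detail_spider']
procs = [subprocess.Popen(['scrapy', 'crawl', name]) for name in spiders]
for p in procs:
    p.wait()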