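Monitoring a Redis cluster with Prometheus follows the same pattern as monitoring anything else: use an exporter.
The exporter collects the metrics and exposes them over HTTP for Prometheus to pull. Grafana then draws dashboards from those metrics. Prometheus also evaluates the collected data against the alerting rules you have configured to decide whether to send an alert to Alertmanager, and Alertmanager in turn decides whether a notification actually goes out.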
An alert in Alertmanager goes through three stages.
That was a digression. Let's start monitoring the Redis cluster.
Whatever application you want to monitor, the matching exporter can be looked up on the official site: EXPORTERS AND INTEGRATIONS.
For Redis, use redis_exporter. Link: redis_exporter
It supports Redis 2.x - 5.x.
Download:
wget https://github.com/oliver006/redis_exporter/releases/download/v1.3.5/redis_exporter-v1.3.5.linux-amd64.tar.gz
tar zxvf redis_exporter-v1.3.5.linux-amd64.tar.gz
cd redis_exporter-v1.3.5.linux-amd64/
./redis_exporter <flags>
redis_exporter supports quite a few flags, but only a handful matter here.
./redis_exporter --help
Usage of ./redis_exporter:
  -redis.addr string
        Address of the Redis instance to scrape (default "redis://localhost:6379")
  -redis.password string
        Password of the Redis instance to scrape
  -web.listen-address string
        Address to listen on for web interface and telemetry. (default ":9121")
nohup ./redis_exporter -redis.addr 172.18.11.138:6379 -redis.password xxxxx &
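Before wiring the exporter into Prometheus, it can help to confirm that it actually reached Redis. The following is a minimal sketch, not part of redis_exporter: `redis_up_value` is a hypothetical helper name, and the sample input is hand-written rather than real exporter output. It reads the text format the exporter serves and prints the value of `redis_up`.

```shell
#!/usr/bin/env bash
# redis_up_value: read Prometheus text exposition format on stdin and print
# the value of the first redis_up sample. "# HELP" / "# TYPE" comment lines
# are skipped because their first field is "#".
# Assumes samples carry no trailing timestamp, so the value is the last field.
redis_up_value() {
  awk '$1 ~ /^redis_up([{]|$)/ { print $NF; exit }'
}

# Offline demo with hand-written sample input (not real exporter output).
redis_up_value <<'EOF'
# TYPE redis_up gauge
redis_up 1
EOF
```

Against the running exporter, the same helper could be fed with `curl -s http://172.18.11.138:9121/metrics | redis_up_value`, assuming the exporter listens on the default port 9121 on that host.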
Add the single instance to Prometheus:
- job_name: redis_since
  static_configs:
    - targets: ['172.18.11.138:9121']
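Monitoring the cluster took a lot more effort. I went through plenty of material online, and most of it covers a single instance only. Just one article deals with a cluster, and that cluster happens to have no password:
prometheus监控redis集群 (Monitoring a Redis cluster with Prometheus)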
Approaches I tried:
Both of the following fail with an authentication error:
level=error msg="Redis INFO err: NOAUTH Authentication required."
Method 1
nohup ./redis_exporter -redis.addr 172.18.11.139:7000 172.18.11.139:7001 172.18.11.140:7002 172.18.11.140:7003 172.18.11.141:7004 172.18.11.141:7005 -redis.password xxxxx &
Method 2
nohup ./redis_exporter -redis.addr redis://h:Lcsmy.312==/@172.18.11.139:7000 redis://h:Lcsmy.312==/@172.18.11.139:7001 redis://h:Lcsmy.312==/@172.18.11.140:7002 redis://h:Lcsmy.312==/@172.18.11.140:7003 redis://h:Lcsmy.312==/@172.18.11.141:7004 redis://h:Lcsmy.312==/@172.18.11.141:7005 -redis.password xxxxx &
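I then considered the crudest approach: start one redis_exporter per instance. Done that way, though, many of the cluster-level queries stop working, for example cluster_slot_fail, so I dropped this approach.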
nohup ./redis_exporter -redis.addr 172.18.11.139:7000 -redis.password xxxxxx -web.listen-address 172.18.11.139:9121 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.139:7001 -redis.password xxxxxx -web.listen-address 172.18.11.139:9122 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.140:7002 -redis.password xxxxxx -web.listen-address 172.18.11.139:9123 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.140:7003 -redis.password xxxxxx -web.listen-address 172.18.11.139:9124 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.141:7004 -redis.password xxxxxx -web.listen-address 172.18.11.139:9125 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.141:7005 -redis.password xxxxxx -web.listen-address 172.18.11.139:9126 > /dev/null 2>&1 &
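In the end I had no choice but to open an issue on GitHub. After going back and forth with the author in my Chinglish, I finally got it... and it turns out the official documentation already covers this: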
scrape_configs:
  ## config for the multiple Redis targets that the exporter will scrape
  - job_name: 'redis_exporter_targets'
    static_configs:
      - targets:
        - redis://first-redis-host:6379
        - redis://second-redis-host:6379
        - redis://second-redis-host:6380
        - redis://second-redis-host:6381
    metrics_path: /scrape
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: <<REDIS-EXPORTER-HOSTNAME>>:9121

  ## config for scraping the exporter itself
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
        - <<REDIS-EXPORTER-HOSTNAME>>:9121
Start redis_exporter:
nohup ./redis_exporter -redis.password xxxxx &
The key part is how to configure Prometheus:
- job_name: 'redis_exporter_targets'
  static_configs:
    - targets:
      - redis://172.18.11.139:7000
      - redis://172.18.11.139:7001
      - redis://172.18.11.140:7002
      - redis://172.18.11.140:7003
      - redis://172.18.11.141:7004
      - redis://172.18.11.141:7005
  metrics_path: /scrape
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 172.18.11.139:9121

## config for scraping the exporter itself
- job_name: 'redis_exporter'
  static_configs:
    - targets:
      - 172.18.11.139:9121
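The relabeling is easier to follow once you see the request it produces. Each listed target is copied into the `target` URL parameter and into the `instance` label, and `__address__` is then overwritten with the exporter's own address, so every scrape goes to the exporter's `/scrape` path. A rough sketch of the resulting URLs follows. `scrape_url` is a hypothetical helper, and the URLs are printed unencoded for readability, whereas Prometheus would percent-encode the parameter value.

```shell
#!/usr/bin/env bash
# Sketch: print the URL Prometheus ends up requesting for each target
# once the relabel_configs have rewritten __address__ and __param_target.

exporter="172.18.11.139:9121"   # the "replacement" value for __address__

# scrape_url EXPORTER TARGET -> URL of the exporter's /scrape endpoint
# (hypothetical helper, for illustration only)
scrape_url() {
  printf 'http://%s/scrape?target=%s\n' "$1" "$2"
}

for target in redis://172.18.11.139:7000 redis://172.18.11.140:7002; do
  scrape_url "$exporter" "$target"
done
```

Fetching one of these URLs by hand, for example with `curl -s "$(scrape_url "$exporter" redis://172.18.11.141:7005)"`, should return that node's metrics if the exporter can reach it.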
With this, the cluster's data is collected. However, the log shows:
time="2019-12-17T09:10:49+08:00" level=error msg="Couldn't connect to redis instance"
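During the lunch break it suddenly clicked: as long as the exporter can connect to one node of the cluster, it can naturally query the metrics of the other nodes as well. (The error most likely came from the exporter falling back to its default address, redis://localhost:6379, because no -redis.addr was given.) So I changed the start command to: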
nohup ./redis_exporter -redis.addr 172.18.11.141:7005 -redis.password xxxxx &
The Prometheus configuration stays unchanged.
Here are a few screenshots:
groups:
- name: Redis
  rules:
  - alert: RedisDown
    expr: redis_up == 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Redis down (instance {{ $labels.instance }})"
      description: "Redis is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: MissingBackup
    expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Missing backup (instance {{ $labels.instance }})"
      description: "Redis has not been backed up for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: OutOfMemory
    expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of memory (instance {{ $labels.instance }})"
      description: "Redis is running out of memory (> 90%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: ReplicationBroken
    expr: delta(redis_connected_slaves[1m]) < 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Replication broken (instance {{ $labels.instance }})"
      description: "Redis instance lost a slave\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: TooManyConnections
    expr: redis_connected_clients > 1000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Too many connections (instance {{ $labels.instance }})"
      description: "Redis instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: NotEnoughConnections
    expr: redis_connected_clients < 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Not enough connections (instance {{ $labels.instance }})"
      description: "Redis instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: RejectedConnections
    expr: increase(redis_rejected_connections_total[1m]) > 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Rejected connections (instance {{ $labels.instance }})"
      description: "Some connections to Redis have been rejected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
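To get a feel for the OutOfMemory threshold, the expression can be worked through by hand. This is only a sketch of the arithmetic, not how Prometheus evaluates rules. `mem_used_pct` and `exceeds` are hypothetical helper names, and the byte counts are made-up sample values (15 GiB used on a 16 GiB host).

```shell
#!/usr/bin/env bash
# Worked arithmetic for:
#   redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90

# mem_used_pct USED_BYTES TOTAL_BYTES -> percentage with two decimals
mem_used_pct() {
  awk -v used="$1" -v total="$2" 'BEGIN { printf "%.2f\n", used / total * 100 }'
}

# exceeds VALUE THRESHOLD -> exit status 0 when VALUE > THRESHOLD
exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Hypothetical sample values: 15 GiB used, 16 GiB total.
pct=$(mem_used_pct 16106127360 17179869184)
echo "$pct"

if exceeds "$pct" 90; then
  # With "for: 5m" the alert only fires after the condition has held for 5 minutes.
  echo "OutOfMemory condition is true"
fi
```

Note that the rule compares Redis memory with the host's total memory, so several Redis instances sharing one host are each measured against the same total.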