For using Prometheus with a k8s cluster, refer to the official Prometheus documentation.
Download the binaries from https://prometheus.io/download/
https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
Extract the tarballs; the layout under /root then looks like this:
/root/
├── alertmanager -> alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64.tar.gz
├── node_exporter-0.15.2.linux-amd64
├── node_exporter-0.15.2.linux-amd64.tar.gz
├── prometheus -> prometheus-2.0.0.linux-amd64
├── prometheus-2.0.0.linux-amd64
└── prometheus-2.0.0.linux-amd64.tar.gz
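A minimal shell sketch of how this layout could be produced, assuming the download links above and /root as the working directory (adjust versions and paths to your environment):

# download the release tarballs
wget https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
wget https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
# unpack each archive in place
tar xzf prometheus-2.0.0.linux-amd64.tar.gz
tar xzf alertmanager-0.12.0.linux-amd64.tar.gz
tar xzf node_exporter-0.15.2.linux-amd64.tar.gz
# version-agnostic symlinks, matching the tree above
ln -s prometheus-2.0.0.linux-amd64 prometheus
ln -s alertmanager-0.12.0.linux-amd64 alertmanager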
Create alert.yml.
[root@n1 alertmanager]# ls
alertmanager  alert.yml  amtool  data  LICENSE  NOTICE  simple.yml
alert.yml defines who sends, what events, to whom, and how they are sent.
cat alert.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'maotai@163.com'
  smtp_auth_username: 'maotai@163.com'
  smtp_auth_password: '123456'

templates:
  - '/root/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'maotai@foxmail.com'

Once configured, start it: ./alertmanager -config.file=./alert.yml
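To confirm it came up, you can query its HTTP endpoint (a sketch, assuming the default listen port 9093 and the v1 API of this Alertmanager version):

# should return a JSON document with the running config and version info
curl -s http://localhost:9093/api/v1/status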
For testing, send an alert email when memory usage exceeds 2%:
$ cat rule.yml
groups:
- name: test-rule
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 2
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Memory usage detected"
      description: "{{$labels.instance}}: Memory usage is above 2% (current value is: {{ $value }})"
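If your promtool build supports rule checking, the file can be validated before Prometheus loads it (a sketch; the path matches the rule_files entry used below):

# syntax-check the alerting rules
./promtool check rules /root/prometheus/rule.yml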
The key is this expression:
(node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 2
labels gives this rule a label (here severity: warning).
annotations (the alert description) is the content of the alert notification.
Where do the monitoring keys come from (node_memory_MemTotal / node_memory_MemFree / node_memory_Buffers / node_memory_Cached)? This is covered further below.
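As a quick sanity check, the same expression can be evaluated ad hoc through the Prometheus HTTP query API (a sketch; the host and port assume the prometheus.yml shown below):

# returns the current memory usage percentage per instance
curl -sG http://192.168.14.11:9090/api/v1/query \
  --data-urlencode 'query=(node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100'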
In prometheus.yml, add a job for node_exporter, and add the alerting rules under rule_files so that rule.yml is loaded:
$ cat prometheus.yml
global:
  scrape_interval: 15s      # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

rule_files:
  - /root/prometheus/rule.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.14.11:9090']

  - job_name: linux
    static_configs:
      - targets: ['192.168.14.11:9100']
        labels:
          instance: db1
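Before starting, the configuration can be validated and Prometheus launched against it (a sketch; paths assume the /root/prometheus layout from above):

# validate the main config (this also loads the referenced rule files)
./promtool check config prometheus.yml
# start Prometheus with this configuration
./prometheus --config.file=prometheus.yml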
Start Prometheus and open the web UI; the node target now shows up under Targets.
Check the metrics exposed by node_exporter.
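For example (a sketch, assuming node_exporter listens on its default port 9100 on the host scraped above):

# dump the memory-related series exposed by node_exporter
curl -s http://192.168.14.11:9100/metrics | grep '^node_memory'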
Check the Alerts page; it shows the state of each alerting rule.
The keys used in these expressions can be browsed here (provided the corresponding exporter is installed); write your alerting expressions against these keys.
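One way to list every metric name Prometheus currently knows about is the label-values endpoint of its HTTP API (a sketch; host and port as above):

# returns all metric names, i.e. the keys usable in alerting expressions
curl -s http://192.168.14.11:9090/api/v1/label/__name__/values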
A fuller Alertmanager configuration, with routing, inhibition rules, and both email and WeChat receivers:

global:
  # The smarthost and SMTP sender used for mail notifications.
  resolve_timeout: 6m
  smtp_smarthost: '172.16.100.14:25'
  smtp_from: 'svnbuild_yf@iflytek.com'
  smtp_auth_username: 'svnbuild_yf'
  smtp_auth_password: 'tag#write@2015313'
  smtp_require_tls: false
  # The auth token for Hipchat.
  hipchat_auth_token: '1234556789'
  # Alternative host for Hipchat.
  hipchat_api_url: 'https://hipchat.foobar.org/'
  wechat_api_url: "https://qyapi.weixin.qq.com/cgi-bin/"
  wechat_api_secret: "4tQroVeB0xUcccccccc65Yfkj2Nkt90a80MH3ayI"
  wechat_api_corp_id: "wxaf5acxxxx5f8eb98"

# The directory from which notification templates are read.
templates:
  - 'templates/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 3s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 1h

  # A default receiver
  receiver: ybyang2

  routes:
  - match:
      job: "11"
      #service: "node_exporter"
    routes:
    - match:
        status: yellow
      receiver: ybyang2
    - match:
        status: orange
      receiver: berlin

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    service: 'up'
  target_match:
    service: 'mysql'
  # Apply inhibition if the alertname is the same.
  equal: ["instance"]
- source_match:
    service: "mysql"
  target_match:
    service: "mysql-query"
  equal: ['instance']
- source_match:
    service: "A"
  target_match:
    service: "B"
  equal: ["instance"]
- source_match:
    service: "B"
  target_match:
    service: "C"
  equal: ["instance"]

receivers:
- name: 'ybyang2'
  email_configs:
  - to: 'ybyang2@iflytek.com'
    send_resolved: true
    html: '{{ template "email.default.html" . }}'
    headers: { Subject: "[mail] 测试技术部监控告警邮件" }
- name: "berlin"
  wechat_configs:
  - send_resolved: true
    to_user: "@all"
    to_party: ""
    to_tag: ""
    agent_id: "1"
    corp_id: "wxaf5a99ccccc5f8eb98"
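To exercise the routing above, a test alert can be pushed directly into Alertmanager's v1 API (a sketch; the labels job="11" and status="orange" are chosen so the sub-route above delivers to the berlin WeChat receiver):

# fire a synthetic alert; Alertmanager groups and routes it per the config above
curl -XPOST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "RouteTest",
      "job": "11",
      "status": "orange"
    },
    "annotations": {
      "summary": "manual routing test"
    }
  }
]'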