Prometheus Alertmanager
Overview
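Alertmanager and Prometheus are two separate components. The Prometheus server sends alerts to Alertmanager based on its alerting rules. Alertmanager then applies silencing, inhibition, and aggregation, and sends notifications through channels such as email, PagerDuty, and HipChat.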
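The main steps for setting up alerting and notifications are:
Install and configure Alertmanager.
Configure Prometheus to talk to Alertmanager via the -alertmanager.url flag.
Create alerting rules in Prometheus.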
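Alertmanager introduction and mechanisms
Alertmanager handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups them, and routes them to the correct receiver, such as email or Slack. Alertmanager also supports grouping, silencing, and inhibition of alerts.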
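Grouping
Grouping categorizes alerts of a similar nature into a single notification. This is especially useful during larger outages, when many systems fail at once and hundreds or thousands of alerts may fire simultaneously.
For example, suppose dozens or hundreds of service instances are running when a network failure occurs, and half of the instances can no longer reach the database. If the Prometheus alerting rules are configured to send an alert for every service instance, hundreds of alerts are sent to Alertmanager.
As a user, you only want a single page while still being able to see exactly which instances are affected. Alertmanager can therefore be configured to group these alerts and send one compact notification.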
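The grouping of alerts, the timing of the grouped notifications, and the receivers of those notifications are configured by a routing tree in the Alertmanager configuration file.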
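As a rough illustration of the grouping idea, the sketch below buckets alerts by the values of a few labels. It assumes alerts are plain dicts of labels; the function name group_alerts and the sample labels are hypothetical and not part of Alertmanager.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (illustrative only)."""
    groups = defaultdict(list)
    for labels in alerts:
        # The group key is the tuple of values for the chosen labels.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts: three instances cannot reach the database.
alerts = [
    {"alertname": "DbUnreachable", "cluster": "eu1", "instance": "app-1"},
    {"alertname": "DbUnreachable", "cluster": "eu1", "instance": "app-2"},
    {"alertname": "DbUnreachable", "cluster": "us1", "instance": "app-9"},
]
grouped = group_alerts(alerts, ["cluster", "alertname"])
for key, members in grouped.items():
    print(key, [m["instance"] for m in members])
```

Each bucket would correspond to one notification that still lists every affected instance.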
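Inhibition
Inhibition suppresses notifications for certain alerts while certain other alerts are already firing. A typical case is an unreachable network that causes connection alerts for many other services.
For example, when an alert fires because an entire cluster is unreachable, Alertmanager can be configured in advance to mute all other alerts concerning that cluster. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue.
Inhibitions are also configured through the Alertmanager configuration file.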
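Silences
Silences are a straightforward way to mute alerts for a given period of time. A silence is configured with matchers, just like the routing tree. Incoming alerts are checked against the equality or regular-expression matchers of each active silence; if they match, no notification is sent for that alert.
A visual editor is available to help build routing trees.
Silences are configured in the Alertmanager web interface.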
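The matcher check can be sketched as follows. This is a simplified model, assuming a silence is a list of dicts with name, value, and an optional is_regex flag (a hypothetical shape chosen for this example), and that regular expressions must match the whole label value.

```python
import re


def is_silenced(labels, matchers):
    """Return True only if every matcher of the silence matches the alert labels."""
    for matcher in matchers:
        value = labels.get(matcher["name"], "")
        if matcher.get("is_regex"):
            # Assumption: the regex is anchored, so it must match the full value.
            if re.fullmatch(matcher["value"], value) is None:
                return False
        elif value != matcher["value"]:
            return False
    return True


# Hypothetical silence: mute everything on staging app instances.
silence = [
    {"name": "env", "value": "staging"},
    {"name": "instance", "value": "app-.*", "is_regex": True},
]
staging_alert = {"alertname": "InstanceDown", "env": "staging", "instance": "app-1"}
prod_alert = {"alertname": "InstanceDown", "env": "prod", "instance": "app-1"}
print(is_silenced(staging_alert, silence), is_silenced(prod_alert, silence))
```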
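Alertmanager configuration
Alertmanager is configured via command-line flags and a configuration file. The command-line flags configure immutable system parameters, while the configuration file defines inhibition rules, notification routing, and notification receivers.
To view all available command-line flags, run alertmanager -h.
Alertmanager can reload its configuration at runtime. If the new configuration is not well-formed, the changes are not applied and an error is logged. A reload is triggered by sending SIGHUP to the process or by sending an HTTP POST request to the /-/reload endpoint.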
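A minimal sketch of the HTTP variant, using only the Python standard library. The base URL http://localhost:9093 is an assumption about where Alertmanager listens, and the helper names are hypothetical.

```python
from urllib.request import Request, urlopen


def build_reload_request(base_url):
    """Build the HTTP POST request for the /-/reload endpoint."""
    return Request(base_url.rstrip("/") + "/-/reload", data=b"", method="POST")


def reload_alertmanager(base_url, timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Assumed address; adjust to your deployment before calling reload_alertmanager().
req = build_reload_request("http://localhost:9093")
print(req.get_method(), req.full_url)
```

Only the request is built at import time; reload_alertmanager() performs the actual network call.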
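Configuration file
To specify which configuration file to load, use the -config.file flag. The file is written in YAML and follows the scheme described below. Brackets indicate that a parameter is optional. For non-list parameters, an omitted value is set to the specified default.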
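Generic placeholders are defined as follows:
<duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
<labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: a string of unicode characters
<filepath>: a valid file path
<boolean>: a boolean that can take the values true or false
<string>: a regular string
<tmpl_string>: a string that is template-expanded before usage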
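To make the <duration> placeholder concrete, here is a small sketch that validates a value against the regular expression above and converts it to milliseconds. The function name parse_duration_ms is hypothetical, and treating a year as 365 days is an assumption of this example.

```python
import re

# The <duration> pattern from the placeholder list above.
_DURATION_RE = re.compile(r"([0-9]+)(ms|[smhdwy])")

# Unit sizes in milliseconds (a year is assumed to be 365 days).
_UNIT_MS = {
    "ms": 1,
    "s": 1000,
    "m": 60 * 1000,
    "h": 3600 * 1000,
    "d": 86400 * 1000,
    "w": 7 * 86400 * 1000,
    "y": 365 * 86400 * 1000,
}


def parse_duration_ms(text):
    """Return the duration in milliseconds, or raise ValueError if malformed."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid duration: %r" % text)
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


print(parse_duration_ms("5m"), parse_duration_ms("30s"), parse_duration_ms("4h"))
```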
The global section specifies parameters that are valid in all other configuration contexts. They serve as defaults for other configuration sections and can be overridden there.
global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  [ resolve_timeout: <duration> | default = 5m ]
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails.
  [ smtp_smarthost: <string> ]
  # SMTP authentication information.
  [ smtp_auth_username: <string> ]
  [ smtp_auth_password: <string> ]
  [ smtp_auth_secret: <string> ]
  # The default SMTP TLS requirement.
  [ smtp_require_tls: <boolean> | default = true ]
  # The API URL to use for Slack notifications.
  [ slack_api_url: <string> ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/generic/2010-04-15/create_event.json" ]
  [ opsgenie_api_host: <string> | default = "https://api.opsgenie.com/" ]
# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]
# The root node of the routing tree.
route: <route>
# A list of notification receivers.
receivers:
  - <receiver> ...
# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]
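Route (route)
A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.
Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it must not have any matchers configured). The alert then traverses the child nodes. If continue is set to false, traversal stops after the first matching child; if continue is true, the alert continues matching against subsequent siblings. If an alert does not match any child of a node (no child matches, or none exist), the alert is handled based on the configuration of the current node.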
Route configuration format
# The receiver for alerts that match this route.
[ receiver: <string> ]
# The labels by which incoming alerts are grouped together.
[ group_by: '[' <labelname>, ... ']' ]
# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]
# A set of equality matchers an alert has to fulfill to match the node.
# Matching alerts are sent to this node's receiver.
match:
  [ <labelname>: <labelvalue>, ... ]
# A set of regex-matchers an alert has to fulfill to match the node.
# Matching alerts are sent to this node's receiver.
match_re:
  [ <labelname>: <regex>, ... ]
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> ]
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification
# has already been sent. (Usually ~5min or more.)
[ group_interval: <duration> ]
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> ]
# Zero or more child routes.
routes:
  [ - <route> ... ]
Example:
# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend
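The traversal described above can be sketched in a few lines. This is a simplified model rather than Alertmanager's implementation: routes are plain dicts, every node is assumed to name its own receiver (inheritance is left out), and regular expressions are assumed to match the whole label value. The names matches and resolve are hypothetical.

```python
import re


def matches(route, labels):
    """Check the equality and regex matchers of a single route node."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels):
    """Return the receivers for an alert, assuming `route` itself already matched."""
    found = []
    for child in route.get("routes", []):
        if matches(child, labels):
            found.extend(resolve(child, labels))
            # Stop at the first matching child unless it sets continue: true.
            if not child.get("continue", False):
                break
    # No child matched: the alert is handled by the current node.
    return found or [route["receiver"]]


# The example routing tree from above, reduced to receivers and matchers.
tree = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "database-pager", "match_re": {"service": "mysql|cassandra"}},
        {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    ],
}
print(resolve(tree, {"service": "mysql"}))
print(resolve(tree, {"team": "frontend"}))
print(resolve(tree, {"service": "redis"}))
```

An alert carrying both service=mysql and team=frontend only reaches 'database-pager' in this model, because that child matches first and does not set continue.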
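Inhibition rules (inhibit_rule)
An inhibition rule mutes alerts matching one set of matchers (the target) while an alert matching another set of matchers (the source) exists. Both alerts must have the same values for the label names listed in equal.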
Inhibition rule configuration format
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]
# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
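The following sketch shows how such a rule could be evaluated. It is a simplification that only handles the equality matchers (target_match, source_match) and the equal list; the regex variants are omitted, and the function name is_inhibited and the sample alerts are hypothetical.

```python
def _has_all(labels, wanted):
    """True if labels contains every name/value pair in wanted."""
    return all(labels.get(name) == value for name, value in wanted.items())


def is_inhibited(target, firing_alerts, rule):
    """Return True if some firing source alert mutes the target under this rule."""
    if not _has_all(target, rule.get("target_match", {})):
        return False
    for source in firing_alerts:
        if source is target:
            continue  # simplification: an alert does not inhibit itself
        if not _has_all(source, rule.get("source_match", {})):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(n, "") == target.get(n, "") for n in rule.get("equal", [])):
            return True
    return False


# Hypothetical rule: a critical alert mutes warnings in the same cluster.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}
outage = {"alertname": "ClusterUnreachable", "severity": "critical", "cluster": "eu1"}
warn_eu = {"alertname": "HighLatency", "severity": "warning", "cluster": "eu1"}
warn_us = {"alertname": "HighLatency", "severity": "warning", "cluster": "us1"}
firing = [outage, warn_eu, warn_us]
print(is_inhibited(warn_eu, firing, rule), is_inhibited(warn_us, firing, rule))
```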
Receiver (receiver)
As the name suggests, a receiver configures where alert notifications are delivered.
General configuration format
# The unique name of the receiver.
name: <string>
# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
Email receiver (email_config)
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]
# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# Further email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
Slack receiver (slack_config)
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The Slack webhook URL.
[ api_url: <string> | default = global.slack_api_url ]
# The channel or user to send notifications to.
channel: <tmpl_string>
# API request data as defined by the Slack webhook API.
[ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
[ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
[ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
[ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
[ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]
Webhook receiver (webhook_config)
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The endpoint to send HTTP POST requests to.
url: <string>
Alertmanager sends HTTP POST requests to the configured endpoint in the following format:
{
  "version": "3",
  "groupKey": <number>,     // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    },
    ...
  ]
}
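A webhook consumer typically parses this body and condenses it before forwarding. The sketch below does that with the standard library only; the function name summarize and the sample body are hypothetical, and the sample merely follows the field layout shown above.

```python
import json


def summarize(payload_text):
    """Turn an Alertmanager webhook body into a short plain-text summary."""
    payload = json.loads(payload_text)
    alerts = payload.get("alerts", [])
    lines = ["[{}] {} alert(s) for receiver {}".format(
        payload["status"].upper(), len(alerts), payload["receiver"])]
    for alert in alerts:
        name = alert["labels"].get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append("- {}: {}".format(name, summary))
    return "\n".join(lines)


# Hypothetical sample body following the format above.
sample = json.dumps({
    "version": "3",
    "groupKey": 1,
    "status": "firing",
    "receiver": "webhook",
    "groupLabels": {"alertname": "InstanceDown"},
    "commonLabels": {"alertname": "InstanceDown"},
    "commonAnnotations": {},
    "externalURL": "http://localhost:9093",
    "alerts": [{
        "labels": {"alertname": "InstanceDown", "instance": "app-1"},
        "annotations": {"summary": "Instance app-1 down"},
        "startsAt": "2017-01-01T00:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
    }],
})
summary_text = summarize(sample)
print(summary_text)
```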
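You can add a DingTalk webhook to send alerts through DingTalk. Because DingTalk requires a specific POST body format, a simple forwarding script does the conversion.
import json
from urllib.request import Request, urlopen

from flask import Flask, request

app = Flask(__name__)

# DingTalk robot webhook; replace xxxx with a real access token.
DINGTALK_URL = 'https://oapi.dingtalk.com/robot/send?access_token=xxxx'


@app.route('/', methods=['POST'])
def send():
    # Forward the raw Alertmanager payload to DingTalk as a text message.
    post_data = request.get_data(as_text=True)
    alert_data(post_data)
    # Flask requires a response body; an empty return would raise an error.
    return 'ok'


def alert_data(data):
    # "content" must be a JSON string, so serialize with json.dumps
    # instead of interpolating the raw payload into a template.
    send_data = json.dumps({'msgtype': 'text', 'text': {'content': data}})
    req = Request(DINGTALK_URL, send_data.encode('utf-8'))
    req.add_header('Content-Type', 'application/json')
    return urlopen(req).read()


if __name__ == '__main__':
    app.run(host='0.0.0.0')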
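Alerting rules
Alerting rules let you define alert conditions based on Prometheus expression language expressions and send notifications about firing alerts to an external service.
Defining alerting rules
Alerting rules are defined in the following format: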
ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]
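The optional FOR clause makes Prometheus wait for the given duration between first encountering a new element in the expression's output vector (for example, an instance with a high HTTP error rate) and counting the alert as firing for that element. Elements that are active but not yet firing are in the pending state.
The LABELS clause specifies a set of additional labels to attach to the alert. Any existing conflicting labels are overwritten. Label values can be templated.
The ANNOTATIONS clause specifies another set of labels that are used to store longer additional information, such as alert descriptions or links. Annotation values can be templated.
Templating
Label and annotation values can be templated using console templates. The $labels variable holds the label key/value pairs of an alert instance, and $value holds the evaluated value of an alert instance.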
# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
Alerting rule examples:
# Alert for any instance that is unreachable for >5 minutes.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }

# Alert for any instance that has a median request latency >1s.
ALERT APIHighRequestLatency
  IF api_http_request_latencies_second{quantile="0.5"} > 1
  FOR 1m
  ANNOTATIONS {
    summary = "High request latency on {{ $labels.instance }}",
    description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
  }
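The effect of the FOR clause on an alert's state can be modeled roughly as below. This is an illustrative sketch, not Prometheus code: timestamps are plain seconds, active_since is None while the expression returns no element, and the function name alert_state is hypothetical.

```python
def alert_state(active_since, now, for_seconds):
    """Return 'inactive', 'pending', or 'firing' for one alert element."""
    if active_since is None:
        return "inactive"
    # The element must stay active for the whole FOR duration before firing.
    if now - active_since >= for_seconds:
        return "firing"
    return "pending"


# InstanceDown uses FOR 5m, i.e. 300 seconds.
print(alert_state(None, 100, 300))
print(alert_state(0, 120, 300))
print(alert_state(0, 300, 300))
```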
Inspecting alerts during runtime
To manually inspect which alerts are active (pending or firing), open the "Alerts" tab in the web interface of your Prometheus instance.
For pending and firing alerts, Prometheus also stores synthetic time series of the form ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}. The sample value is set to 1 as long as the alert is in the indicated active (pending or firing) state, and a single 0 value gets written out when an alert transitions from active to inactive state. Once inactive, the time series does not get further updates.
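Since ALERTS is an ordinary time series, it can be queried like any other. The sketch below only builds the URL for an instant query against the Prometheus HTTP API; the base URL http://localhost:9090 is an assumed address and the function name alerts_query_url is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def alerts_query_url(base_url, state="firing"):
    """Build an instant-query URL selecting ALERTS series in the given state."""
    expr = 'ALERTS{alertstate="%s"}' % state
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


# Assumed Prometheus address; fetch the URL with any HTTP client.
url = alerts_query_url("http://localhost:9090")
print(url)
# Decoding the query string shows the expression that will be evaluated.
print(parse_qs(urlsplit(url).query)["query"][0])
```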
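Sending alert notifications
Prometheus alerting rules are good at determining what is broken right now, but they are not a complete notification solution. Another layer is needed on top of the simple alert definitions to add summarization, notification rate limiting, silencing, and alert dependencies. In the Prometheus ecosystem, Alertmanager takes on this role.
Prometheus can therefore be configured to periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications. The Alertmanager instance is configured via the -alertmanager.url command-line flag.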
Links: https://www.jianshu.com/p/239b145e2acc https://www.jianshu.com/p/b9dcdaa117c7