Alertmanager DingTalk alerts fail with a 400 error

Background

We monitor Pulsar with Prometheus + prometheus-webhook-dingtalk + Alertmanager and deliver alerts through DingTalk, so that a backlog or failure in Pulsar can be handled as soon as it happens. For installation details, see the earlier post "Monitoring Pulsar with Prometheus + Grafana + Alertmanager and Alerting via DingTalk".

Symptom

In the non-production environment, Pulsar backlog alerts fire normally. In production, the backlog crossed the alert threshold and the alert was triggered, yet DingTalk never received the notification.


Troubleshooting

First, rule out Alertmanager itself: can it send an alert at all? In the alert rule file, flip the `up` expression from 0 to 1 so the rule always fires:

groups:
  - name: node
    rules:     
      - alert: InstanceDown
        expr: up == 1
        for: 1m
        labels:
          status: danger
        annotations:
          summary: "Instance {{ $labels.instance }} down."
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

DingTalk received this test alert, so the alerting pipeline itself works.


Next, check the configuration of every component (Prometheus, prometheus-webhook-dingtalk, Alertmanager); everything looked correct. Restarting all of the components did not help either. Watching the logs during the restarts, Prometheus logged nothing unusual, but prometheus-webhook-dingtalk showed an error:

level=info ts=2021-08-12T06:12:23.622Z caller=entry.go:22 component=web http_scheme=http http_proto=HTTP/1.1 http_method=POST remote_addr=10.7.7.48:7510 user_agent=Alertmanager/0.21.0 uri=http://10.7.7.28:8060/dingtalk/webhook1/send resp_status=400 resp_bytes_length=27 resp_elapsed_ms=228.690575 msg="request complete"
level=error ts=2021-08-12T06:12:33.620Z caller=dingtalk.go:103 component=web target=webhook1 msg="Failed to send notification to DingTalk" respCode=460101 respMsg="message too long, exceed 20000 bytes"

`msg="Failed to send notification to DingTalk" respCode=460101 respMsg="message too long, exceed 20000 bytes"` — the alert message is too large, so DingTalk refuses to deliver it.
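The arithmetic behind respCode 460101 is easy to reproduce: each firing alert renders to a few hundred bytes of markdown, and all alerts in one group are sent as a single DingTalk message. A quick sketch (the 400-byte line is a stand-in for one rendered alert, not real template output):

```shell
# One fake rendered alert, ~400 bytes of markdown.
line=$(printf 'x%.0s' $(seq 1 400))

# 60 firing alerts bundled into one notification, as in this incident.
msg=$(for i in $(seq 1 60); do echo "$line"; done)

# Total size of the notification body.
bytes=$(printf '%s' "$msg" | wc -c | tr -d ' ')
echo "message size: ${bytes} bytes"

# DingTalk rejects markdown bodies over 20000 bytes.
if [ "$bytes" -gt 20000 ]; then
  echo "message too long, exceed 20000 bytes"
fi
```

With 60 alerts at ~400 bytes each the body is well past the 20000-byte cap, which matches the error in the log above.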

Check the metric in the Prometheus console:


The query returns 60 series, each carrying many labels. Sending all of them in a single notification exceeds DingTalk's size limit, which is why the alert never arrived.
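Before deploying a rule, the number of series it would fire on can be checked directly in the Prometheus console with standard PromQL (the metric name is from this setup; the threshold mirrors the rule used below):

```
# How many series are currently over the backlog threshold:
count(pulsar_msg_backlog > 40000)

# The raw matches, with all of their labels:
pulsar_msg_backlog > 40000
```

If the count is large, the rendered notification will be large too.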

Solution

Since the root cause is an oversized alert payload, the fix is to shrink it: use a more selective PromQL expression in the rule file to filter out the series we don't need.

      - alert: TooManyBacklogsOnTopic
        expr: pulsar_msg_backlog{job="node-broker"} > 40000
        for: 30s
        labels:
          status: warning
        annotations:
          #summary: "Backlogs of topic are more than 50000."
          #description: "Backlogs of topic {{ $labels.topic }} is more than 50000 , current value is {{ $value }}."

Matching the `job` label exactly (substitute your own job value) filters out the unneeded series. The query now returns 30 results, half of the original.
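Filtering in the rule expression solved it here. As a complementary measure (an assumption, not part of the original setup), Alertmanager can also split one huge notification into several smaller ones by grouping, so each DingTalk message stays under the 20000-byte cap. A minimal sketch of an alertmanager.yml route (the receiver name matches the webhook1 target seen in the logs):

```yaml
route:
  receiver: webhook1
  group_by: ['alertname', 'job']   # one notification per alertname/job pair
  group_wait: 30s
  group_interval: 5m
```

Tighter grouping trades one oversized message for several smaller ones, at the cost of more notifications.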


Now confirm that DingTalk receives the alert again.


DingTalk received the alert. Solved!
