Prometheus Operator: Custom Monitoring Targets

The metrics that Prometheus Operator monitors out of the box do not always cover real-world needs, so we have to add custom monitoring for our own workloads. Adding a custom monitoring target takes the following steps:
1. Create a ServiceMonitor object, which tells Prometheus about the new scrape target
2. Associate the ServiceMonitor with a Service object that exposes the metrics endpoint
3. Make sure the Service can actually serve the metrics data

Below, this article walks through adding Redis monitoring as an example.

Deploy Redis

k8s-redis-and-exporter-deployment.yaml

---
apiVersion: v1
kind: Namespace
metadata:
  name: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: redis
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
      - name: redis-exporter
        image: oliver006/redis_exporter:latest
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 9121

When deploying Redis, we run redis_exporter as a sidecar in the same Pod as the Redis container.
Also note that we added the annotations prometheus.io/scrape: "true" and prometheus.io/port: "9121".

Create the Redis Service

apiVersion: v1
kind: Service
metadata:
  name: redis-svc
  namespace: redis
  labels:
    app: redis
spec:
  type: NodePort
  ports:
  - name: redis
    port: 6379
    targetPort: 6379
  - name: redis-exporter
    port: 9121
    targetPort: 9121
  selector:
    app: redis

Check the deployed resources and verify that the metrics endpoint returns data:

[root@]# kubectl get po,ep,svc -n redis
NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-78446485d8-sp57x   2/2     Running   0          116m

NAME                  ENDPOINTS                               AGE
endpoints/redis-svc   100.102.126.3:9121,100.102.126.3:6379   6m5s

NAME                TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)                         AGE
service/redis-svc   NodePort   10.105.111.177   <none>        6379:32357/TCP,9121:31019/TCP   6m5s

Verify the metrics:
[root@qd01-stop-k8s-master001 MyDefine]# curl 10.105.111.177:9121/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
............
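Since the alerting rule we add later keys on the redis_up metric, it is worth confirming that this specific metric is exposed. A quick check (the ClusterIP is the one from the Service above; substitute your own), which should print redis_up 1 when the exporter can reach Redis:

curl -s 10.105.111.177:9121/metrics | grep '^redis_up'
redis_up 1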

Create the ServiceMonitor

Now, to have Prometheus scrape Redis, all that remains is to create a ServiceMonitor object.
prometheus-serviceMonitorRedis.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis-k8s
  namespace: monitoring
  labels:
    app: redis
spec:
  jobLabel: redis
  endpoints:
  - port: redis-exporter
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      app: redis
  namespaceSelector:
    matchNames:
    - redis

Apply it and check the ServiceMonitor:

[root@]# kubectl apply -f prometheus-serviceMonitorRedis.yaml
servicemonitor.monitoring.coreos.com/redis-k8s created

[root@]# kubectl get serviceMonitor -n monitoring
NAME                      AGE
redis-k8s                 11s

Now switch to the Targets page in the Prometheus UI; you will see the newly added redis-k8s target.
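If the target does not show up, one common cause is that the Prometheus resource's serviceMonitorSelector does not match the ServiceMonitor's labels or namespace. A quick way to inspect it (the resource name k8s is the kube-prometheus default; adjust if yours differs):

kubectl -n monitoring get prometheus k8s -o jsonpath='{.spec.serviceMonitorSelector}'
kubectl -n monitoring get prometheus k8s -o jsonpath='{.spec.serviceMonitorNamespaceSelector}'

An empty selector ({}) matches everything.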
You can now query the Redis metrics collected by redis_exporter in the Prometheus expression browser.
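A few example queries to try (these are standard redis_exporter metric names; exact availability depends on the exporter version):

redis_up
redis_connected_clients
redis_memory_used_bytes
rate(redis_commands_processed_total[5m])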

Configure a PrometheusRule

We can now collect Redis metrics, but no alerting rules are configured for them yet; we have to add rules for the indicators we actually care about.
First, take a look at Prometheus's built-in default rules, which you can browse on the Rules page of the Prometheus UI.

Now let's add a rule for Redis. First, on the Prometheus Config page, look at the AlertManager-related part of the configuration:
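In a default kube-prometheus deployment that section of the generated configuration looks roughly like this (abridged sketch; the exact relabeling rules vary with the operator version):

alerting:
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      regex: alertmanager-main
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      regex: web
      action: keep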

From the alertmanagers block above we can see that Alertmanager is discovered through Kubernetes service discovery with role endpoints, keeping only endpoints whose Service is named alertmanager-main and whose port is named web. Let's inspect the alertmanager-main Service:

[root@]# kubectl describe svc alertmanager-main -n monitoring
Name:              alertmanager-main
Namespace:         monitoring
Labels:            alertmanager=main
Annotations:       <none>
Selector:          alertmanager=main,app=alertmanager
Type:              ClusterIP
IP:                10.111.141.65
Port:              web  9093/TCP
TargetPort:        web/TCP
Endpoints:         100.118.246.1:9093,100.64.147.129:9093,100.98.81.194:9093
Session Affinity:  ClientIP
Events:            <none>

We can see that the Service name is indeed alertmanager-main and the port is named web, matching the rules above, so Prometheus and the AlertManager component are wired together correctly. The corresponding alerting rule files are all of the YAML files under /etc/prometheus/rules/prometheus-k8s-rulefiles-0/. You can exec into the Prometheus Pod to verify that this directory contains YAML files.
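A quick way to check from the command line (the pod and container names below are the kube-prometheus defaults; adjust them to your deployment):

kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/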
Those YAML files are in fact just the contents of the PrometheusRule resource we created earlier.
That PrometheusRule is named prometheus-k8s-rules and lives in the monitoring namespace, so we can infer that whenever we create a PrometheusRule resource, a corresponding <namespace>-<name>.yaml file is generated under the prometheus-k8s-rulefiles-0 directory. In other words, to add a custom alert later we only need to define a PrometheusRule resource. Why does Prometheus recognise these PrometheusRule objects in the first place? Look at the Prometheus resource we created (prometheus-prometheus.yaml): it contains a very important property, ruleSelector, a label selector used to filter rule objects, which requires PrometheusRule resources to carry the labels prometheus=k8s and role=alert-rules:

ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules

So, to add a custom alerting rule we only need to create a PrometheusRule object carrying the labels prometheus=k8s and role=alert-rules. For example, let's add an alert on Redis availability; the redis_up metric tells us whether Redis is up. Create the file prometheus-redisRules.yaml:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: redis-rules
  namespace: monitoring
spec:
  groups:
  - name: redis
    rules:
    - alert: RedisUnavailable
      annotations:
        summary: redis instance info
        description: If redis_up == 0, redis will be unavailable
      expr: |
        redis_up == 0
      for: 3m
      labels:
        severity: critical
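Further rules can be appended to the same group in the same way. As an illustration only (the threshold is arbitrary, not a recommendation), a rule on the redis_connected_clients metric exposed by redis_exporter could look like this:

    - alert: RedisTooManyConnections
      annotations:
        summary: redis instance info
        description: Redis has more than 100 connected clients
      expr: |
        redis_connected_clients > 100
      for: 5m
      labels:
        severity: warning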

After applying the PrometheusRule, we can see the redis-rules object we just created:

kubectl apply -f prometheus-redisRules.yaml

kubectl get prometheusrule -n monitoring
NAME                   AGE
etcd-rules             4d18h
prometheus-k8s-rules   17d
redis-rules            15s

Note that the labels must include both prometheus=k8s and role=alert-rules, otherwise the ruleSelector above will not match. After creating the object, wait a moment and then check the rules directory in the container again.
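For example (again assuming the default pod name, and the <namespace>-<name>.yaml naming convention described above, the listing should now include an entry for our redis-rules object):

kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- sh -c 'ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/ | grep redis'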
You can see that the rule file we created has been injected into the corresponding rulefiles directory. If you then open the Alerts page in the Prometheus UI, the newly added alerting rule shows up there as well.

Configure Alerting

We now know how to add an alerting rule, but how do the alert notifications actually get delivered?
That is handled by the Alertmanager configuration.
Here I will use email and WeChat (WeCom) as examples.

Alertmanager's configuration file, alertmanager.yaml, is created from the alertmanager-secret.yaml manifest. Let's look at the default configuration first:
cat alertmanager-secret.yaml

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "Default"
    - "name": "Watchdog"
    - "name": "Critical"
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "12h"
      "routes":
      - "match":
          "alertname": "Watchdog"
        "receiver": "Watchdog"
      - "match":
          "severity": "critical"
        "receiver": "Critical"
type: Opaque

Now we need to modify this file with the WeChat and email settings. As a prerequisite you need your own WeCom (Enterprise WeChat) credentials; tutorials for obtaining them are easy to find online.
First, create the alertmanager.yaml file:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.51os.club:25'
  smtp_from: 'amos'
  smtp_auth_username: 'amos@51os.club'
  smtp_auth_password: 'Mypassword'
  smtp_hello: '51os.club'
  smtp_require_tls: false
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'SGGc4x-RDcVD_ptvVhYrxxxxxxxxxxOhWVWIITRxM'
  wechat_api_corp_id: 'ww419xxxxxxxx735e1c0'

templates:
- '*.tmpl'

route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - receiver: wechat
    continue: true
    match:
      alertname: Watchdog

receivers:
- name: 'default'
  email_configs:
  - to: '10xxxx1648@qq.com'
    send_resolved: true
- name: 'wechat'
  wechat_configs:
  - send_resolved: false
    corp_id: 'ww419xxxxxxxx35e1c0'
    to_party: '13'
    message: '{{ template "wechat.default.message" . }}'
    agent_id: '1000003'
    api_secret: 'SGGc4x-RDcxxxxxxxxY6YwfZFsO9OhWVWIITRxM'

I added two receivers here: the default one delivers notifications by email, while the Watchdog alert is additionally routed to the wechat receiver (wechat_configs).

Admittedly I am taking a shortcut here: the cluster happens to have the Watchdog alert firing all the time, so I matched on Watchdog. You can of course replace it with our custom Redis alert, RedisUnavailable.


Then create a templates file, wechat.tmpl, which is the template used for the WeChat messages:

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
AlertType: {{ $alert.Labels.alertname }}
AlertLevel: {{ $alert.Labels.severity }}

=====================
{{- end }}
===Alert Info===
Alert Info: {{ $alert.Annotations.message }}
Alert Time: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
===More Info===
{{ if gt (len $alert.Labels.instance) 0 -}}InstanceIp: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}InstanceNamespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}NodeIP: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}PodName: {{ $alert.Labels.pod_name }}{{- end }}
=====================
{{- end }}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
AlertType: {{ $alert.Labels.alertname }}
AlertLevel: {{ $alert.Labels.severity }}

=====================
{{- end }}
===Alert Info===
Alert Info: {{ $alert.Annotations.message }}
Alert Start Time: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
Alert Fix Time: {{ $alert.EndsAt.Format "2006-01-02 15:04:05" }}
===More Info===
{{ if gt (len $alert.Labels.instance) 0 -}}InstanceIp: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}InstanceNamespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}NodeIP: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}PodName: {{ $alert.Labels.pod_name }};{{- end }}
=====================
{{- end }}
{{- end }}
{{- end }}
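Optionally, before loading the new configuration into the cluster, it can be validated locally with amtool, the CLI that ships with Alertmanager releases (run it from the directory containing both files so the '*.tmpl' glob resolves):

amtool check-config alertmanager.yaml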

Now delete the existing alertmanager-main secret, then recreate it from alertmanager.yaml and wechat.tmpl:

kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring

Shortly after these steps complete, we receive a WeChat message, and the same alert also arrives in the mailbox.

Check the AlertManager configuration again and you will see that it has been replaced by the configuration above.
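If you want to confirm the reload without waiting for a notification, port-forward to Alertmanager and look at its Status page (the Service name below is the kube-prometheus default):

kubectl port-forward -n monitoring svc/alertmanager-main 9093

Then open http://127.0.0.1:9093/#/status to see the currently loaded configuration.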