Monitoring Kubernetes with kube-state-metrics and Sending WeChat Alerts

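Preface

Monitoring target        Implementation        Examples
Pod performance          cAdvisor              Container CPU and memory utilization
Node performance         node-exporter         Node CPU and memory utilization
K8S resource objects     kube-state-metrics    Pod/Deployment/Service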

Data collection

Here we use kube-state-metrics to collect data about Kubernetes resource objects.

Architecture diagram

[Figure: architecture diagram]

Metrics

The metric categories include:

  • CronJob Metrics
  • DaemonSet Metrics
  • Deployment Metrics
  • Job Metrics
  • LimitRange Metrics
  • Node Metrics
  • PersistentVolume Metrics
  • PersistentVolumeClaim Metrics
  • Pod Metrics
  • Pod Disruption Budget Metrics
  • ReplicaSet Metrics
  • ReplicationController Metrics
  • ResourceQuota Metrics
  • Service Metrics
  • StatefulSet Metrics
  • Namespace Metrics
  • Horizontal Pod Autoscaler Metrics
  • Endpoint Metrics
  • Secret Metrics
  • ConfigMap Metrics

Taking Pod as an example:

  • kube_pod_info
  • kube_pod_owner
  • kube_pod_status_phase
  • kube_pod_status_ready
  • kube_pod_status_scheduled
  • kube_pod_container_status_waiting
  • kube_pod_container_status_terminated_reason
  • ...
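
Each of these metrics is exposed in the Prometheus text format as a metric name, a set of labels, and a value. As a rough illustration, the following Python sketch parses one such line. The sample line, its label values, and the helper name `parse_sample` are made up for the example, and the parser handles only the simple case (no escaped quotes inside label values, no timestamps).

```python
import re

# Matches `name{labels} value`; a simplified pattern, not a full parser
# for the Prometheus text exposition format.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, float value)."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample; the namespace and pod name are invented.
line = 'kube_pod_status_phase{namespace="default",pod="demo-pod",phase="Running"} 1'
name, labels, value = parse_sample(line)
print(name, labels["phase"], value)
```

A value of 1 on the series whose `phase` label is `Running` means the pod is currently in that phase; the series for the other phases of the same pod carry 0.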

Deploying kube-state-metrics

By default the corresponding resources are created in the kube-system namespace; it is best not to change the namespace in the YAML files.

# Get the YAML files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git
# Deploy kube-state-metrics
kubectl apply -f kube-state-metrics-configs
# Check the pod status
kubectl get pod -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
kube-state-metrics-c698dc7b5-zstz9     1/1     Running   0          32m

Fetch the logs of the kube-state-metrics-c698dc7b5-zstz9 container, as shown in the figure:

[Figure: kube-state-metrics container logs]

Connecting the data to Prometheus

Once Prometheus is ready, just add the corresponding scrape job:

# Append to the end of sidecar/cm-kube-mon-sidecar.yaml
    - job_name: 'kube-state-metrics'
      static_configs:
        - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
# Re-apply the ConfigMap file; Prometheus does not need a restart (hot reload is configured)
kubectl apply -f sidecar/cm-kube-mon-sidecar.yaml

Open the Prometheus web UI and check on the Targets page that the new job has taken effect, as shown below:

[Figure: Prometheus Targets page]
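
The same check can be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint lists active targets with a `health` field. The sketch below is only an illustration: the function name `job_health` and the sample response are made up, and a real response carries more fields than shown.

```python
import json


def job_health(targets_response, job_name):
    """Return {scrape URL: health} for every active target of one job."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return {
        target.get("scrapeUrl", ""): target.get("health", "unknown")
        for target in active
        if target.get("labels", {}).get("job") == job_name
    }


# Trimmed, hypothetical response body; in practice fetch it from
# http://<prometheus-address>/api/v1/targets.
sample = json.loads("""
{"status": "success",
 "data": {"activeTargets": [
   {"labels": {"job": "kube-state-metrics"},
    "scrapeUrl": "http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics",
    "health": "up"}]}}
""")
print(job_health(sample, "kube-state-metrics"))
```

An empty result means Prometheus has not picked up the job at all, which usually points at the ConfigMap not having been reloaded yet.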

Connecting the data to Grafana

Once Grafana is ready, add Prometheus as a data source and import a ready-made or custom dashboard:

[Figure: Grafana dashboard]

Prometheus alerting rules

Edit the sidecar/rules-cm-kube-mon-sidecar.yaml configuration file and add the following alerting rules.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-mon
data:
  alert-rules.yaml: |-
    groups:
    - name: White box monitoring
      rules:
      - alert: Pod-Restart
        expr: changes(kube_pod_container_status_restarts_total{pod !~ "analyzer.*"}[10m]) > 0
        for: 1m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }}  Restart"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"

      - alert: Pod-Unknown/Failed
        expr: kube_pod_status_phase{phase="Unknown"} == 1 or kube_pod_status_phase{phase="Failed"} == 1
        for: 1m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} is in Unknown or Failed phase"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"

      - alert: Daemonset Unavailable
        expr: kube_daemonset_status_number_unavailable > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Daemonset: {{ $labels.daemonset }} has unavailable daemon pods"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          daemonset: "{{ $labels.daemonset }}"

      - alert: Job-Failed
        expr: kube_job_status_failed == 1
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Job: {{ $labels.job_name }} Failed"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          job: "{{ $labels.job_name }}"

      - alert: Pod NotReady
        expr: sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id)(kube_pod_status_phase{job=~".*kubernetes-service-endpoints",phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id)group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod,owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "pod: {{ $labels.pod }} has been NotReady for more than 5 minutes"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"

      - alert: Deployment replica count
        expr: (kube_deployment_spec_replicas{job=~".*kubernetes-service-endpoints"} != kube_deployment_status_replicas_available{job=~".*kubernetes-service-endpoints"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Deployment: {{ $labels.deployment }} actual replica count does not match the configured replica count"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          deployment: "{{ $labels.deployment }}"

      - alert: Statefulset replica count
        expr: (kube_statefulset_status_replicas_ready{job=~".*kubernetes-service-endpoints"} != kube_statefulset_status_replicas{job=~".*kubernetes-service-endpoints"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Statefulset: {{ $labels.statefulset }} actual replica count does not match the configured replica count"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          statefulset: "{{ $labels.statefulset }}"

      - alert: PersistentVolume (PV)
        expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "PV: {{ $labels.persistentvolume }} is in Failed or Pending phase"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          persistentvolume: "{{ $labels.persistentvolume }}"

      - alert: PersistentVolumeClaim (PVC)
        expr: kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "PVC: {{ $labels.persistentvolumeclaim }} is in Failed, Pending, or Lost phase"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          persistentvolumeclaim: "{{ $labels.persistentvolumeclaim }}"

      - alert: k8s service
        expr: kube_service_status_load_balancer_ingress != 1
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Service: {{ $labels.service }} load balancer ingress is DOWN!"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          service: "{{ $labels.service }}"
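
The first rule relies on PromQL's `changes()` function, which counts how many times a series' value changed within the range window. The Python sketch below mimics that idea on a plain list of samples to show why a container restart inside the last 10 minutes makes the expression positive. The function and the sample values are illustrative only; this is not how Prometheus implements it.

```python
def changes(samples):
    """Count value changes between consecutive samples in a window."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Hypothetical restart-counter samples scraped over a 10-minute window.
steady = [3, 3, 3, 3, 3]      # no restart in the window
restarted = [3, 3, 4, 4, 5]   # two restarts in the window

print(changes(steady) > 0)     # rule condition not met
print(changes(restarted) > 0)  # rule condition met
```

The Deployment and Statefulset rules use the same function the other way round: `changes(...[5m]) == 0` restricts the alert to workloads whose updated-replica count has been stable, so an ongoing rollout does not trigger it.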

Apply the updated sidecar/rules-cm-kube-mon-sidecar.yaml configuration file as follows:

# Wait about a minute; Prometheus is configured for hot reload, so no restart is needed
kubectl apply -f sidecar/rules-cm-kube-mon-sidecar.yaml

Integrating with AlertManager

Deploying WeChat alerts

# Get the YAML files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git && cd thanos/AlertManager
# Deploy AlertManager
## Replace the WeChat settings with your own first
kubectl apply -f cm-kube-mon-alertmanager.yaml
kubectl apply -f wechat-template-kube-mon.yaml
kubectl apply -f deploy-kube-mon-alertmanager.yaml
kubectl apply -f svc-kube-mon-alertmanager.yaml
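
To exercise the WeChat route without waiting for a real failure, one option is to post a synthetic alert to Alertmanager's v2 API (`POST /api/v2/alerts`). The sketch below only builds the JSON body; the label values, the helper name `build_test_alert`, and the Alertmanager address in the comment are assumptions, and the labels must match whatever your routing configuration expects.

```python
import json
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname, namespace, summary, minutes=5):
    """Build a one-element alert list in the shape Alertmanager's v2 API accepts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": alertname,
            "namespace": namespace,
            "service": "prometheus_bot",  # label used by the rules in this article
        },
        "annotations": {"summary": summary},
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


body = json.dumps(build_test_alert("TestAlert", "default", "synthetic test alert"))
print(body)
# Send it with any HTTP client, for example (the address is a placeholder):
#   curl -XPOST -H 'Content-Type: application/json' -d "$BODY" \
#        http://<alertmanager-address>/api/v2/alerts
```

Because `endsAt` is set a few minutes ahead, the same test should also produce a recovery message once that time has passed, provided resolved notifications are enabled for the receiver.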

Checking alert status

[Figure: alert status]

WeChat alert and recovery messages

Alert message

[Figure: WeChat alert message]

Recovery message

[Figure: WeChat recovery message]