In chapters one and two, setting up Prometheus by hand was both expensive and cumbersome, and making components such as Prometheus and AlertManager themselves highly available raises the cost even further. We could of course meet these requirements with custom tooling, but we also know that Prometheus has native support for Kubernetes and can monitor the cluster automatically through service discovery. So there is a more advanced way to deploy Prometheus: the Operator framework.
Operator
An Operator, developed by CoreOS, extends the Kubernetes API with application-specific controllers that create, configure and manage complex stateful applications such as databases, caches and monitoring systems. It builds on the Kubernetes concepts of resources and controllers, but adds application-specific domain knowledge. The key to building an Operator is the design of its CRD (custom resource definition).
An Operator codifies the knowledge human operators have about running a piece of software, and uses Kubernetes' powerful abstractions to manage that software at scale. CoreOS currently ships several official Operator implementations, including today's protagonist, the Prometheus Operator. At its core, an Operator is built on the following two Kubernetes concepts: custom resources (CRDs) and custom controllers.
CoreOS currently provides four Operators.
Next, we will use the Operator to deploy Prometheus.
We will install directly from the Prometheus-Operator source here. You could also install everything with a single Helm command, but installing from source lets us see more of the implementation details. First, clone the source:
git clone https://github.com/coreos/prometheus-operator
cd prometheus-operator/contrib/kube-prometheus/manifests
The manifests directory contains all the resource manifests we need; simply run the create command inside this folder:
kubectl apply -f .
Once the deployment finishes, a namespace named monitoring is created and all the resource objects are deployed into it. In addition, the Operator automatically creates 4 CRD resource objects:
kubectl get crd | grep coreos
alertmanagers.monitoring.coreos.com     2019-03-18T02:43:57Z
prometheuses.monitoring.coreos.com      2019-03-18T02:43:58Z
prometheusrules.monitoring.coreos.com   2019-03-18T02:43:58Z
servicemonitors.monitoring.coreos.com   2019-03-18T02:43:58Z
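The stack also ships a set of default custom resources built on these CRDs. As a quick check (a sketch using nothing beyond standard kubectl), we can list the ServiceMonitor, Prometheus and Alertmanager objects it created:

kubectl get servicemonitors -n monitoring
kubectl get prometheus,alertmanager -n monitoring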
We can look at all the Pods in the monitoring namespace. The alertmanager and prometheus instances are managed by StatefulSet controllers, and there is also a central prometheus-operator Pod that controls the other resource objects and watches them for changes:
kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          37m
alertmanager-main-1                    2/2     Running   0          34m
alertmanager-main-2                    2/2     Running   0          33m
grafana-7489c49998-pkl8w               1/1     Running   0          40m
kube-state-metrics-d6cf6c7b5-7dwpg     4/4     Running   0          27m
node-exporter-dlp25                    2/2     Running   0          40m
node-exporter-fghlp                    2/2     Running   0          40m
node-exporter-mxwdm                    2/2     Running   0          40m
node-exporter-r9v92                    2/2     Running   0          40m
prometheus-adapter-84cd9c96c9-n92n4    1/1     Running   0          40m
prometheus-k8s-0                       3/3     Running   1          37m
prometheus-k8s-1                       3/3     Running   1          37m
prometheus-operator-7b74946bd6-vmbcj   1/1     Running   0          40m
Check the Services that were created:
kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       ClusterIP   10.110.43.207   <none>        9093/TCP            40m
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,6783/TCP   38m
grafana                 ClusterIP   10.109.160.0    <none>        3000/TCP            40m
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP   40m
node-exporter           ClusterIP   None            <none>        9100/TCP            40m
prometheus-adapter      ClusterIP   10.105.174.21   <none>        443/TCP             40m
prometheus-k8s          ClusterIP   10.97.195.143   <none>        9090/TCP            40m
prometheus-operated     ClusterIP   None            <none>        9090/TCP            38m
prometheus-operator     ClusterIP   None            <none>        8080/TCP            40m
A ClusterIP Service has been created for both grafana and prometheus. To reach these two services from outside the cluster we could create matching Ingress objects or use NodePort Services; for simplicity we will just use NodePort here. Edit the grafana and prometheus-k8s Services and change their type to NodePort:
kubectl edit svc grafana -n monitoring
kubectl edit svc prometheus-k8s -n monitoring
kubectl get svc -n monitoring
NAME             TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
.....
grafana          NodePort   10.109.160.0    <none>        3000:31740/TCP   42m
prometheus-k8s   NodePort   10.97.195.143   <none>        9090:31310/TCP   42m
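If you prefer not to edit the Services interactively, a non-interactive sketch with kubectl patch achieves the same type change (the service names and namespace come from the listing above):

kubectl patch svc grafana -n monitoring -p '{"spec": {"type": "NodePort"}}'
kubectl patch svc prometheus-k8s -n monitoring -p '{"spec": {"type": "NodePort"}}'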
After the change we can reach both services through the node port, for example the Prometheus targets page:
Most targets are healthy; only a couple of monitoring targets are not picked up, namely the kube-controller-manager and kube-scheduler system components. This comes down to how their ServiceMonitors are defined, so let's first look at the ServiceMonitor definition for kube-scheduler: (prometheus-serviceMonitorKubeScheduler.yaml)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s        # scrape every 30s
    port: http-metrics   # name of the port on the target Service
  jobLabel: k8s-app
  namespaceSelector:     # match Services in the listed namespaces; use any: true to match all namespaces
    matchNames:
    - kube-system
  selector:              # labels of the Service to match; with matchLabels every listed label must match, with matchExpressions a Service matching at least one expression is selected
    matchLabels:
      k8s-app: kube-scheduler
The above is a typical ServiceMonitor declaration. Through selector.matchLabels it matches Services in the kube-system namespace that carry the label k8s-app=kube-scheduler, but no such Service exists in our cluster, so we need to create one manually: (prometheus-kubeSchedulerService.yaml)
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
The most important parts are the labels and selector sections. The labels section must match the selector of the ServiceMonitor object above, and the selector here is component=kube-scheduler. Why that label? We can describe the kube-scheduler Pod to find out:
$ kubectl describe pod kube-scheduler-k8s-master -n kube-system
Name:               kube-scheduler-k8s-master
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s-master/172.16.138.40
Start Time:         Tue, 19 Feb 2019 21:15:05 -0500
Labels:             component=kube-scheduler
                    tier=control-plane
......
The Pod carries the labels component=kube-scheduler and tier=control-plane. The first one identifies it more uniquely, so it is the better choice. With it, the Service above selects our Pod, and we can create the Service directly:
$ kubectl create -f prometheus-kubeSchedulerService.yaml
$ kubectl get svc -n kube-system -l k8s-app=kube-scheduler
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
kube-scheduler   ClusterIP   10.103.165.58   <none>        10251/TCP   4m
A short while after creating it, check the state of the kube-scheduler target in Prometheus again: the target is now discovered, but scraping it fails. The error occurs because this cluster was set up with kubeadm, where kube-scheduler binds to 127.0.0.1 by default, while Prometheus tries to reach it through the node IP, so the connection is refused. We only need to change the kube-scheduler bind address to 0.0.0.0. Because kube-scheduler runs as a static Pod, we just edit the corresponding YAML file (kube-scheduler.yaml) in the static Pod directory:
$ cd /etc/kubernetes/manifests
# change the --address flag in the command section of kube-scheduler.yaml to 0.0.0.0
$ vim kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --address=0.0.0.0
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
....
After the modification, move the file out of the directory and move it back a moment later (a sketch of this follows); the static Pod is then updated automatically. Afterwards, check whether the kube-scheduler target in Prometheus has become healthy:
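A sketch of that move-out-and-back step, assuming the default kubeadm static Pod directory and an arbitrary 30-second wait; the final curl simply confirms the metrics endpoint is now reachable beyond the loopback interface:

cd /etc/kubernetes/manifests
mv kube-scheduler.yaml ../ && sleep 30 && mv ../kube-scheduler.yaml .
# <node-ip> is a placeholder for the master node's IP
curl -s http://<node-ip>:10251/metrics | head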
Now let's look at the ServiceMonitor definition for kube-controller-manager:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    metricRelabelings:
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
      - __name__
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
It selects a Service with the label k8s-app: kube-controller-manager, but no such Service exists in the cluster either, so we create one by hand. Before creating it, confirm the Pod's labels:
$ kubectl describe pod kube-controller-manager-k8s-master -n kube-system
Name:               kube-controller-manager-k8s-master
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s-master/172.16.138.40
Start Time:         Tue, 19 Feb 2019 21:15:16 -0500
Labels:             component=kube-controller-manager
                    tier=control-plane
....
Create the Service:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
After creating it, check the target: it shows the same problem as before, so we apply the same fix and modify kube-controller-manager.yaml:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-grace-period=10s
    - --pod-eviction-timeout=10s
    - --address=0.0.0.0   # modified
......
As with the scheduler, move the file out of the static Pod directory and back again a moment later so the Pod updates automatically, then check whether the kube-controller-manager target in Prometheus has become healthy:
CoreDNS exposes its metrics on port 9153. Let's check whether any Service in kube-system exposes that port:
kubectl get svc -n kube-system
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
heapster                  ClusterIP   10.96.28.220    <none>        80/TCP          19d
kube-controller-manager   ClusterIP   10.99.208.51    <none>        10252/TCP       1h
kube-dns                  ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP   188d
kube-scheduler            ClusterIP   10.103.165.58   <none>        10251/TCP       2h
kubelet                   ClusterIP   None            <none>        10250/TCP       5h
kubernetes-dashboard      NodePort    10.103.15.27    <none>        443:30589/TCP   131d
monitoring-influxdb       ClusterIP   10.103.155.57   <none>        8086/TCP        19d
tiller-deploy             ClusterIP   10.104.114.83   <none>        44134/TCP       18d
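Before exposing the port, we can optionally confirm that CoreDNS really serves metrics on 9153; a quick sketch, run from a node that can reach Pod IPs, with the Pod IP as a placeholder taken from the first command:

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# <coredns-pod-ip> is a placeholder for one of the Pod IPs listed above
curl -s http://<coredns-pod-ip>:9153/metrics | head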
Here kube-dns has no metrics port, although the metrics endpoint itself is running, so we need to expose that port through a Service. Create the Service:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-prometheus-prometheus-coredns
  labels:
    k8s-app: prometheus-operator-coredns
spec:
  selector:
    k8s-app: kube-dns
  ports:
  - name: metrics
    port: 9153
    targetPort: 9153
    protocol: TCP
The Service we start here carries the label k8s-app: prometheus-operator-coredns, so we also need to change the label selector in the CoreDNS ServiceMonitor to match.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: coredns
  name: coredns
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 15s
    port: metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: prometheus-operator-coredns
Create and then inspect both resources:
$ kubectl apply -f prometheus-serviceMonitorCoreDNS.yaml
$ kubectl create -f prometheus-KubeDnsSvc.yaml
$ kubectl get svc -n kube-system
NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kube-prometheus-prometheus-coredns   ClusterIP   10.100.205.135   <none>        9153/TCP   1h
Now check again whether the coredns target in Prometheus has become healthy:
With the monitoring data sources configured, we can take a look at the Grafana dashboards, reached through the same kind of NodePort as above. Log in the first time with admin:admin; on the home page you should see that the Prometheus data source is already wired up and some dashboards are showing data.
Besides the resource objects, nodes and components of the Kubernetes cluster, we sometimes also need to add custom monitoring targets for our own business requirements, and adding one is very straightforward.
Next we demonstrate how to add monitoring for an etcd cluster.
Whether the etcd cluster runs outside the Kubernetes cluster or was installed inside it with kubeadm, we treat it here as an independent, external cluster, because the procedure is the same in both cases.
For security, etcd clusters usually have HTTPS client certificate authentication enabled, so for Prometheus to scrape the etcd metrics we need to provide the corresponding certificates.
Since this demo environment was built with kubeadm, we can use kubectl to find out which certificate paths etcd was started with:
$ kubectl get pods -n kube-system | grep etcd
etcd-k8s-master    1/1   Running   2773   188d
etcd-k8s-node01    1/1   Running   2      104d
$ kubectl get pod etcd-k8s-master -n kube-system -o yaml
.....
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.16.138.40:2379
    - --initial-advertise-peer-urls=https://172.16.138.40:2380
    - --initial-cluster=k8s-master=https://172.16.138.40:2380
    - --listen-client-urls=https://127.0.0.1:2379,https://172.16.138.40:2379
    - --listen-peer-urls=https://172.16.138.40:2380
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --name=k8s-master
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd-amd64:3.2.18
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
          get foo
      failureThreshold: 8
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    name: etcd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  ......
  tolerations:
  - effect: NoExecute
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
.....
We can see that the certificates etcd uses live under /etc/kubernetes/pki/etcd on the node, so first we store the certificates we need in the cluster as a Secret object (run this on the node where etcd runs):
$ kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created
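A quick sanity check, using nothing beyond plain kubectl, that the three files ended up in the secret:

kubectl describe secret etcd-certs -n monitoring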
Then reference the etcd-certs object created above in the prometheus resource object and update it directly:
nodeSelector:
  beta.kubernetes.io/os: linux
replicas: 2
secrets:
- etcd-certs
Once the update is done, the etcd certificate files created above become available inside the Prometheus Pods; we can check the exact path from within a Pod:
$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/
config_out/         console_libraries/  consoles/           prometheus.yml      rules/              secrets/
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt                    healthcheck-client.crt    healthcheck-client.key
/prometheus $
With the certificates Prometheus needs to reach the etcd cluster in place, the next step is to create the ServiceMonitor object (prometheus-serviceMonitorEtcd.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
Here we create a ServiceMonitor named etcd-k8s in the monitoring namespace. The basic attributes are the same as in the previous sections: it matches Services in the kube-system namespace carrying the label k8s-app=etcd, and jobLabel names the label used for the job name. What differs is the endpoints section, which configures the certificates used to reach etcd. Endpoints support many scrape parameters, such as relabel and proxyUrl; tlsConfig configures TLS for the scrape endpoint, and because the serverName in the certificate may not match what etcd was issued with, insecureSkipVerify=true is added.
Create the ServiceMonitor object directly:
$ kubectl create -f prometheus-serviceMonitorEtcd.yaml
servicemonitor.monitoring.coreos.com/etcd-k8s created
The ServiceMonitor is created, but there is no matching Service object yet, so we need to create one by hand (prometheus-etcdService.yaml):
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 172.16.138.40
    nodeName: etcd-k8s-master
  - ip: 172.16.138.41
    nodeName: etcd-k8s-node01
  ports:
  - name: port
    port: 2379
    protocol: TCP
This Service does not select Pods through labels as before, because as mentioned earlier the etcd cluster is often external to the Kubernetes cluster. In that case we define our own Endpoints object; note that its metadata must match the Service, and that the Service's clusterIP is set to None. If that is unfamiliar, refer back to the earlier chapter on Services.
In the subsets of the Endpoints we simply fill in the etcd cluster addresses. This is a test "high availability" cluster, so we specified the nodes' host IPs (a two-member etcd cluster is not really proper anyway: etcd is quorum-based, so two members are effectively no better than one). Create the Service resource directly:
$ kubectl create -f prometheus-etcdService.yaml
service/etcd-k8s created
endpoints/etcd-k8s created
A short while after creation, the etcd targets appear on the Prometheus dashboard's targets page:
Once data is being collected, import dashboard 3070 into Grafana to get etcd monitoring charts.
Now we know how to define a custom ServiceMonitor, but what about custom alerting rules? If we look at the Alerts page of the Prometheus dashboard there are already a number of alerting rules, some of which have fired. Where do these alerts come from, and how are they delivered to us? With the hand-rolled setup we pointed Prometheus at AlertManager instances and rules files in its configuration file; how does this work when deployed through the Operator? We can check the AlertManager-related configuration on the Config page of the Prometheus dashboard:
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    scheme: http
    path_prefix: /
    timeout: 10s
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: alertmanager-main
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
The alertmanagers instances are discovered through Kubernetes service discovery with the endpoints role, matching the Service named alertmanager-main whose port is named web. Let's look at the alertmanager-main Service:
kubectl describe svc alertmanager-main -n monitoring
Name:              alertmanager-main
Namespace:         monitoring
Labels:            alertmanager=main
Annotations:       kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"alertmanager":"main"},"name":"alertmanager-main","namespace":"monitoring"},...
Selector:          alertmanager=main,app=alertmanager
Type:              ClusterIP
IP:                10.110.43.207
Port:              web  9093/TCP
TargetPort:        web/TCP
Endpoints:         10.244.0.31:9093,10.244.2.42:9093,10.244.3.40:9093
Session Affinity:  None
Events:            <none>
The Service name is indeed alertmanager-main and the port is named web, which matches the rule above, so Prometheus and the AlertManager component are correctly wired together. The corresponding alerting rule files are all the YAML files under the /etc/prometheus/rules/prometheus-k8s-rulefiles-0/ directory. We can go into the Prometheus Pod and verify that the directory contains YAML files:
$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
monitoring-prometheus-k8s-rules.yaml
/prometheus $ cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/monitoring-prometheus-k8s-rules.yaml
groups:
- name: k8s.rules
  rules:
  - expr: |
      sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
    record: namespace:container_cpu_usage_seconds_total:sum_rate
  - expr: |
      sum by (namespace, pod_name, container_name) (
        rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
      )
    record: namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
...........
This YAML file is in fact the content of the PrometheusRule object we created earlier:
$ cat prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (namespace, pod_name, container_name) (
          rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
        )
      record: namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
.....
Our PrometheusRule is named prometheus-k8s-rules in the monitoring namespace, so we can infer that whenever a PrometheusRule resource is created, a corresponding <namespace>-<name>.yaml file is generated under the prometheus-k8s-rulefiles-0 directory above. Therefore, to add a custom alert we only need to define a PrometheusRule resource object. Why does Prometheus pick up this PrometheusRule object at all? That is determined by the prometheus resource object we created, which has an important ruleSelector attribute used to filter rules: it requires PrometheusRule objects carrying the labels prometheus=k8s and role=alert-rules.
ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules
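Because PrometheusRule is itself a CRD registered by the Operator, we can also list the rule objects this selector will pick up with plain kubectl, for example:

kubectl get prometheusrules -n monitoring -l prometheus=k8s,role=alert-rules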
So to define a custom alerting rule, we just create a PrometheusRule object carrying the labels prometheus=k8s and role=alert-rules. For example, let's add an alert on etcd availability: an etcd cluster stays available as long as more than half of its members are up, so the expression below fires once enough members are down that one more failure would make the cluster unavailable. Create the file prometheus-etcdRules.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
      for: 3m
      labels:
        severity: critical
.....
$ kubectl create -f prometheus-etcdRules.yaml
Note that the labels must at least include prometheus=k8s and role=alert-rules. A little while after creating it, look at the rules folder inside the container again:
$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
monitoring-etcd-rules.yaml            monitoring-prometheus-k8s-rules.yaml
Our rule file has been injected into the corresponding rulefiles folder, which confirms the assumption above. On the Alerts page of the Prometheus dashboard we can now see the newly created alerting rule:
We now know how to add an alerting rule, but how are these alerts actually sent out? In the earlier lessons we configured the various receivers through the AlertManager configuration file; now that the component is created through the Operator's alertmanager resource object, how do we change its configuration?
First change the alertmanager-main Service to a NodePort Service; afterwards we can view the AlertManager configuration on its status page:
$ kubectl edit svc alertmanager-main -n monitoring
......
  selector:
    alertmanager: main
    app: alertmanager
  sessionAffinity: None
  type: NodePort
.....
This configuration actually comes from the alertmanager-secret.yaml file we created earlier in the prometheus-operator/contrib/kube-prometheus/manifests directory:
apiVersion: v1
data:
  alertmanager.yaml: Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
We can base64-decode the value of the alertmanager.yaml key:
echo Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg== | base64 -d
The decoded result:
"global":
  "resolve_timeout": "5m"
"receivers":
- "name": "null"
"route":
  "group_by":
  - "job"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "null"
  "repeat_interval": "12h"
  "routes":
  - "match":
      "alertname": "DeadMansSwitch"
    "receiver": "null"
The content matches the configuration shown above, so if we want to add our own receivers or message templates, this is the file to change:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'zhaikun1992@qq.com'
  smtp_auth_username: 'zhaikun1992@qq.com'
  smtp_auth_password: '***'
  smtp_hello: 'qq.com'
  smtp_require_tls: true
templates:
- "/etc/alertmanager-tmpl/wechat.tmpl"
route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: default
  routes:
  - receiver: 'wechat'
    group_wait: 10s
    match:
      alertname: CoreDNSDown
receivers:
- name: 'default'
  email_configs:
  - to: 'zhai_kun@suixingpay.com'
    send_resolved: true
- name: 'wechat'
  wechat_configs:
  - corp_id: '***'
    to_party: '*'
    to_user: "**"
    agent_id: '***'
    api_secret: '***'
    send_resolved: true
Save the file above as alertmanager.yaml, then use it to create a Secret object:
# delete the original secret object
kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
# import our own configuration file into a new secret
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
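Instead of deleting and recreating the secret, the same update can be done in one step by piping a client-side dry run into apply; depending on your kubectl version the flag is --dry-run or --dry-run=client:

kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring --dry-run -o yaml | kubectl apply -f -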
We added two receivers: the default one sends by email, and the CoreDNSDown alert is sent via WeChat. Soon after completing the steps above we receive a WeChat message:
The same alert also arrives in the mailbox:
Looking at the configuration on the AlertManager status page again, it has been replaced by our configuration above:
The AlertManager configuration can also use templates (.tmpl files), which are added to the Secret object together with the alertmanager.yaml configuration file, for example:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-example
data:
  alertmanager.yaml: {BASE64_CONFIG}
  template_1.tmpl: {BASE64_TEMPLATE_1}
  template_2.tmpl: {BASE64_TEMPLATE_2}
...
The templates are placed in the same path as the configuration file; to actually use them, they still have to be referenced in alertmanager.yaml:
templates:
- '*.tmpl'
Once created, the Secret object is mounted into the AlertManager Pods created by the Alertmanager object.
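To confirm the mount, we can look inside one of the AlertManager Pods; the path below is an assumption about where the Operator typically mounts the alertmanager-main secret:

# /etc/alertmanager/config is an assumed mount path, verify it in your deployment
kubectl exec -it alertmanager-main-0 -n monitoring -- ls /etc/alertmanager/config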
Example: create a file alertmanager-tmpl.yaml with the following content:
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start==========
Alert program: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Description: {{ .Annotations.description }}
Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end==========
{{ end }}
{{ end }}
Delete the original secret object:
$ kubectl delete secret alertmanager-main -n monitoring secret "alertmanager-main" deleted
Create the new secret object:
$ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=alertmanager-tmpl.yaml -n monitoring secret/alertmanager-main created
After a while, the alert arrives on WeChat. Because of how the labels are defined, not all values are populated; we can customize the template to suit the actual situation.
Now consider this: if the Kubernetes cluster has a great many Services and Pods, do we really have to create a ServiceMonitor object for each of them? That would become cumbersome all over again.
To solve this, the Prometheus Operator lets us supply additional scrape configuration, so that services can be discovered and monitored automatically. As in the hand-rolled setup earlier, we want the Prometheus Operator to automatically discover and monitor Services carrying the annotation prometheus.io/scrape=true; the Prometheus configuration we defined for this previously looked as follows:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
For a Service in the cluster to be auto-discovered, we add the prometheus.io/scrape=true declaration to its annotations. Save the configuration above as prometheus-additional.yaml and create a Secret object from it:
$ kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
Once created, the configuration is base64-encoded and stored as the value of the prometheus-additional.yaml key:
$ kubectl get secret additional-configs -n monitoring -o yaml
apiVersion: v1
data:
  prometheus-additional.yaml: LSBqb2JfbmFtZTogJ2t1YmVybmV0ZXMtc2VydmljZS1lbmRwb2ludHMnCiAga3ViZXJuZXRlc19zZF9jb25maWdzOgogIC0gcm9sZTogZW5kcG9pbnRzCiAgcmVsYWJlbF9jb25maWdzOgogIC0gc291cmNlX2xhYmVsczogW19fbWV0YV9rdWJlcm5ldGVzX3NlcnZpY2VfYW5ub3RhdGlvbl9wcm9tZXRoZXVzX2lvX3NjcmFwZV0KICAgIGFjdGlvbjoga2VlcAogICAgcmVnZXg6IHRydWUKICAtIHNvdXJjZV9sYWJlbHM6IFtfX21ldGFfa3ViZXJuZXRlc19zZXJ2aWNlX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19zY2hlbWVdCiAgICBhY3Rpb246IHJlcGxhY2UKICAgIHRhcmdldF9sYWJlbDogX19zY2hlbWVfXwogICAgcmVnZXg6IChodHRwcz8pCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9hbm5vdGF0aW9uX3Byb21ldGhldXNfaW9fcGF0aF0KICAgIGFjdGlvbjogcmVwbGFjZQogICAgdGFyZ2V0X2xhYmVsOiBfX21ldHJpY3NfcGF0aF9fCiAgICByZWdleDogKC4rKQogIC0gc291cmNlX2xhYmVsczogW19fYWRkcmVzc19fLCBfX21ldGFfa3ViZXJuZXRlc19zZXJ2aWNlX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19wb3J0XQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IF9fYWRkcmVzc19fCiAgICByZWdleDogKFteOl0rKSg/OjpcZCspPzsoXGQrKQogICAgcmVwbGFjZW1lbnQ6ICQxOiQyCiAgLSBhY3Rpb246IGxhYmVsbWFwCiAgICByZWdleDogX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9sYWJlbF8oLispCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfbmFtZXNwYWNlXQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IGt1YmVybmV0ZXNfbmFtZXNwYWNlCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9uYW1lXQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IGt1YmVybmV0ZXNfbmFtZQo=
kind: Secret
metadata:
  creationTimestamp: 2019-03-20T03:38:37Z
  name: additional-configs
  namespace: monitoring
  resourceVersion: "29056864"
  selfLink: /api/v1/namespaces/monitoring/secrets/additional-configs
  uid: a579495b-4ac1-11e9-baf3-005056930126
type: Opaque
Then we only need to reference this additional configuration in the prometheus resource object declaration: (prometheus-prometheus.yaml)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    beta.kubernetes.io/os: linux
  replicas: 2
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.5.0
After adding it, update the prometheus CRD resource object directly:
$ kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com/k8s configured
After a short while, check the Prometheus dashboard to see whether the configuration has taken effect.
On the configuration page of the Prometheus dashboard we can see the new scrape config, but switching to the targets page there is still no corresponding monitoring job. The Prometheus Pod logs show why:
$ kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
level=error ts=2019-03-20T03:55:01.298281581Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:302: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list pods at the cluster scope"
level=error ts=2019-03-20T03:55:02.29813427Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:301: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list services at the cluster scope"
level=error ts=2019-03-20T03:55:02.298431046Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:300: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list endpoints at the cluster scope"
level=error ts=2019-03-20T03:55:02.299312874Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:302: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list pods at the cluster scope"
level=error ts=2019-03-20T03:55:03.299674406Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:301: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list services at the cluster scope"
level=error ts=2019-03-20T03:55:03.299757543Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:300: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list endpoints at the cluster scope"
level=error ts=2019-03-20T03:55:03.299907982Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:302: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list pods at the cluster scope"
The many errors of the form xxx is forbidden indicate an RBAC permission problem. From the prometheus resource object we know that Prometheus is bound to a ServiceAccount named prometheus-k8s, which in turn is bound to a ClusterRole named prometheus-k8s: (prometheus-clusterRole.yaml)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
These rules clearly lack list permissions on Services and Pods, hence the errors; to fix this, we just add the required permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
Update the ClusterRole resource object and then recreate all the Prometheus Pods; the targets page should then show the kubernetes-service-endpoints monitoring job:
$ kubectl apply -f prometheus-clusterRole.yaml
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured
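A sketch of the "recreate all the Prometheus Pods" step; deleting them is safe because the StatefulSet brings them back with the new permissions:

kubectl delete pod prometheus-k8s-0 prometheus-k8s-1 -n monitoring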
Two Services, both belonging to CoreDNS, are monitored automatically here; they carry two special annotations:
$ kubectl describe svc kube-dns -n kube-system
Name:              kube-dns
Namespace:         kube-system
....
Annotations:       prometheus.io/port=9153
                   prometheus.io/scrape=true
...
That is why they were auto-discovered. We can of course configure auto-discovery for Pods, Ingresses and other resource objects in the same way.
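As a sketch of what that means for your own workloads, a Service only needs the same annotations to be picked up by this job; everything below (name, namespace, port) is a made-up example, not part of the stack:

apiVersion: v1
kind: Service
metadata:
  name: my-app                      # hypothetical application Service
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # picked up by the keep rule above
    prometheus.io/port: "8080"      # port where the app exposes /metrics
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
    targetPort: 8080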
When we restarted the Prometheus Pods after fixing the permissions, a careful observer will have noticed that the previously collected data is gone. That is because the Prometheus created through the prometheus CRD does not persist its data; a look at the volume mounts of the generated Prometheus Pod makes this clear:
............
volumeMounts:
- mountPath: /etc/prometheus/config_out
  name: config-out
  readOnly: true
- mountPath: /prometheus
  name: prometheus-k8s-db
- mountPath: /etc/prometheus/rules/prometheus-k8s-rulefiles-0
.........
volumes:
- name: config
  secret:
    defaultMode: 420
    secretName: prometheus-k8s
- emptyDir: {}
The Prometheus data directory /prometheus is mounted from an emptyDir volume, whose lifetime is tied to the Pod, so when the Pod dies the data is lost; that is why the earlier data disappeared after the Pods were recreated. Production monitoring data certainly needs to be persisted, and the prometheus CRD provides a way to configure this. Since Prometheus is ultimately deployed through a StatefulSet controller, we persist the data via a StorageClass; we already built one with Rook earlier, so we can use it directly. Add the following configuration to the prometheus CRD resource object (prometheus-prometheus.yaml):
storage:
  volumeClaimTemplate:
    spec:
      storageClassName: rook-ceph-block
      resources:
        requests:
          storage: 10Gi
Note that storageClassName is the name of the StorageClass object we created earlier. Then update the prometheus CRD resource; after the update, two PVC and PV resource objects are generated automatically:
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                            STORAGECLASS      REASON   AGE
pvc-dba11961-4ad6-11e9-baf3-005056930126   10Gi       RWO            Delete           Bound    monitoring/prometheus-k8s-db-prometheus-k8s-0   rook-ceph-block            1m
pvc-dbc6bac5-4ad6-11e9-baf3-005056930126   10Gi       RWO            Delete           Bound    monitoring/prometheus-k8s-db-prometheus-k8s-1   rook-ceph-block            1m
$ kubectl get pvc -n monitoring
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
prometheus-k8s-db-prometheus-k8s-0   Bound    pvc-dba11961-4ad6-11e9-baf3-005056930126   10Gi       RWO            rook-ceph-block   2m
prometheus-k8s-db-prometheus-k8s-1   Bound    pvc-dbc6bac5-4ad6-11e9-baf3-005056930126   10Gi       RWO            rook-ceph-block   2m
Looking at the Prometheus Pod's data directory now, it is bound to a PVC object:
.......
volumeMounts:
- mountPath: /etc/prometheus/config_out
  name: config-out
  readOnly: true
- mountPath: /prometheus
  name: prometheus-k8s-db
  subPath: prometheus-db
- mountPath: /etc/prometheus/rules/prometheus-k8s-rulefiles-0
  name: prometheus-k8s-rulefiles-0
.........
volumes:
- name: prometheus-k8s-db
  persistentVolumeClaim:
    claimName: prometheus-k8s-db-prometheus-k8s-0
.........
Now even if the Pods die, the data will not be lost. Let's test it.
First run an arbitrary query to note some data.
Delete the Pods:
kubectl delete pod prometheus-k8s-1 -n monitoring
kubectl delete pod prometheus-k8s-0 -n monitoring
Check the Pod status:
kubectl get pod -n monitoring
NAME                                   READY   STATUS              RESTARTS   AGE
alertmanager-main-0                    2/2     Running             0          2d
alertmanager-main-1                    2/2     Running             0          2d
alertmanager-main-2                    2/2     Running             0          2d
grafana-7489c49998-pkl8w               1/1     Running             0          2d
kube-state-metrics-d6cf6c7b5-7dwpg     4/4     Running             0          2d
node-exporter-dlp25                    2/2     Running             0          2d
node-exporter-fghlp                    2/2     Running             0          2d
node-exporter-mxwdm                    2/2     Running             0          2d
node-exporter-r9v92                    2/2     Running             0          2d
prometheus-adapter-84cd9c96c9-n92n4    1/1     Running             0          2d
prometheus-k8s-0                       0/3     ContainerCreating   0          3s
prometheus-k8s-1                       3/3     Running             0          9s
prometheus-operator-7b74946bd6-vmbcj   1/1     Running             0          2d
The Pods are being recreated. Once they are up, run the query again:
The data is intact; nothing was lost.