Back when we deployed web sites, one part of the architecture diagram was always the monitoring component. Building an effective monitoring platform matters a great deal for operations: it is what lets us keep our servers and services running stably and efficiently. There are several common open-source monitoring tools, such as Zabbix, Nagios, Open-Falcon, and Prometheus, each with its own strengths and weaknesses (interested readers can look them up). For monitoring a Kubernetes cluster, however, Prometheus is the friendliest fit, so today we will walk through deploying Prometheus to monitor K8S from every angle.
1. Prometheus architecture
2. K8S monitoring metrics and how they are collected
3. Deploying Prometheus on the K8S platform
4. Understanding the configuration based on K8S service discovery
5. Deploying Grafana on the K8S platform
6. Monitoring Pods, Nodes, and resource objects in the K8S cluster
7. Visualizing Prometheus monitoring data with Grafana
8. Alerting rules and alert notifications
Prometheus is a monitoring system originally built at SoundCloud. It became a community open-source project in 2012 and has a very active developer and user community. To emphasize its open-source nature and independent governance, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016, becoming its second hosted project after Kubernetes. Official sites: https://prometheus.io and https://github.com/prometheus
Prometheus stores all data as time series; samples with the same metric name and the same set of labels belong to the same time series. Each time series is uniquely identified by its metric name and a set of key-value pairs (also called labels). Time series format:
<metric name>{<label name>=<label value>, ...}  Example: api_http_requests_total{method="POST", handler="/messages"}
Instance: a target that can be scraped is called an instance. Job: a collection of instances serving the same purpose is called a job. For example:

```
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9090']
```
Monitoring of Kubernetes itself
Pod monitoring
| Monitoring target | Implementation | Examples |
|---|---|---|
| Pod performance | cAdvisor | Container CPU |
| Node performance | node-exporter | Node CPU, memory utilization |
| K8S resource objects | kube-state-metrics | Pod/Deployment/Service |

Service discovery:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
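For a feel of what Kubernetes service discovery looks like, below is a minimal, illustrative scrape job taken from the same pattern used in the full prometheus-configmap.yaml later in this article: it discovers every node through the API server and copies the node labels onto the scraped series.

```yaml
# Illustrative snippet only; the complete configuration appears in prometheus-configmap.yaml below.
- job_name: kubernetes-nodes-kubelet
  kubernetes_sd_configs:
    - role: node              # other supported roles: endpoints, service, pod, ingress
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
```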
| IP address | Role | Notes |
|---|---|---|
| 192.168.73.136 | nfs | |
| 192.168.73.138 | k8s-master | |
| 192.168.73.139 | k8s-node01 | |
| 192.168.73.140 | k8s-node02 | |
| 192.168.73.135 | k8s-node03 | |
```
[root@k8s-master src]# git clone https://github.com/zhangdongdong7/k8s-prometheus.git
Cloning into 'k8s-prometheus'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
[root@k8s-master src]# cd k8s-prometheus/
[root@k8s-master k8s-prometheus]# ls
alertmanager-configmap.yaml          kube-state-metrics-rbac.yaml             prometheus-rbac.yaml
alertmanager-deployment.yaml         kube-state-metrics-service.yaml          prometheus-rules.yaml
alertmanager-pvc.yaml                node_exporter-0.17.0.linux-amd64.tar.gz  prometheus-service.yaml
alertmanager-service.yaml            node_exporter.sh                         prometheus-statefulset-static-pv.yaml
grafana.yaml                         OWNERS                                   prometheus-statefulset.yaml
kube-state-metrics-deployment.yaml   prometheus-configmap.yaml                README.md
```
RBAC (Role-Based Access Control) handles authorization. Write the authorization YAML:
```
[root@k8s-master prometheus-k8s]# vim prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-system
```
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rbac.yaml
A ConfigMap is used to store configuration that does not need to be encrypted. Be sure to change the node IP addresses in the kubernetes-nodes job to match your own environment.
[root@k8s-master prometheus-k8s]# vim prometheus-configmap.yaml # Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/ apiVersion: v1 kind: ConfigMap # metadata: name: prometheus-config namespace: kube-system labels: kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: EnsureExists data: prometheus.yml: | rule_files: - /etc/config/rules/*.rules scrape_configs: - job_name: prometheus static_configs: - targets: - localhost:9090 - job_name: kubernetes-nodes scrape_interval: 30s static_configs: - targets: - 192.168.73.135:9100 - 192.168.73.138:9100 - 192.168.73.139:9100 - 192.168.73.140:9100 - job_name: kubernetes-apiservers kubernetes_sd_configs: - role: endpoints relabel_configs: - action: keep regex: default;kubernetes;https source_labels: - __meta_kubernetes_namespace - __meta_kubernetes_service_name - __meta_kubernetes_endpoint_port_name scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt insecure_skip_verify: true bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token - job_name: kubernetes-nodes-kubelet kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt insecure_skip_verify: true bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token - job_name: kubernetes-nodes-cadvisor kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __metrics_path__ replacement: /metrics/cadvisor scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt insecure_skip_verify: true bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token - job_name: kubernetes-service-endpoints kubernetes_sd_configs: - role: endpoints relabel_configs: - action: keep regex: true source_labels: - __meta_kubernetes_service_annotation_prometheus_io_scrape - action: replace regex: (https?) 
source_labels: - __meta_kubernetes_service_annotation_prometheus_io_scheme target_label: __scheme__ - action: replace regex: (.+) source_labels: - __meta_kubernetes_service_annotation_prometheus_io_path target_label: __metrics_path__ - action: replace regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 source_labels: - __address__ - __meta_kubernetes_service_annotation_prometheus_io_port target_label: __address__ - action: labelmap regex: __meta_kubernetes_service_label_(.+) - action: replace source_labels: - __meta_kubernetes_namespace target_label: kubernetes_namespace - action: replace source_labels: - __meta_kubernetes_service_name target_label: kubernetes_name - job_name: kubernetes-services kubernetes_sd_configs: - role: service metrics_path: /probe params: module: - http_2xx relabel_configs: - action: keep regex: true source_labels: - __meta_kubernetes_service_annotation_prometheus_io_probe - source_labels: - __address__ target_label: __param_target - replacement: blackbox target_label: __address__ - source_labels: - __param_target target_label: instance - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: - __meta_kubernetes_namespace target_label: kubernetes_namespace - source_labels: - __meta_kubernetes_service_name target_label: kubernetes_name - job_name: kubernetes-pods kubernetes_sd_configs: - role: pod relabel_configs: - action: keep regex: true source_labels: - __meta_kubernetes_pod_annotation_prometheus_io_scrape - action: replace regex: (.+) source_labels: - __meta_kubernetes_pod_annotation_prometheus_io_path target_label: __metrics_path__ - action: replace regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 source_labels: - __address__ - __meta_kubernetes_pod_annotation_prometheus_io_port target_label: __address__ - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - action: replace source_labels: - __meta_kubernetes_namespace target_label: kubernetes_namespace - action: replace source_labels: - __meta_kubernetes_pod_name target_label: kubernetes_pod_name alerting: alertmanagers: - static_configs: - targets: ["alertmanager:80"]
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-configmap.yaml
Here a StorageClass is used for dynamic provisioning to persist the Prometheus data. For details on how to set that up, see the earlier article "NFS dynamic storage provisioning in k8s". Alternatively, you can use the statically provisioned prometheus-statefulset-static-pv.yaml for persistence.
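For reference, the StorageClass referenced by the StatefulSet below might look roughly like this sketch. The provisioner value is an assumption: it must match the PROVISIONER_NAME configured for your nfs-client-provisioner (see the NFS article), so treat this as illustrative rather than a copy-paste file.

```yaml
# Sketch only; "fuseim.pri/ifs" is an assumed provisioner name from the common
# nfs-client-provisioner example and must match your own deployment.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-nfs-storage        # referenced by storageClassName in prometheus-statefulset.yaml
provisioner: fuseim.pri/ifs
parameters:
  archiveOnDelete: "false"
```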
[root@k8s-master prometheus-k8s]# vim prometheus-statefulset.yaml apiVersion: apps/v1 kind: StatefulSet metadata: name: prometheus namespace: kube-system labels: k8s-app: prometheus kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile version: v2.2.1 spec: serviceName: "prometheus" replicas: 1 podManagementPolicy: "Parallel" updateStrategy: type: "RollingUpdate" selector: matchLabels: k8s-app: prometheus template: metadata: labels: k8s-app: prometheus annotations: scheduler.alpha.kubernetes.io/critical-pod: '' spec: priorityClassName: system-cluster-critical serviceAccountName: prometheus initContainers: - name: "init-chown-data" image: "busybox:latest" imagePullPolicy: "IfNotPresent" command: ["chown", "-R", "65534:65534", "/data"] volumeMounts: - name: prometheus-data mountPath: /data subPath: "" containers: - name: prometheus-server-configmap-reload image: "jimmidyson/configmap-reload:v0.1" imagePullPolicy: "IfNotPresent" args: - --volume-dir=/etc/config - --webhook-url=http://localhost:9090/-/reload volumeMounts: - name: config-volume mountPath: /etc/config readOnly: true resources: limits: cpu: 10m memory: 10Mi requests: cpu: 10m memory: 10Mi - name: prometheus-server image: "prom/prometheus:v2.2.1" imagePullPolicy: "IfNotPresent" args: - --config.file=/etc/config/prometheus.yml - --storage.tsdb.path=/data - --web.console.libraries=/etc/prometheus/console_libraries - --web.console.templates=/etc/prometheus/consoles - --web.enable-lifecycle ports: - containerPort: 9090 readinessProbe: httpGet: path: /-/ready port: 9090 initialDelaySeconds: 30 timeoutSeconds: 30 livenessProbe: httpGet: path: /-/healthy port: 9090 initialDelaySeconds: 30 timeoutSeconds: 30 # based on 10 running nodes with 30 pods each resources: limits: cpu: 200m memory: 1000Mi requests: cpu: 200m memory: 1000Mi volumeMounts: - name: config-volume mountPath: /etc/config - name: prometheus-data mountPath: /data subPath: "" - name: prometheus-rules mountPath: /etc/config/rules terminationGracePeriodSeconds: 300 volumes: - name: config-volume configMap: name: prometheus-config - name: prometheus-rules configMap: name: prometheus-rules volumeClaimTemplates: - metadata: name: prometheus-data spec: storageClassName: managed-nfs-storage accessModes: - ReadWriteOnce resources: requests: storage: "16Gi"
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-statefulset.yaml
Check the status:
```
[root@k8s-master prometheus-k8s]# kubectl get pod -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
alertmanager-5d75d5688f-fmlq6           2/2     Running   0          8d
coredns-5bd5f9dbd9-wv45t                1/1     Running   1          8d
grafana-0                               1/1     Running   2          14d
kube-state-metrics-7c76bdbf68-kqqgd     2/2     Running   6          13d
kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          14d
prometheus-0                            2/2     Running   6          14d
```
You can see a pod named prometheus-0; this is the stateful deployment created just now with the StatefulSet controller. A status of Running is normal. If it is not Running, use kubectl describe pod prometheus-0 -n kube-system to see the error details.
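For example, typical troubleshooting commands (the container names come from prometheus-statefulset.yaml above) would be:

```bash
# Describe the pod to see scheduling or volume-mount errors
kubectl describe pod prometheus-0 -n kube-system
# Logs of the two containers defined in the StatefulSet
kubectl logs prometheus-0 -c prometheus-server -n kube-system
kubectl logs prometheus-0 -c prometheus-server-configmap-reload -n kube-system
```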
Here a NodePort is used to pin the access port to a fixed, easy-to-remember value.
```
[root@k8s-master prometheus-k8s]# vim prometheus-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
    - name: http
      port: 9090
      protocol: TCP
      targetPort: 9090
      nodePort: 30090
  selector:
    k8s-app: prometheus
```
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-service.yaml
Check:
```
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5bd5f9dbd9-wv45t                1/1     Running   1          8d
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          14d
pod/prometheus-0                            2/2     Running   6          14d

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    13d
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP    16d
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP   13d
```
Access the UI using any node IP plus the NodePort, i.e. http://NodeIP:Port; in this example that is http://192.168.73.139:30090. A successful visit looks like the screenshot below:
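Besides the browser, a quick command-line check works too (any node IP will do, given the NodePort Service above; /-/ready is the same endpoint used by the readiness probe in the StatefulSet):

```bash
curl http://192.168.73.139:30090/-/healthy   # HTTP 200 with a short "healthy" message when Prometheus is up
curl http://192.168.73.139:30090/-/ready     # readiness endpoint
```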
As the web UI above shows, Prometheus's built-in interface offers only basic functionality; its visualization features are not rich enough for day-to-day monitoring, so we usually combine Prometheus with Grafana for visualizing the data. Official links: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus and https://grafana.com/grafana/download. The project cloned earlier already contains a Grafana YAML; adjust it for your own environment.
[root@k8s-master prometheus-k8s]# vim grafana.yaml apiVersion: apps/v1 kind: StatefulSet metadata: name: grafana namespace: kube-system spec: serviceName: "grafana" replicas: 1 selector: matchLabels: app: grafana template: metadata: labels: app: grafana spec: containers: - name: grafana image: grafana/grafana ports: - containerPort: 3000 protocol: TCP resources: limits: cpu: 100m memory: 256Mi requests: cpu: 100m memory: 256Mi volumeMounts: - name: grafana-data mountPath: /var/lib/grafana subPath: grafana securityContext: fsGroup: 472 runAsUser: 472 volumeClaimTemplates: - metadata: name: grafana-data spec: storageClassName: managed-nfs-storage accessModes: - ReadWriteOnce resources: requests: storage: "1Gi" --- apiVersion: v1 kind: Service metadata: name: grafana namespace: kube-system spec: type: NodePort ports: - port : 80 targetPort: 3000 nodePort: 30091 selector: app: grafana
Access Grafana using any node IP plus the NodePort, i.e. http://NodeIP:Port; in this example that is http://192.168.73.139:30091. You will be asked to log in; the default username and password are both admin, and you will be prompted to change the password after logging in.
The UI after logging in looks like this:
Step one is to add a data source: click the "create your first data source" database icon and add it as shown in the figure below.
Step two: after filling in the form, click the green Save & Test button at the bottom; if it reports "Data source is working", the data source has been added successfully.
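The values entered are roughly the following sketch. The in-cluster URL is derived from the Service name in prometheus-service.yaml (Grafana runs in the same kube-system namespace); the NodePort URL works just as well.

```
Name:   prometheus
Type:   Prometheus
URL:    http://prometheus:9090        # or http://192.168.73.139:30090 via the NodePort
Access: Server (the default)
```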
Pod: the kubelet on each node uses the metrics endpoint provided by cAdvisor to obtain performance metrics for all Pods and containers on that node. Exposed endpoints: https://NodeIP:10255/metrics/cadvisor and https://NodeIP:10250/metrics/cadvisor
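Once these targets are being scraped, the cAdvisor metrics can be queried directly in Prometheus; a couple of illustrative PromQL examples follow (note the pod label may be pod_name or pod depending on your Kubernetes version):

```
# per-pod CPU usage over the last 5 minutes, from cAdvisor
sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod_name)
# per-pod working-set memory
sum(container_memory_working_set_bytes{image!=""}) by (pod_name)
```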
Node: node_exporter is used to collect node resource utilization. https://github.com/prometheus/node_exporter Documentation: https://prometheus.io/docs/guides/node-exporter/ Use the node_exporter.sh script to deploy the node_exporter collector on each of the node servers; the script can be run as-is without modification.
```
[root@k8s-master prometheus-k8s]# cat node_exporter.sh
#!/bin/bash

wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz

tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
[root@k8s-master prometheus-k8s]# ./node_exporter.sh
```
```
[root@k8s-master prometheus-k8s]# ps -ef | grep node_exporter
root       6227      1  0 Oct08 ?        00:06:43 /usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service
root     118269 117584  0 23:27 pts/0    00:00:00 grep --color=auto node_exporter
```
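You can also confirm the collector is actually serving metrics on its default port 9100 (the same port listed under the kubernetes-nodes job in prometheus-configmap.yaml):

```bash
curl -s http://192.168.73.139:9100/metrics | grep "^node_cpu" | head -5
```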
Resource objects: kube-state-metrics collects status information for the various resource objects in k8s; it only needs to be deployed once, on the master node. https://github.com/kubernetes/kube-state-metrics
[root@k8s-master prometheus-k8s]# vim kube-state-metrics-rbac.yaml apiVersion: v1 kind: ServiceAccount metadata: name: kube-state-metrics namespace: kube-system labels: kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: kube-state-metrics labels: kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile rules: - apiGroups: [""] resources: - configmaps - secrets - nodes - pods - services - resourcequotas - replicationcontrollers - limitranges - persistentvolumeclaims - persistentvolumes - namespaces - endpoints verbs: ["list", "watch"] - apiGroups: ["extensions"] resources: - daemonsets - deployments - replicasets verbs: ["list", "watch"] - apiGroups: ["apps"] resources: - statefulsets verbs: ["list", "watch"] - apiGroups: ["batch"] resources: - cronjobs - jobs verbs: ["list", "watch"] - apiGroups: ["autoscaling"] resources: - horizontalpodautoscalers verbs: ["list", "watch"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: kube-state-metrics-resizer namespace: kube-system labels: kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile rules: - apiGroups: [""] resources: - pods verbs: ["get"] - apiGroups: ["extensions"] resources: - deployments resourceNames: ["kube-state-metrics"] verbs: ["get", "update"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: kube-state-metrics labels: kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: kube-state-metrics subjects: - kind: ServiceAccount name: kube-state-metrics namespace: kube-system --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: kube-state-metrics namespace: kube-system labels: kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile roleRef: apiGroup: rbac.authorization.k8s.io kind: Role name: kube-state-metrics-resizer subjects: - kind: ServiceAccount name: kube-state-metrics namespace: kube-system [root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-rbac.yaml
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-deployment.yaml apiVersion: apps/v1 kind: Deployment metadata: name: kube-state-metrics namespace: kube-system labels: k8s-app: kube-state-metrics kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile version: v1.3.0 spec: selector: matchLabels: k8s-app: kube-state-metrics version: v1.3.0 replicas: 1 template: metadata: labels: k8s-app: kube-state-metrics version: v1.3.0 annotations: scheduler.alpha.kubernetes.io/critical-pod: '' spec: priorityClassName: system-cluster-critical serviceAccountName: kube-state-metrics containers: - name: kube-state-metrics image: lizhenliang/kube-state-metrics:v1.3.0 ports: - name: http-metrics containerPort: 8080 - name: telemetry containerPort: 8081 readinessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 5 timeoutSeconds: 5 - name: addon-resizer image: lizhenliang/addon-resizer:1.8.3 resources: limits: cpu: 100m memory: 30Mi requests: cpu: 100m memory: 30Mi env: - name: MY_POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: MY_POD_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace volumeMounts: - name: config-volume mountPath: /etc/config command: - /pod_nanny - --config-dir=/etc/config - --container=kube-state-metrics - --cpu=100m - --extra-cpu=1m - --memory=100Mi - --extra-memory=2Mi - --threshold=5 - --deployment=kube-state-metrics volumes: - name: config-volume configMap: name: kube-state-metrics-config --- # Config map for resource configuration. apiVersion: v1 kind: ConfigMap metadata: name: kube-state-metrics-config namespace: kube-system labels: k8s-app: kube-state-metrics kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile data: NannyConfiguration: |- apiVersion: nannyconfig/v1alpha1 kind: NannyConfiguration [root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-deployment.yaml
```
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics
      protocol: TCP
    - name: telemetry
      port: 8081
      targetPort: telemetry
      protocol: TCP
  selector:
    k8s-app: kube-state-metrics
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-service.yaml
```
```
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/alertmanager-5d75d5688f-fmlq6           2/2     Running   0          9d
pod/coredns-5bd5f9dbd9-wv45t                1/1     Running   1          9d
pod/grafana-0                               1/1     Running   2          15d
pod/kube-state-metrics-7c76bdbf68-kqqgd     2/2     Running   6          14d
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          16d
pod/prometheus-0                            2/2     Running   6          15d

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
service/alertmanager           ClusterIP   10.0.0.207   <none>        80/TCP              13d
service/grafana                NodePort    10.0.0.74    <none>        80:30091/TCP        15d
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP       14d
service/kube-state-metrics     ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   14d
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP       17d
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP      14d
[root@k8s-master prometheus-k8s]#
```
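With kube-state-metrics running, the state of resource objects becomes queryable in Prometheus. A few illustrative example queries (standard kube-state-metrics metric names):

```
# number of Pods per phase
sum(kube_pod_status_phase) by (phase)
# available replicas of each Deployment in kube-system
kube_deployment_status_replicas_available{namespace="kube-system"}
# nodes currently reporting Ready
kube_node_status_condition{condition="Ready",status="true"}
```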
When collecting data with Prometheus we typically want to monitor the Pods, Nodes, and resource objects in the K8S cluster, so the corresponding exporters and collectors must be installed to provide the endpoints the data is scraped from. These were configured in section 4.3; you can check the state of each collector under Status > Targets in the Prometheus UI, as shown below:
Only when every target is in the UP state can we use the built-in UI to query the data for a given metric, as shown below:
As the screenshots above show, Prometheus's own UI is fairly limited for visualization and does not meet our needs, so we pair it with Grafana to visualize the monitoring data. Grafana was deployed in the previous chapter, so all that remains is to add dashboards and panels to present the metrics we care about. In practice the Grafana community already offers many mature dashboard templates that can be imported directly; just adapt the panel queries to your own environment. https://grafana.com/grafana/dashboards
Recommended dashboard templates:
After importing a template, if a panel shows no data, click Edit on the panel to inspect its PromQL query, then debug the query in Prometheus's own UI to verify it returns values. The dashboard after these adjustments is shown below.
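For example, a node CPU panel usually boils down to an expression like the one below (the same expression used later in the NodeCPUUsage alert rule), which you can paste into the Prometheus UI to confirm it returns data:

```
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
```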
Resource status monitoring: 6417. Likewise, import the resource-status dashboard template; after adjustment the dashboard shows the status of the various k8s resources, as in the figure below.
Node monitoring: 9276. Likewise, import the node monitoring template; after adjustment the dashboard shows the basic state of each node, as in the figure below.
We will use email to deliver the alert notifications.
```
[root@k8s-master prometheus-k8s]# vim prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} has stopped working"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rules.yaml
```
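Since the StatefulSet mounts this ConfigMap at /etc/config/rules (matching rule_files in prometheus.yml) and Prometheus is started with --web.enable-lifecycle, a manual reload is enough to pick up rule changes without restarting the pod:

```bash
# trigger a config/rule reload through the lifecycle API (NodePort from prometheus-service.yaml)
curl -X POST http://192.168.73.139:30090/-/reload
# the loaded rules then show up under Status -> Rules and on the Alerts page in the web UI
```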
```
[root@k8s-master prometheus-k8s]# vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'mail.goldwind.com.cn:587'            # SMTP server; check it in your mail account settings
      smtp_from: 'goldwindscada@goldwind.com.cn'            # configure with your own sender mailbox
      smtp_auth_username: 'goldwindscada@goldwind.com.cn'
      smtp_auth_password: 'Dbadmin@123'

    receivers:
    - name: default-receiver
      email_configs:
      - to: "zhangdongdong27459@goldwind.com.cn"

    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-configmap.yaml
```
```
[root@k8s-master prometheus-k8s]# vim alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-pvc.yaml
```
[root@k8s-master prometheus-k8s]# vim alertmanager-deployment.yaml apiVersion: apps/v1 kind: Deployment metadata: name: alertmanager namespace: kube-system labels: k8s-app: alertmanager kubernetes.io/cluster-service: "true" addonmanager.kubernetes.io/mode: Reconcile version: v0.14.0 spec: replicas: 1 selector: matchLabels: k8s-app: alertmanager version: v0.14.0 template: metadata: labels: k8s-app: alertmanager version: v0.14.0 annotations: scheduler.alpha.kubernetes.io/critical-pod: '' spec: priorityClassName: system-cluster-critical containers: - name: prometheus-alertmanager image: "prom/alertmanager:v0.14.0" imagePullPolicy: "IfNotPresent" args: - --config.file=/etc/config/alertmanager.yml - --storage.path=/data - --web.external-url=/ ports: - containerPort: 9093 readinessProbe: httpGet: path: /#/status port: 9093 initialDelaySeconds: 30 timeoutSeconds: 30 volumeMounts: - name: config-volume mountPath: /etc/config - name: storage-volume mountPath: "/data" subPath: "" resources: limits: cpu: 10m memory: 50Mi requests: cpu: 10m memory: 50Mi - name: prometheus-alertmanager-configmap-reload image: "jimmidyson/configmap-reload:v0.1" imagePullPolicy: "IfNotPresent" args: - --volume-dir=/etc/config - --webhook-url=http://localhost:9093/-/reload volumeMounts: - name: config-volume mountPath: /etc/config readOnly: true resources: limits: cpu: 10m memory: 10Mi requests: cpu: 10m memory: 10Mi volumes: - name: config-volume configMap: name: alertmanager-config - name: storage-volume persistentVolumeClaim: claimName: alertmanager [root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-deployment.yaml
```
[root@k8s-master prometheus-k8s]# vim alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: "ClusterIP"
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-service.yaml
```
```
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
pod/alertmanager-5d75d5688f-fmlq6           2/2     Running   4          10d   172.17.15.2   192.168.73.140   <none>           <none>
pod/coredns-5bd5f9dbd9-qxvmz                1/1     Running   0          42m   172.17.33.2   192.168.73.138   <none>           <none>
pod/grafana-0                               1/1     Running   3          16d   172.17.31.2   192.168.73.139   <none>           <none>
pod/kube-state-metrics-7c76bdbf68-hv56m     2/2     Running   0          23h   172.17.15.3   192.168.73.140   <none>           <none>
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   6          17d   172.17.31.4   192.168.73.139   <none>           <none>
pod/prometheus-0                            2/2     Running   8          16d   172.17.83.2   192.168.73.135   <none>           <none>

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE   SELECTOR
service/alertmanager           ClusterIP   10.0.0.207   <none>        80/TCP              14d   k8s-app=alertmanager
service/grafana                NodePort    10.0.0.74    <none>        80:30091/TCP        16d   app=grafana
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP       42m   k8s-app=kube-dns
service/kube-state-metrics     ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   15d   k8s-app=kube-state-metrics
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP       18d   k8s-app=kubernetes-dashboard
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP      15d   k8s-app=prometheus
```
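Because the alertmanager Service is ClusterIP only, a convenient way to peek at its UI from the master (a convenience sketch, not part of the deployment itself) is kubectl port-forward; the Service's port 80 maps to container port 9093:

```bash
kubectl -n kube-system port-forward svc/alertmanager 9093:80
# then browse http://localhost:9093 to see silences and any currently firing alerts
```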