Prometheus is an open-source monitoring system that has become the de facto standard for metrics monitoring in the cloud-native world; almost all Kubernetes core components, as well as other cloud-native systems, expose their runtime monitoring information in the Prometheus metrics format.
Its main features:
- a multi-dimensional data model, with time series identified by metric name and key/value labels
- PromQL, a flexible query language for that data model
- no reliance on distributed storage; single server nodes are autonomous
- collection via a pull model over HTTP, with pushing supported through an intermediary gateway
- targets discovered via service discovery or static configuration
- multiple modes of graphing and dashboarding support
In addition, the Prometheus ecosystem offers a variety of optional components that extend its functionality.
CoreOS provides a management tool called the Operator, a controller that manages a specific application. By extending the Kubernetes API, it helps users create, configure, and manage complex or stateful application instances (such as etcd, Redis, MySQL, Prometheus, and so on) in software.
It uses Kubernetes CRDs (Custom Resource Definitions) to deploy and configure Prometheus and the services Prometheus needs to monitor.
Prometheus-Operator uses the following two resources to configure Prometheus and the services it monitors:
- Prometheus, which defines a desired Prometheus deployment;
- ServiceMonitor, which declaratively specifies how groups of services should be monitored.
First, let's look at the Prometheus-Operator architecture diagram:
The figure above is the architecture diagram provided by the Prometheus-Operator project. The Operator is the core piece: acting as a controller, it creates four kinds of CRD resource objects — Prometheus, ServiceMonitor, AlertManager, and PrometheusRule — and then continuously watches and maintains the state of these four objects.
A prometheus resource object stands for a Prometheus Server instance, while a ServiceMonitor is an abstraction over the various exporters; an exporter is a tool whose sole job is to expose a metrics data endpoint, and Prometheus pulls its data from the metrics endpoints that ServiceMonitors describe.
Likewise, an alertmanager resource object is the abstraction for an AlertManager instance, and a PrometheusRule holds the alerting rule files consumed by the Prometheus instances.
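To illustrate how these pieces fit together, here is a minimal ServiceMonitor sketch. The application name example-app, its namespace, and the port name web are hypothetical; the release: p label assumes the chart's default behavior of having Prometheus select ServiceMonitors labeled with the Helm release name, which matches the release we install below.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical application
  namespace: monitoring
  labels:
    release: p                 # lets the chart-managed Prometheus select this monitor
spec:
  selector:
    matchLabels:
      app: example-app         # matches the Service in front of the exporter
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: web                # named Service port that serves /metrics
      interval: 30s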
This document builds a Prometheus monitoring system on top of the Prometheus Operator. For the complete configuration files, see https://github.com/coreos/prometheus-operator.
Before configuring Prometheus-Operator, prepare the following:
Add A records for the three domains alert.cnlinux.club, grafana.cnlinux.club, and prom.cnlinux.club, resolving to the load-balancer IP 10.31.90.200.
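A quick way to confirm the records are in place (assuming dig is available; nslookup works equally well):

# each query should print the load-balancer IP, 10.31.90.200
dig +short alert.cnlinux.club
dig +short grafana.cnlinux.club
dig +short prom.cnlinux.club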
Edit kube-controller-manager.yaml and kube-scheduler.yaml under /etc/kubernetes/manifests/, changing the listen address to --address=0.0.0.0, then restart the kubelet service:
systemctl restart kubelet.service
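Once kubelet has recreated the static pods, you can check that the metrics ports now answer from outside the host. The master node IP below is a placeholder; 10252 and 10251 are the controller-manager and scheduler metrics ports also used later in values.yaml.

curl -s http://<master-node-ip>:10252/metrics | head -n 3   # kube-controller-manager
curl -s http://<master-node-ip>:10251/metrics | head -n 3   # kube-scheduler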
Create the monitoring namespace:
kubectl create ns monitoring
Because etcd is accessed over HTTPS, the Prometheus container also needs the etcd certificates in order to monitor the etcd cluster. Creating a Secret is how the certificates get mounted into the Prometheus container; this Secret is referenced again later in the Prometheus-Operator configuration.
kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
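To double-check that the Secret carries the expected keys (the key names come from the source file names and must match the caFile/certFile/keyFile paths used in values.yaml below):

# should list ca.crt, healthcheck-client.crt and healthcheck-client.key
kubectl -n monitoring describe secret etcd-certs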
Download and unpack the chart:
helm fetch stable/prometheus-operator
tar zxf prometheus-operator-1.8.0.tgz
Then modify values.yaml in the prometheus-operator directory.
The modified settings are listed below (the file is long, so options left at their defaults are not shown):
nameOverride: "p"

alertmanager:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    labels: {}
    hosts:
      - alert.cnlinux.club
    tls: []
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gluster-heketi
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
        selector: {}

grafana:
  enabled: true
  adminPassword: admin    # Grafana login password
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    labels: {}
    hosts:
      - grafana.cnlinux.club

kubeApiServer:
  enabled: true
  tlsConfig:
    serverName: kubernetes
    insecureSkipVerify: true
  serviceMonitor:
    jobLabel: component
    selector:
      matchLabels:
        component: apiserver
        provider: kubernetes

kubelet:
  enabled: true
  namespace: kube-system
  serviceMonitor:
    https: true

kubeControllerManager:
  enabled: true
  endpoints: []
  service:
    port: 10252
    targetPort: 10252
    selector:
      component: kube-controller-manager

coreDns:
  enabled: true
  service:
    port: 9153
    targetPort: 9153
    selector:
      k8s-app: kube-dns

kubeEtcd:
  enabled: true
  endpoints: []
  service:
    port: 2379
    targetPort: 2379
    selector:
      component: etcd
  serviceMonitor:
    scheme: https
    insecureSkipVerify: false
    serverName: ""
    # the etcd-certs Secret is mounted in Prometheus at
    # /etc/prometheus/secrets/etcd-certs; the file names match the Secret keys
    caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
    certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
    keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key

kubeScheduler:
  enabled: true
  endpoints: []
  service:
    port: 10251
    targetPort: 10251
    selector:
      component: kube-scheduler

prometheus:
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    labels: {}
    hosts:
      - prom.cnlinux.club
  prometheusSpec:
    secrets: [etcd-certs]    # the Secret with the etcd certificates created above
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gluster-heketi
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi
        selector: {}
[root@node-01 ~]# helm install --name p --namespace monitoring ./prometheus-operator
NAME:   p
LAST DEPLOYED: Tue Feb 26 14:30:52 2019
NAMESPACE: monitoring
STATUS: DEPLOYED

RESOURCES:
==> v1beta1/DaemonSet
NAME                        DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR  AGE
p-prometheus-node-exporter  6        6        1      6           1          <none>         5s

==> v1beta2/Deployment
NAME       DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
p-grafana  1        1        1           0          5s

==> v1/PrometheusRule
NAME                                    AGE
p-alertmanager.rules                    4s
p-etcd                                  4s
p-general.rules                         4s
p-k8s.rules                             4s
p-kube-apiserver.rules                  4s
p-kube-prometheus-node-alerting.rules   4s
p-kube-prometheus-node-recording.rules  4s
p-kube-scheduler.rules                  4s
p-kubernetes-absent                     4s
p-kubernetes-apps                       4s
p-kubernetes-resources                  4s
p-kubernetes-storage                    4s
p-kubernetes-system                     4s
p-node.rules                            4s
p-prometheus-operator                   4s
p-prometheus.rules                      4s

==> v1/Pod(related)
NAME                                  READY  STATUS             RESTARTS  AGE
p-prometheus-node-exporter-48lw9      0/1    Running            0         5s
p-prometheus-node-exporter-7lpvx      0/1    Running            0         5s
p-prometheus-node-exporter-8q577      1/1    Running            0         5s
p-prometheus-node-exporter-ls8cx      0/1    Running            0         5s
p-prometheus-node-exporter-nbl2g      0/1    Running            0         5s
p-prometheus-node-exporter-v7tb5      0/1    Running            0         5s
p-grafana-fcf4dc6bb-9c6pg             0/3    ContainerCreating  0         5s
p-kube-state-metrics-57d788d69-vmh42  0/1    Running            0         5s
p-operator-666b958c4f-wvd4h           1/1    Running            0         5s

==> v1beta1/ClusterRole
NAME                            AGE
p-kube-state-metrics            6s
psp-p-prometheus-node-exporter  6s

==> v1/Service
NAME                        TYPE       CLUSTER-IP      EXTERNAL-IP  PORT(S)    AGE
p-grafana                   ClusterIP  10.245.103.159  <none>       80/TCP     5s
p-kube-state-metrics        ClusterIP  10.245.150.181  <none>       8080/TCP   5s
p-prometheus-node-exporter  ClusterIP  10.245.98.70    <none>       9100/TCP   5s
p-alertmanager              ClusterIP  10.245.10.5     <none>       9093/TCP   5s
p-coredns                   ClusterIP  None            <none>       9153/TCP   5s
p-kube-controller-manager   ClusterIP  None            <none>       10252/TCP  5s
p-kube-etcd                 ClusterIP  None            <none>       2379/TCP   5s
p-kube-scheduler            ClusterIP  None            <none>       10251/TCP  5s
p-operator                  ClusterIP  10.245.31.238   <none>       8080/TCP   5s
p-prometheus                ClusterIP  10.245.109.85   <none>       9090/TCP   5s

==> v1/ClusterRoleBinding
NAME                          AGE
p-grafana-clusterrolebinding  6s
p-alertmanager                6s
p-operator                    6s
p-operator-psp                6s
p-prometheus                  6s
p-prometheus-psp              6s

==> v1beta1/ClusterRoleBinding
NAME                            AGE
p-kube-state-metrics            6s
psp-p-prometheus-node-exporter  6s

==> v1beta1/Role
NAME       AGE
p-grafana  6s

==> v1/RoleBinding
NAME                 AGE
p-prometheus-config  5s
p-prometheus         4s
p-prometheus         4s

==> v1/Deployment
NAME        DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
p-operator  1        1        1           1          5s

==> v1/Alertmanager
NAME            AGE
p-alertmanager  5s

==> v1/Secret
NAME                         TYPE    DATA  AGE
p-grafana                    Opaque  3     6s
alertmanager-p-alertmanager  Opaque  1     6s

==> v1/ServiceAccount
NAME                        SECRETS  AGE
p-grafana                   1        6s
p-kube-state-metrics        1        6s
p-prometheus-node-exporter  1        6s
p-alertmanager              1        6s
p-operator                  1        6s
p-prometheus                1        6s

==> v1beta1/Deployment
NAME                  DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
p-kube-state-metrics  1        1        1           0          5s

==> v1beta1/Ingress
NAME            HOSTS                 ADDRESS  PORTS  AGE
p-grafana       grafana.cnlinux.club           80     5s
p-alertmanager  alert.cnlinux.club             80     5s
p-prometheus    prom.cnlinux.club              80     5s

==> v1beta1/PodSecurityPolicy
NAME                        PRIV   CAPS  SELINUX   RUNASUSER  FSGROUP    SUPGROUP   READONLYROOTFS  VOLUMES
p-grafana                   false        RunAsAny  RunAsAny   RunAsAny   RunAsAny   false           configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
p-prometheus-node-exporter  false        RunAsAny  RunAsAny   MustRunAs  MustRunAs  false           configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim,hostPath
p-alertmanager              false        RunAsAny  RunAsAny   MustRunAs  MustRunAs  false           configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
p-operator                  false        RunAsAny  RunAsAny   MustRunAs  MustRunAs  false           configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
p-prometheus                false        RunAsAny  RunAsAny   MustRunAs  MustRunAs  false           configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim

==> v1/ConfigMap
NAME                         DATA  AGE
p-grafana-config-dashboards  1     6s
p-grafana                    1     6s
p-grafana-datasource         1     6s
p-etcd                       1     6s
p-grafana-coredns-k8s        1     6s
p-k8s-cluster-rsrc-use       1     6s
p-k8s-node-rsrc-use          1     6s
p-k8s-resources-cluster      1     6s
p-k8s-resources-namespace    1     6s
p-k8s-resources-pod          1     6s
p-nodes                      1     6s
p-persistentvolumesusage     1     6s
p-pods                       1     6s
p-statefulset                1     6s

==> v1beta1/RoleBinding
NAME       AGE
p-grafana  5s

==> v1/Prometheus
NAME          AGE
p-prometheus  4s

==> v1/ServiceMonitor
NAME                       AGE
p-alertmanager             4s
p-coredns                  4s
p-apiserver                4s
p-kube-controller-manager  4s
p-kube-etcd                4s
p-kube-scheduler           4s
p-kube-state-metrics       4s
p-kubelet                  4s
p-node-exporter            4s
p-operator                 4s
p-prometheus               4s

==> v1/ClusterRole
NAME                   AGE
p-grafana-clusterrole  6s
p-alertmanager         6s
p-operator             6s
p-operator-psp         6s
p-prometheus           6s
p-prometheus-psp       6s

==> v1/Role
NAME                 AGE
p-prometheus-config  6s
p-prometheus         4s
p-prometheus         4s

NOTES:
The Prometheus Operator has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "release=p"

Visit https://github.com/coreos/prometheus-operator for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
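Besides the pod check suggested in the NOTES, it is worth confirming that the Operator's CRD objects exist and that the PVCs created from the volumeClaimTemplates are Bound. A quick sanity check, assuming the release name p used above:

kubectl --namespace monitoring get pods -l "release=p"
kubectl -n monitoring get prometheus,alertmanager,servicemonitor,prometheusrule
kubectl -n monitoring get pvc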
There were a few pitfalls in this deployment; I list them here so you can watch out for them when configuring.
Because alertmanager and prometheus are both stateful StatefulSets, we use Gluster storage and let prometheus-operator create the PVCs automatically. If the chart's release name is too long, PVC creation fails, since the generated object names exceed Kubernetes's 63-character label limit. That is why the installation above pins the release name to p (helm install --name p --namespace monitoring ./prometheus-operator) and the configuration file overrides the chart name as well (nameOverride: "p"). With a longer release name, provisioning fails with an event like this:
Warning ProvisioningFailed 3s (x2 over 40s) persistentvolume-controller Failed to provision volume with StorageClass "gluster-heketi": failed to create volume: failed to create endpoint/service default/glusterfs-dynamic-72488422-3428-11e9-a74b-005056824bdc: failed to create endpoint: Endpoints "glusterfs-dynamic-72488422-3428-11e9-a74b-005056824bdc" is invalid: metadata.labels: Invalid value: "alertmanager-prom-alertmanager-db-alertmanager-prom-alertmanager-0": must be no more than 63 characters
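If provisioning fails this way, the event shows up on the PersistentVolumeClaim itself. A sketch of how to surface it (the PVC name is whatever the StatefulSet generated):

kubectl -n monitoring get pvc
kubectl -n monitoring describe pvc <pvc-name>   # the Events section shows ProvisioningFailed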
The second pitfall concerns the selectors: first look up the labels on the control-plane pods, then set the matching labels in values.yaml under the prometheus-operator directory.
[root@node-01 ~]# kubectl -n kube-system get pod --show-labels
NAME                              READY  STATUS   RESTARTS  AGE    LABELS
coredns-7f65654f74-6gxps          1/1    Running  8         5d22h  k8s-app=kube-dns,pod-template-hash=7f65654f74
etcd-node-01                      1/1    Running  1         32d    component=etcd,tier=control-plane
kube-controller-manager-node-01   1/1    Running  0         39h    component=kube-controller-manager,tier=control-plane
kube-scheduler-node-01            1/1    Running  0         23h    component=kube-scheduler,tier=control-plane
...
Note that you must edit the labels in values.yaml inside the prometheus-operator directory itself; you cannot override them by passing an external values file at install time. This may be a bug: with an external file the labels are appended to rather than overridden, and Prometheus ends up unable to scrape any data.
If you later modify values.yaml, update the prometheus-operator release with:
helm upgrade RELEASE_NAME ./prometheus-operator
(here the release name is p, so: helm upgrade p ./prometheus-operator)
To remove the deployment entirely, delete the release and then the CRDs:
helm del --purge RELEASE_NAME
kubectl -n monitoring delete crd prometheuses.monitoring.coreos.com
kubectl -n monitoring delete crd prometheusrules.monitoring.coreos.com
kubectl -n monitoring delete crd servicemonitors.monitoring.coreos.com
kubectl -n monitoring delete crd alertmanagers.monitoring.coreos.com
Once the deployment is up, open Prometheus in a browser at http://prom.cnlinux.club/targets. As the figure below shows, every target has data and is in the UP state.
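Target health can also be checked from the command line via Prometheus's HTTP query API. A minimal sketch; jq is assumed to be installed, and a value of "1" means the target is up:

curl -s 'http://prom.cnlinux.club/api/v1/query?query=up' \
  | jq '.data.result[] | {job: .metric.job, up: .value[1]}'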
Open Grafana in a browser at http://grafana.cnlinux.club/ to see monitoring dashboards for the various resources. The username is admin, and the password is the one set in values.yaml (adminPassword).
Open Alertmanager in a browser at http://alert.cnlinux.club/ to see the alert entries.
That completes the installation. The next post will cover monitoring custom services with Prometheus and configuring alerts in detail. If you have questions, feel free to leave a comment below. Please follow and like — thank you!