Manually Deploying Prometheus on Kubernetes

Date: 2019-12-10

Tags: manual deployment, k8s, prometheus

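Introduction

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a standalone open-source project, and in 2016 it joined the CNCF as the second hosted project after Kubernetes.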

Features

Compared with traditional monitoring tools, Prometheus has the following main features:

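A multi-dimensional data model, with time series identified by a metric name and key/value pairs
A flexible query language
No dependency on distributed storage; only the local disk is involved
Time series are collected by pulling over HTTP
Pushing time series is also supported
Targets are found through service discovery or static configuration
Support for multiple kinds of graphs and dashboards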

Components

Prometheus consists of several components, many of which are optional:

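Prometheus Server: scrapes metrics and stores the time series data
exporter: exposes metrics for scrape jobs to collect
pushgateway: a gateway that metrics can be pushed to
alertmanager: the component that handles alerts
ad hoc tools: used to query the data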

Most Prometheus components are written in Go, so they are easy to build and deploy as static binaries.

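Architecture

The diagram below, from the official Prometheus documentation, shows the architecture and some of the related ecosystem components: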

[Figure: Prometheus architecture]

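The overall flow is fairly simple. Prometheus scrapes metrics either directly from its targets or through an intermediate Pushgateway, stores all scraped samples locally, and runs rules over that data to produce aggregated time series or alerts. Grafana or other tools are then used to visualize the data.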

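Installation

Because Prometheus is written in Go, installing it is very simple: download the binary and run it. Go to https://prometheus.io/download and download the release that matches your platform.

Prometheus is started with a YAML configuration file. When running the binary directly, use the following command: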

$ ./prometheus --config.file=prometheus.yml

A basic prometheus.yml looks like this:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

This configuration file contains three sections: global, rule_files and scrape_configs.

The global section controls the global settings of the Prometheus server:

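scrape_interval: how often Prometheus scrapes metrics from its targets. It is set to 15s here (the built-in default is 1m), and the value can be overridden
evaluation_interval: how often rules are evaluated. Prometheus uses rules to produce new time series or to raise alerts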

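The rule_files section specifies where the rule files live. Prometheus loads rules from these locations and uses them to generate new time series or alerts. We have not configured any rules yet.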

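scrape_configs controls which resources Prometheus monitors. Because Prometheus exposes its own monitoring data over HTTP, it can also monitor its own health. The default configuration contains a single job named prometheus, which collects time series from the Prometheus server itself. This job has a single, statically configured target: port 9090 on localhost. By default Prometheus scrapes the /metrics path of each target, so this job collects metrics from http://localhost:9090/metrics. The collected time series describe the state and performance of the Prometheus server. Any other resources we want to monitor are configured under this same section.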
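To see what the default job collects, the same endpoint can be requested by hand. The following is a minimal sketch, assuming curl is installed and a Prometheus server is reachable at the given address; the function name fetch_metrics is only illustrative and not part of Prometheus.

```shell
# fetch_metrics TARGET [METRICS_PATH]
# Issues the same kind of HTTP GET that Prometheus performs when it scrapes a target.
# TARGET is host:port, as written under static_configs.targets.
# METRICS_PATH falls back to /metrics, the default path described above.
fetch_metrics() {
  target="$1"
  metrics_path="${2:-/metrics}"
  curl -fsS "http://${target}${metrics_path}"
}

# Example (needs a running Prometheus):
#   fetch_metrics localhost:9090 | head
```

The response is plain text with one sample per line, which makes it easy to check that a target exposes what we expect before adding it to scrape_configs.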

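Since we want to run Prometheus inside Kubernetes, we simply run it from the Docker image.

To keep things manageable, we install all resource objects into the kube-ops namespace. If the namespace does not exist yet, create it first.

To make the configuration file easy to manage, we store prometheus.yml in a ConfigMap (prometheus-cm.yaml):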
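Creating the namespace can be made repeatable with a small guard. This is a sketch that assumes kubectl is already configured for the target cluster; ensure_namespace is a hypothetical helper name.

```shell
# ensure_namespace NAME
# Creates the namespace only when it does not exist yet, so the call is safe to repeat.
ensure_namespace() {
  ns="$1"
  if kubectl get namespace "$ns" >/dev/null 2>&1; then
    echo "namespace ${ns} already exists"
  else
    kubectl create namespace "$ns"
  fi
}

# Example:
#   ensure_namespace kube-ops
```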

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

For now we only configure monitoring of Prometheus itself. Then create the resource object:

$ kubectl create -f prometheus-cm.yaml
configmap "prometheus-config" created

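The configuration file is now in place. Whenever new resources need to be monitored later, we only have to update this ConfigMap. Next we create the Prometheus Pod through a Deployment (prometheus-deploy.yaml):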

apiVersion: extensions/v1beta1  # removed in Kubernetes 1.16; newer clusters need apps/v1 plus a spec.selector
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-ops
  labels:
    app: prometheus
spec:
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.4.3
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-admin-api"  # enables the admin HTTP API, which includes operations such as deleting time series
        - "--web.enable-lifecycle"  # enables hot reload: a POST to localhost:9090/-/reload takes effect immediately
        ports:
        - containerPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          subPath: prometheus
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 100m
            memory: 512Mi
      securityContext:
        runAsUser: 0
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus
      - configMap:
          name: prometheus-config
        name: config-volume

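Besides pointing the program at prometheus.yml, we pass storage.tsdb.path to set where the TSDB data is stored and storage.tsdb.retention to set how long data is kept. The web.enable-admin-api flag enables access to the admin API. The web.enable-lifecycle flag is very important: it enables hot reloading, so that once prometheus.yml has been updated, sending an HTTP POST request to localhost:9090/-/reload makes the new configuration take effect immediately. Make sure this flag is present.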
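The reload request can be wrapped in a small helper. This is a sketch that assumes curl is available and that the server was started with --web.enable-lifecycle; the name prometheus_reload is illustrative.

```shell
# prometheus_reload [BASE_URL]
# Asks Prometheus to re-read its configuration file.
# The endpoint is only served when --web.enable-lifecycle is set,
# and it expects a POST (or PUT) request rather than a GET.
prometheus_reload() {
  base_url="${1:-http://localhost:9090}"
  curl -fsS -X POST "${base_url}/-/reload"
}

# Examples:
#   prometheus_reload                        # from inside the Pod
#   prometheus_reload http://<node-ip>:<node-port>
```

Because of -f, curl exits with a non-zero status when the server answers with an HTTP error, so the helper can be used directly in scripts.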

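The ConfigMap holding prometheus.yml is mounted into the Pod as a volume, so when the ConfigMap is updated, the file inside the Pod is updated as well (after a short sync delay). We then send the reload request described above and the new Prometheus configuration takes effect. In addition, to persist the time series data, the data directory is bound to a PVC, so that PVC has to be created in advance (prometheus-volume.yaml):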

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 10.151.30.57
    path: /data/k8s
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: kube-ops
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Here we simply use NFS as the storage backend and create one PV and one PVC:

$ kubectl create -f prometheus-volume.yaml
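Before creating the Deployment it is worth confirming that the claim has been bound to the volume. A minimal sketch, assuming kubectl access to the cluster; pvc_phase is a hypothetical helper name.

```shell
# pvc_phase [NAMESPACE] [CLAIM]
# Prints the phase of the claim. "Bound" means it has been matched to a PersistentVolume;
# "Pending" means no suitable volume has been found yet.
pvc_phase() {
  kubectl get pvc "${2:-prometheus}" -n "${1:-kube-ops}" -o jsonpath='{.status.phase}'
}

# Example:
#   pvc_phase kube-ops prometheus
```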

Beyond the points above, we also need to configure RBAC, because Prometheus has to read information from the Kubernetes API. For that we create a ServiceAccount named prometheus (prometheus-rbac.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-ops

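The resources we need to read may exist in any namespace, which is why we use a ClusterRole here. Note that the rules include a nonResourceURLs entry. It grants permissions on non-resource endpoints, in this case the /metrics path, and is something we have rarely needed before. Now create the objects: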

$ kubectl create -f prometheus-rbac.yaml
serviceaccount "prometheus" created
clusterrole.rbac.authorization.k8s.io "prometheus" created
clusterrolebinding.rbac.authorization.k8s.io "prometheus" created
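The binding can be spot-checked by asking the API server what the service account is allowed to do. This sketch assumes the current kubectl user may impersonate service accounts (cluster admins usually can); check_prometheus_rbac is an illustrative name.

```shell
# check_prometheus_rbac [NAMESPACE] [SERVICE_ACCOUNT]
# Uses impersonation to ask the API server whether the service account
# holds a few of the permissions granted by the ClusterRole above.
# Each call prints "yes" or "no".
check_prometheus_rbac() {
  sa="system:serviceaccount:${1:-kube-ops}:${2:-prometheus}"
  kubectl auth can-i list pods --all-namespaces --as="$sa"
  kubectl auth can-i watch endpoints --all-namespaces --as="$sa"
  # A leading slash makes can-i treat the argument as a non-resource URL.
  kubectl auth can-i get /metrics --as="$sa"
}

# Example:
#   check_prometheus_rbac kube-ops prometheus
```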

One more thing to watch: the securityContext with runAsUser set to 0 is required. The Prometheus image now runs as the user nobody, so without this setting the server fails with a permission denied error like the following:

level=error ts=2018-10-22T14:34:58.632016274Z caller=main.go:617 err="opening storage failed: lock DB directory: open /data/lock: permission denied"

Now we can create the Prometheus Deployment:

$ kubectl create -f prometheus-deploy.yaml
deployment.extensions "prometheus" created
$ kubectl get pods -n kube-ops
NAME                          READY     STATUS    RESTARTS   AGE
prometheus-6dd775cbff-zb69l   1/1       Running   0          20m
$ kubectl logs -f prometheus-6dd775cbff-zb69l -n kube-ops
......
level=info ts=2018-10-22T14:44:40.535385503Z caller=main.go:523 msg="Server is ready to receive web requests."

Once the Pod is running, we still need a Service object so that the Prometheus web UI can be reached from outside the cluster (prometheus-svc.yaml):

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: kube-ops
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: http

For easy testing we create a NodePort Service here. We could of course also create an Ingress object and access the UI through a domain name:

$ kubectl create -f prometheus-svc.yaml
service "prometheus" created
$ kubectl get svc -n kube-ops
NAME         TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.111.118.104   <none>        9090:30987/TCP   24s

We can now reach the Prometheus web UI at http://<any node IP>:30987.
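The NodePort is assigned by Kubernetes and differs between clusters, so it can be convenient to look it up instead of copying it from the table. A sketch under the assumption that kubectl and curl are available and that the Service is named as above; both helper names are illustrative.

```shell
# prometheus_node_port [NAMESPACE] [SERVICE]
# Prints the NodePort that Kubernetes assigned to the first port of the Service.
prometheus_node_port() {
  kubectl get svc "${2:-prometheus}" -n "${1:-kube-ops}" \
    -o jsonpath='{.spec.ports[0].nodePort}'
}

# prometheus_healthy NODE_IP
# Calls the /-/healthy endpoint of Prometheus through the NodePort.
prometheus_healthy() {
  node_ip="$1"
  port="$(prometheus_node_port)"
  curl -fsS "http://${node_ip}:${port}/-/healthy"
}

# Example:
#   prometheus_healthy <node-ip>
```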

[Figure: Prometheus web UI]

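For consistency, Prometheus uses UTC for all of its data, which is why the dashboard shows a warning about this by default; when querying, we have to select times relative to our own current time. We can then look at the targets currently being monitored: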

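Since we have not configured any alerting rules yet, the Alerts menu is still empty. After a short while, the Graph menu lets us inspect the metrics scraped from Prometheus itself. The entries under - insert metrics at cursor - are the metrics collected so far: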

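For example, select the scrape_duration_seconds metric and click Execute. If no data comes back, switch to the Graph tab, adjust the time so that it points to the current moment, and run the query again. A chart similar to the following should appear: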

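Besides using the collected metrics directly, we can also use the powerful PromQL. PromQL is the ad hoc query language that Prometheus provides for aggregating and displaying data: pick the function that matches what you want to know and apply it to your metrics.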
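PromQL expressions can also be evaluated outside the web UI through the Prometheus HTTP API. The following sketch assumes curl is installed and a server is reachable at the given base URL; promql_query is a hypothetical helper name.

```shell
# promql_query EXPR [BASE_URL]
# Runs an instant query against the Prometheus HTTP API and prints the JSON response.
# -G sends the data as URL query parameters; --data-urlencode escapes the expression.
promql_query() {
  query_expr="$1"
  base_url="${2:-http://localhost:9090}"
  curl -fsS -G --data-urlencode "query=${query_expr}" "${base_url}/api/v1/query"
}

# Examples:
#   promql_query 'scrape_duration_seconds'
#   promql_query 'avg_over_time(scrape_duration_seconds[5m])'
```

The second example shows the idea of applying a function to a metric: avg_over_time averages the samples of the last five minutes instead of returning only the latest value.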