Prometheus Operator Tutorial: Sharding Prometheus by Service Dimension

Date: 2021-06-21

Original article: https://fuckcloudnative.io/posts/aggregate-metrics-user-prometheus-operator/

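Prometheus itself only supports single-node deployment. It has no built-in clustering, no high availability, and no horizontal scaling, and its storage is limited by the capacity of the local disk. As the volume of collected data grows, the number of time series a single Prometheus instance can handle hits a ceiling. CPU and memory usage both rise, and memory usually becomes the bottleneck first. The main reasons are:

Prometheus flushes a block to disk only once every 2 hours. Until then all of that data stays in memory, so memory usage scales with the ingestion volume.
Loading historical data reads it from disk into memory, so the wider the query range, the more memory is used. There is some room for optimization here.
Unreasonable queries also increase memory usage, such as Group operations or Rate over a large range.

At this point you either add memory or shard the workload so that each instance collects fewer metrics. This article discusses how to split Prometheus instances deployed through Prometheus Operator by service dimension.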

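1. Splitting Prometheus by service dimension

Prometheus recommends splitting by function or service. When there are many services to scrape, each Prometheus instance is configured to collect and store the metrics of only one service or a subset of services. Splitting Prometheus into several instances along these lines also achieves a degree of horizontal scaling.

In a Kubernetes cluster, we can split Prometheus instances by namespace. For example, send all monitoring of the Kubernetes cluster components to one Prometheus instance and all other monitoring to another.

Prometheus Operator controls the deployment of Prometheus instances through a CRD named Prometheus. Its serviceMonitorNamespaceSelector and podMonitorNamespaceSelector fields take label selectors that restrict which namespaces targets are scraped from. Strictly speaking, the selectors match the namespaces in which ServiceMonitor and PodMonitor objects are looked up. For example, label the kube-system namespace with monitoring-role=system and label the other namespaces with monitoring-role=others.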

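2. Splitting the alerting rules

Once Prometheus is split into multiple instances, the default alerting rules can no longer be used as they are. They are written against the metrics of all targets, and no single Prometheus instance now sees all targets, so alerts would fire constantly. To solve this, split the alerting rules so that they correspond one-to-one with the service dimension of each Prometheus instance. Following the split described above, only two sets of alerting rules are needed. Give them different labels, then use the ruleSelector field of the Prometheus CRD to select the matching rules by label.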

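3. Centralized data storage

With alerting solved, one problem remains: the monitoring data is now scattered. Querying it from Grafana requires adding several data sources, data from different sources cannot be aggregated in one query, and the dashboards offer no global view, which makes querying confusing.

To solve this, we can relieve Prometheus of storage duties and have it send the collected samples to a remote storage adapter via Remote Write. Pointing the Grafana data source at the remote storage then gives a global view in Grafana. Here we choose VictoriaMetrics as the remote storage. VictoriaMetrics is a fast, cost-effective, scalable time series database that can serve as long-term storage for Prometheus. It comes in a single-node version and a cluster version, both open source. If the ingestion rate is below one million data points per second, the official recommendation is to use the single-node version rather than the cluster version. This article uses the single-node version for demonstration: both Prometheus instances remote-write to one VictoriaMetrics instance, and Grafana queries VictoriaMetrics.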
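To get a feel for the one-million-points-per-second guideline, a rough ingestion rate can be estimated as the number of active series divided by the scrape interval. The sketch below uses made-up inputs (500,000 active series and a 30-second interval are assumptions, not measurements from this cluster); on a real instance the `prometheus_tsdb_head_series` metric gives the active series count.

```shell
# Rough ingestion-rate estimate: samples/s ~= active series / scrape interval.
# Both inputs are hypothetical; substitute values from your own Prometheus.
active_series=500000      # assumed number of active time series
scrape_interval=30        # assumed scrape interval in seconds

samples_per_second=$(( active_series / scrape_interval ))
echo "approx. ${samples_per_second} samples/s"

# Compare against the one-million-points-per-second guideline quoted above.
if [ "$samples_per_second" -lt 1000000 ]; then
  echo "single-node VictoriaMetrics should be enough"
else
  echo "consider the cluster version"
fi
```

With these inputs the estimate is 16666 samples per second, far below the threshold, so the single-node version would be the natural choice.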

4. Hands-on

With the plan settled, let's put it into practice.

Deploying VictoriaMetrics

First deploy a single-instance VictoriaMetrics. The complete YAML is as follows:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: victoriametrics
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: victoriametrics
  name: victoriametrics
  namespace: kube-system
spec:
  serviceName: victoriametrics
  selector:
    matchLabels:
      app: victoriametrics
  replicas: 1
  template:
    metadata:
      labels:
        app: victoriametrics
    spec:
      nodeSelector:
        blog: "true"
      containers:
      - args:
        - --storageDataPath=/storage
        - --httpListenAddr=:8428
        - --retentionPeriod=1
        image: victoriametrics/victoria-metrics
        imagePullPolicy: IfNotPresent
        name: victoriametrics
        ports:
        - containerPort: 8428
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 120
          timeoutSeconds: 30
        resources:
          limits:
            cpu: 2000m
            memory: 2000Mi
          requests:
            cpu: 2000m
            memory: 2000Mi
        volumeMounts:
        - mountPath: /storage
          name: storage-volume
      restartPolicy: Always
      priorityClassName: system-cluster-critical
      volumes:
      - name: storage-volume
        persistentVolumeClaim:
          claimName: victoriametrics
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: victoriametrics
  name: victoriametrics
  namespace: kube-system
spec:
  ports:
  - name: http
    port: 8428
    protocol: TCP
    targetPort: 8428
  selector:
    app: victoriametrics
  type: ClusterIP

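A few startup flags deserve attention:

storageDataPath : path to the data directory. VictoriaMetrics stores all of its data in this directory.
retentionPeriod : retention period for the data, in months. Older data is deleted automatically. The default is 1 month.
httpListenAddr : TCP address to listen on for HTTP requests. By default it listens on port 8428 on all network interfaces.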

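Labeling the namespaces

To restrict the namespaces that targets are scraped from, we need to label the namespaces so that each Prometheus instance only scrapes metrics from specific ones. Following the plan above, label kube-system with monitoring-role=system: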

$ kubectl label ns kube-system monitoring-role=system

Label the other namespaces with monitoring-role=others. For example:

$ kubectl label ns monitoring monitoring-role=others
$ kubectl label ns default monitoring-role=others
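On a cluster with many namespaces, typing one command per namespace gets tedious. The following is a minimal sketch that prints the label commands for a list of namespaces instead of running them, so they can be reviewed first. The helper names `role_for_ns` and `print_label_cmds` are made up for this illustration, and the rule "kube-system is system, everything else is others" is simply the split used in this article.

```shell
# Map a namespace name to the monitoring-role value used in this article.
role_for_ns() {
  if [ "$1" = "kube-system" ]; then
    echo system
  else
    echo others
  fi
}

# Print (not run) one "kubectl label" command per namespace given as argument.
print_label_cmds() {
  for ns in "$@"; do
    echo "kubectl label ns $ns monitoring-role=$(role_for_ns "$ns") --overwrite"
  done
}

# Example with a fixed list; on a live cluster the list could come from:
#   kubectl get ns -o jsonpath='{.items[*].metadata.name}'
print_label_cmds kube-system monitoring default
```

After applying the labels, `kubectl get ns -L monitoring-role` shows the label as an extra column, which makes it easy to spot a namespace that was missed.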

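Splitting the PrometheusRule

The alerting rules need to be split into two PrometheusRule objects according to the monitoring targets. Concretely, consolidate the rules related to the kube-system namespace into one PrometheusRule and change its name and labels: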

# prometheus-rules-system.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: system
    role: alert-rules
  name: prometheus-system-rules
  namespace: monitoring
spec:
  groups:
...
...

Put the remaining rules into another PrometheusRule and change its name and labels:

# prometheus-rules-others.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: others
    role: alert-rules
  name: prometheus-others-rules
  namespace: monitoring
spec:
  groups:
...
...

Then delete the default PrometheusRule:

$ kubectl -n monitoring delete prometheusrule prometheus-k8s-rules

Create the two new PrometheusRule objects:

$ kubectl apply -f prometheus-rules-system.yaml
$ kubectl apply -f prometheus-rules-others.yaml

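If you are really unsure how to split the rules, or would rather not split them yourself and just want something ready-made, see here:

Splitting Prometheus

The next step is to split the Prometheus instance. According to the plan above it is split into two instances, one monitoring the kube-system namespace and the other monitoring the remaining namespaces: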

# prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: system 
  name: system
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://victoriametrics.kube-system.svc.cluster.local:8428/api/v1/write
      queueConfig:
        maxSamplesPerSend: 10000
  retention: 2h 
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.17.2
  nodeSelector:
    beta.kubernetes.io/os: linux
  podMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: system 
  podMonitorSelector: {}
  replicas: 1 
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  ruleSelector:
    matchLabels:
      prometheus: system 
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: 
    matchLabels:
      monitoring-role: system 
  serviceMonitorSelector: {}
  version: v2.17.2
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: others
  name: others
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://victoriametrics.kube-system.svc.cluster.local:8428/api/v1/write
      queueConfig:
        maxSamplesPerSend: 10000
  retention: 2h
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.17.2
  nodeSelector:
    beta.kubernetes.io/os: linux
  podMonitorNamespaceSelector: 
    matchLabels:
      monitoring-role: others 
  podMonitorSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  ruleSelector:
    matchLabels:
      prometheus: others 
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: others 
  serviceMonitorSelector: {}
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  version: v2.17.2

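Configuration points to note:

remoteWrite specifies the remote storage that samples are written to.
ruleSelector selects the PrometheusRule objects for the instance.
The memory limit is set to 2Gi. Adjust it to your own situation.
retention keeps data on the local disk for 2 hours. Because remote storage is configured, local data does not need to be kept for long, so keep this as short as possible.
Custom Prometheus configuration can be supplied through additionalScrapeConfigs on the others instance. You can of course split further and put it on other instances.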
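The additionalScrapeConfigs field in the manifest above refers to a Secret named additional-scrape-configs with the key prometheus-additional.yaml, so that Secret has to exist in the monitoring namespace. A minimal sketch of how it could be prepared follows; the scrape job and its target are hypothetical placeholders, not part of the original setup.

```shell
# Write an extra scrape job to the file name that the Prometheus spec refers to.
# The job name and target below are hypothetical; replace them with your own.
cat > prometheus-additional.yaml <<'EOF'
- job_name: example-static
  static_configs:
  - targets: ["example.default.svc:8080"]
EOF

# Store the file in a Secret. The key defaults to the file name, which matches
# "key: prometheus-additional.yaml" in the manifest. This step needs cluster
# access, so it is left commented out here:
# kubectl -n monitoring create secret generic additional-scrape-configs \
#   --from-file=prometheus-additional.yaml
```

Keeping the scrape config in a file and regenerating the Secret from it makes later changes easier to review than editing the Secret in place.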

Delete the default Prometheus instance:

$ kubectl -n monitoring delete prometheus k8s

Create the new Prometheus instances:

$ kubectl apply -f prometheus-prometheus.yaml

Check that they are running:

$ kubectl -n monitoring get prometheus
NAME     VERSION   REPLICAS   AGE
system   v2.17.2   1          29h
others   v2.17.2   1          29h

$ kubectl -n monitoring get sts
NAME                READY   AGE
prometheus-system   1/1     29h
prometheus-others   1/1     29h
alertmanager-main   1/1     25d

Check the memory usage of each Prometheus instance:

$ kubectl -n monitoring top pod -l app=prometheus
NAME                  CPU(cores)   MEMORY(bytes)
prometheus-others-0   12m          110Mi
prometheus-system-0   121m         1182Mi

Finally, the Prometheus Service also needs to be changed. The YAML (saved as prometheus-service.yaml) is as follows:

apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: system 
  name: prometheus-system
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    app: prometheus
    prometheus: system
  sessionAffinity: ClientIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: others
  name: prometheus-others
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    app: prometheus
    prometheus: others
  sessionAffinity: ClientIP

Delete the default Service:

$ kubectl -n monitoring delete svc prometheus-k8s

Create the new Services:

$ kubectl apply -f prometheus-service.yaml

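Updating the Grafana data source

With Prometheus split successfully, the last step is to change the Grafana data source to the VictoriaMetrics address. Grafana can then show a global view and run aggregate queries.

Open the Grafana settings page and change the data source to http://victoriametrics.kube-system.svc.cluster.local:8428:

Click the Explore menu:

Enter up in the query box, then press Shift+Enter to run the query:

The query result contains all of the namespaces.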
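The same check can be made without Grafana, since VictoriaMetrics serves a Prometheus-compatible query endpoint at /api/v1/query. The sketch below wraps it in a small helper; the function name `vm_query` and the use of a local port-forward are assumptions made for this illustration.

```shell
# Base URL of VictoriaMetrics. The localhost default assumes a port-forward:
#   kubectl -n kube-system port-forward svc/victoriametrics 8428:8428
VM_URL="${VM_URL:-http://localhost:8428}"

# Run an instant PromQL query against the Prometheus-compatible endpoint.
vm_query() {
  curl -s "${VM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example (needs the port-forward above to be running):
# vm_query 'count(up) by (namespace)'
```

If both Prometheus instances are writing successfully, the response should list namespaces scraped by the system instance as well as by the others instance.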

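I wrote this article because every node in my k3s cluster is short on resources and there are many monitoring targets, so Prometheus used up all of a node's memory and kept getting OOM-killed. To make full use of my cloud hosts I had to find another way, and that is how this article came about.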
