The Beauty of Monitoring: Dynamic Management of Prometheus Configuration Files

Date: 2021-05-12

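Prometheus is an open-source monitoring and alerting solution originally developed at SoundCloud. Work on the code began in 2012, and since it was open-sourced in 2015 the project has had a very active community of developers and users; it currently has more than 11,000 stars on GitHub. In 2016 Prometheus became the second project, after Kubernetes, to join the CNCF (Cloud Native Computing Foundation).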

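Google's SRE book mentions Prometheus as an open-source implementation similar to their BorgMon monitoring system. As a new-generation open-source solution, many of its ideas coincide with Google's SRE approach to operations. Today its most common use is alongside the Kubernetes container management system, but do not mistake it for a container-only monitoring tool: once you get to know it well, you will find it can do a great deal more.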

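A side note: I spent a long time torn between Prometheus and Open-Falcon. Both are excellent new-generation monitoring solutions. Open-Falcon was open-sourced by Xiaomi and is currently used by Xiaomi, Kingsoft Cloud, Meituan, JD Finance, Ganji.com and others. The biggest difference is that Prometheus pulls data while Open-Falcon has data pushed to it; I will leave the pros and cons of the two approaches aside for now. Briefly, I like Prometheus for roughly five reasons: 1. it works out of the box and is very easy to deploy and operate; 2. its community is very active; 3. it has built-in service discovery; 4. its simple text format makes secondary development very convenient; 5. most importantly, I really like its alerting component, which supports grouping, alert inhibition, and silencing. None of this is meant to disparage Open-Falcon; as the old saying goes, whatever suits you best is the best choice.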

Automatically refreshing configuration files with consul-template

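Because Prometheus monitors actively by pulling, the list of monitored nodes has to be specified on the server side. As the number of monitored nodes grows, editing the configuration file every time a node is added becomes very tedious, so I use consul-template plus Consul to generate the configuration dynamically. The same approach works for any other service whose configuration files change frequently. An alternative is etcd plus confd; today's mainstream dynamic configuration systems fall largely into these two camps. consul-template fills much the same role as confd, except that it is the template system released by the Consul project itself.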

Implementation

First, look at a sample Prometheus configuration file:

- job_name: 'node-exporter'
  static_configs:
    - targets: ['172.30.100.10:9100']
      labels:
        hostname: 'web1'
    - targets: ['172.30.100.11:9100']
      labels:
        hostname: 'web2'
    - targets: ['172.30.100.12:9100']
      labels:
        hostname: 'web3'

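Each time a monitored node is added, you only need to add a new targets entry; "hostname" is a custom label I defined to tell the nodes apart. This raises a problem: once there are hundreds or thousands of targets, the configuration file becomes extremely bloated. An experienced operations engineer would think of an include mechanism, pulling in other configuration files so that one large, bloated main file is split into many small ones. Prometheus's way of doing this is file-based service discovery, "file_sd_config". Under the Prometheus directory I created a conf.d directory containing two sub-configuration files, which monitor the Linux and Windows machines respectively.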

[Figure: file_sd_config reference example]

The sub-configuration files can be in YAML or JSON format. I use JSON here; an example follows:

cat conf.d/lnode-discovery.json
[
  {
    "targets": ["172.30.100.2:9100"],
    "labels": {
      "hostname": "consul02"
    }
  },
  {
    "targets": ["172.30.100.1:9100"],
    "labels": {
      "hostname": "consul01"
    }
  }
]
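
The shape of such a file can also be produced and checked programmatically. The following is a minimal Python sketch, not part of the original setup: the helper names `build_target_groups` and `validate_target_groups` are hypothetical, and the checks cover only the structure visible in the sample above (a list of objects, each with a "targets" list and a "labels" map of strings).

```python
import json


def build_target_groups(hosts):
    """Turn a {hostname: "ip:port"} mapping into a list of target groups."""
    groups = []
    for hostname, address in sorted(hosts.items()):
        groups.append({"targets": [address], "labels": {"hostname": hostname}})
    return groups


def validate_target_groups(groups):
    """Check the structure used by the sample file; raise ValueError if it is off."""
    if not isinstance(groups, list):
        raise ValueError("top level must be a list")
    for group in groups:
        if not isinstance(group, dict):
            raise ValueError("each target group must be an object")
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            raise ValueError("'targets' must be a list of strings")
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
        ):
            raise ValueError("'labels' must map strings to strings")
    return True


# Hosts taken from the sample file above.
hosts = {"consul01": "172.30.100.1:9100", "consul02": "172.30.100.2:9100"}
groups = build_target_groups(hosts)
validate_target_groups(groups)
rendered = json.dumps(groups, indent=2)
print(rendered)
```

Writing `rendered` to a file under conf.d would give a file of the same form as the hand-written sample; in the rest of this article that job is done by consul-template instead.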

Updating the file dynamically with service discovery

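With sub-configuration files in place, adding a monitored node only requires changing the contents of a sub-configuration file. We can define a template for the sub-configuration file in advance and have consul-template render it, so that the file is updated dynamically. The steps are as follows: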

1. Download consul-template

Find the build for your operating system at https://releases.hashicorp.com/consul-template/, then download and unpack it:

# cd /data/consul_template  # software installation directory
# wget -c https://releases.hashicorp.com/consul-template/0.19.3/consul-template_0.19.3_linux_amd64.zip
# unzip consul-template_0.19.3_linux_amd64.zip
# mkdir templates  # create the directory for consul-template template files

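consul-template inherits Consul's minimalist style: after unpacking there is just a single binary. We create a directory for the template files so it is ready for later use.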

2. Create the consul-template configuration file

The configuration file follows the HashiCorp Configuration Language (HCL) format. My configuration file looks like this:

# cat consul-template.conf
log_level = "warn"
syslog {
  # This enables syslog logging.
  enabled = true
  # This is the name of the syslog facility to log to.
  facility = "LOCAL5"
}
consul {
  # auth {
  #   enabled = true
  #   username = "test"
  #   password = "test"
  # }
  address = "172.30.100.45:8500"
  # token = "abcd1234"
  retry {
    enabled = true
    attempts = 12
    backoff = "250ms"
    # If max_backoff is set to 10s and backoff is set to 1s, sleep times
    # would be: 1s, 2s, 4s, 8s, 10s, 10s, ...
    max_backoff = "3m"
  }
}
# This block defines the configuration for a template. Unlike other blocks,
# this block may be specified multiple times to configure multiple templates.
template {
  # This is the source file on disk to use as the input template. This is often
  # called the "Consul Template template". This option is required if not using
  # the `contents` option.
  # source = "/path/on/disk/to/template.ctmpl"
  source = "/data/consul_template/templates/lnode-discovery.ctmpl"
  # This is the destination path on disk where the source template will render.
  # If the parent directories do not exist, Consul Template will attempt to
  # create them.
  # destination = "/path/on/disk/where/template/will/render.txt"
  destination = "/data/prometheus/prometheus-1.7.1.linux-amd64/conf.d/lnode-discovery.json"
  # This is the optional command to run when the template is rendered. The
  # command will only run if the resulting template changes. The command must
  # return within 30s (configurable), and it must have a successful exit code.
  # Consul Template is not a replacement for a process monitor or init system.
  command = ""
  # This is the maximum amount of time to wait for the optional command to
  # return. Default is 30s.
  command_timeout = "60s"
  # This option backs up the previously rendered template at the destination
  # path before writing a new one. It keeps exactly one backup. This option is
  # useful for preventing accidental changes to the data without having a
  # rollback strategy.
  backup = true
  # Template delimiters, changed from the default "{{" and "}}".
  left_delimiter = "{$"
  right_delimiter = "$}"
  # This is the `minimum(:maximum)` to wait before rendering a new template to
  # disk and triggering a command, separated by a colon (`:`). If the optional
  # maximum value is omitted, it is assumed to be 4x the required minimum value.
  # This is a numeric time with a unit suffix ("5s"). There is no default value.
  # The wait value for a template takes precedence over any globally-configured
  # wait.
  wait {
    min = "2s"
    max = "20s"
  }
}
template {
  source = "/data/consul_template/templates/wnode-discovery.ctmpl"
  destination = "/data/prometheus/prometheus-1.7.1.linux-amd64/conf.d/wnode-discovery.json"
  command = ""
  backup = true
  command_timeout = "60s"
  left_delimiter = "{$"
  right_delimiter = "$}"
  wait {
    min = "2s"
    max = "20s"
  }
}

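Main configuration parameters:

syslog: enables syslog, so that the service logs are recorded there.

consul: the address of the Consul service goes here. My setup needs no authentication, so auth is commented out. For setting up Consul itself, see my earlier article. The backoff and max_backoff options deserve a mention: backoff sets the base interval. When no data is obtained from Consul, consul-template retries, doubling the interval each time. For example, with backoff set to 250ms and 5 retries, the intervals are 250ms, 500ms, 1s, 2s, 4s, until the max_backoff threshold is reached. If max_backoff is set to 2s, the fifth retry still waits 2s, giving 250ms, 500ms, 1s, 2s, 2s.

template: defines the template files. The main options are source, destination, and command. When backup = true, the previous configuration is backed up with a bak suffix.

source: the consul-template template file, i.e. the source file to be rendered.
destination: where the rendered output is written. Here that is the sub-configuration file used by Prometheus file-based discovery, under /data/prometheus/prometheus-1.7.1.linux-amd64/conf.d/.
command: the command to run after the file has been rendered successfully. Prometheus detects changes to the file by itself, so no command is needed here and I left it empty. Services such as nginx or haproxy usually need a restart or reload after their configuration changes; for those you can set a command such as "nginx -s reload".
command_timeout: the timeout for the command above.
left_delimiter and right_delimiter: the delimiters used in the template file. The default is "{{ }}"; change it here if it conflicts with something else. In my case the template files are pushed out with Ansible, and "{{ }}" clashes with Jinja2 syntax, so I changed the delimiters to "{$ $}".

When several templates need to be rendered, multiple template blocks can be written.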
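
The retry intervals described for backoff and max_backoff can be checked with a few lines of Python. This is only a sketch of the arithmetic as described above (double the interval each time, cap it at max_backoff), not consul-template's actual implementation; the function name `backoff_schedule` is hypothetical.

```python
def backoff_schedule(base_ms, attempts, max_ms=None):
    """Return the wait before each retry, in milliseconds.

    The interval starts at base_ms and doubles on every retry.
    If max_ms is given, no interval exceeds it.
    """
    delays = []
    for attempt in range(attempts):
        delay = base_ms * (2 ** attempt)
        if max_ms is not None:
            delay = min(delay, max_ms)
        delays.append(delay)
    return delays


# backoff = "250ms", five retries, no cap reached
uncapped = backoff_schedule(250, 5)
# same settings with max_backoff = "2s"
capped = backoff_schedule(250, 5, max_ms=2000)
print(uncapped)
print(capped)
```

With the values from the configuration file above (backoff of 250ms, max_backoff of 3m, 12 attempts), the same function shows the interval reaching the 3-minute cap on the last two retries.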

3. Start the service

Start the consul-template service, specifying the configuration file.

# ./consul-template -config ./consul-template.conf

4. Template rendering

Write the consul-template template to match the format of the target file. For example, my template for Prometheus file-based service discovery is:

cat templates/lnode-discovery.ctmpl
[
{$ range tree "prometheus/linux" $}
  {
    "targets": ["{$ .Value $}"],
    "labels": {
      "hostname": "{$ .Key $}"
    }
  },
{$ end $}
  {
    "targets": ["172.30.100.1:9100"],
    "labels": {
      "hostname": "consul01"
    }
  }
]

The template loops over the entries under the prometheus/linux/ directory of Consul's K/V store: "targets" takes each entry's Value, and hostname takes its Key.

A sample of the Consul K/V store is shown below. Each entry you add corresponds to one "hostname:targets" pair in the Prometheus configuration file:

[Figure: Consul K/V example]

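A small trick here: in the JSON sub-configuration file, multiple target groups are separated by commas, and the last one must not be followed by a comma. So I wrote one fixed target group in the template after the loop, which means this special case never needs handling.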
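
To see why the fixed last entry matters, the rendering can be imitated in Python. This sketch does not run consul-template; the `render` function is a hypothetical stand-in that reproduces what the template does (one object plus a trailing comma per K/V pair, then the fixed entry), and the K/V pair used is the consul02 entry from the earlier sample.

```python
import json

# The entry written by hand after the loop in the template.
FIXED_ENTRY = {"targets": ["172.30.100.1:9100"], "labels": {"hostname": "consul01"}}


def render(kv_pairs, fixed_entry=None):
    """Imitate the template: each K/V pair becomes an object followed by a comma."""
    parts = ["["]
    for key, value in kv_pairs:
        entry = {"targets": [value], "labels": {"hostname": key}}
        parts.append(json.dumps(entry) + ",")
    if fixed_entry is not None:
        parts.append(json.dumps(fixed_entry))
    parts.append("]")
    return "\n".join(parts)


kv_pairs = [("consul02", "172.30.100.2:9100")]

# With the fixed entry, the output is valid JSON.
parsed = json.loads(render(kv_pairs, FIXED_ENTRY))

# Without it, the comma after the last loop item makes the JSON invalid.
try:
    json.loads(render(kv_pairs))
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False

print(len(parsed), trailing_comma_ok)
```

The fixed entry also means the file stays a valid, non-empty list when the K/V directory holds no keys at all.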

5. Add data online to update the configuration file dynamically

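Now open the Consul UI (port 8500 by default) and add consul02, consul03, ... under the prometheus/linux/ directory in KEY/VALUE. The generated configuration file then has the same format as the conf.d/lnode-discovery.json sample shown earlier.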

At this point, Prometheus file-based service discovery is working in its initial form.
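
Instead of the UI, the entries in step 5 could also be added through Consul's K/V HTTP endpoint (a PUT to /v1/kv/<key> with the value as the request body). The following Python sketch assumes the Consul address from the configuration file above and no ACL token; the helper names and the consul03 address 172.30.100.3:9100 are made up for illustration.

```python
import urllib.request
from urllib.parse import quote

CONSUL_ADDR = "http://172.30.100.45:8500"  # address from the consul block above
KV_PREFIX = "prometheus/linux"             # directory read by the template


def build_kv_put(hostname, target, consul_addr=CONSUL_ADDR):
    """Build the PUT request that stores hostname -> target under KV_PREFIX."""
    key = quote(f"{KV_PREFIX}/{hostname}")  # slashes are kept, other characters escaped
    url = f"{consul_addr}/v1/kv/{key}"
    return urllib.request.Request(url, data=target.encode("utf-8"), method="PUT")


def register_node(hostname, target):
    """Send the request; Consul answers with the body "true" on success."""
    with urllib.request.urlopen(build_kv_put(hostname, target), timeout=5) as resp:
        return resp.read().decode("utf-8").strip() == "true"


# Build (but do not send) the request for a hypothetical new node.
request = build_kv_put("consul03", "172.30.100.3:9100")
print(request.get_method(), request.full_url)
```

Calling `register_node("consul03", "172.30.100.3:9100")` against a reachable Consul server would create the key, after which consul-template re-renders the sub-configuration file on its own.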