Prometheus Quick Start

Date: 2019-11-26

Tags: prometheus, quick start

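Prometheus is an open-source, metrics-based monitoring system. It has a simple yet powerful data model and query language for analyzing how applications behave. Prometheus was started in 2012, is written mainly in Go, and is released under the Apache 2.0 license. Many organizations now run it in production. In 2016 Prometheus became the second project hosted by the Cloud Native Computing Foundation (CNCF).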

Deploying Prometheus

Create the prometheus user

Download the release archive for your platform and extract it

hostname$ tar xf prometheus-2.10.0.linux-amd64.tar.gz
hostname$ mv prometheus-2.10.0.linux-amd64 /opt/prometheus

systemd unit file

hostname$ sudo vim  /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus instance
Wants=network-online.target
After=network-online.target
After=postgresql.service mariadb.service mysql.service

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
WorkingDirectory=/opt/prometheus/
RuntimeDirectory=prometheus
RuntimeDirectoryMode=0750
ExecStart=/opt/prometheus/prometheus  \
--storage.tsdb.retention=15d \
--config.file=/opt/prometheus/prometheus.yml  \
--web.max-connections=512  \
--web.read-timeout=5m  \
--storage.tsdb.path=/opt/data/prometheus \
--query.timeout=2m \
--query.max-concurrency=200
LimitNOFILE=10000
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target

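Startup flags explained

--web.read-timeout=5m  Maximum time to wait on a request connection before timing out. This keeps idle connections from piling up and holding resources.
--web.max-connections=512  Maximum number of simultaneous connections.
--storage.tsdb.retention=15d  How long collected samples are kept. Prometheus holds data in memory and on disk, so a long retention strains both, while a short one loses history. 15 days is a reasonable setting. (From Prometheus 2.7 on, this flag is deprecated in favor of --storage.tsdb.retention.time.)
--storage.tsdb.path=/opt/data/prometheus  Where the data is stored. Choose this deliberately: a careless location can fill up the root filesystem.
--query.timeout=2m --query.max-concurrency=200  Limits that keep too many users from querying at once and stop a single oversized query from running forever.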

Configuration file

# my global config
global:
  scrape_interval:     15s # How often to scrape targets. The default is 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    static_configs:
    - targets: ['192.168.48.130:9090']  # IP address of this host

/opt/prometheus/prometheus.yml

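Start the service (for example, systemctl start prometheus), then open port 9090 in a browser. Prometheus should be up and running.

Deploying node_exporter

The Prometheus community provides node_exporter to collect system metrics from monitored hosts. Download it and deploy it on the c1.heboan.com node.

Create the prometheus user

Download the release archive for your platform and extract it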

hostname$ tar xf node_exporter-0.18.1.linux-amd64.tar.gz
hostname$ mv node_exporter-0.18.1.linux-amd64 /opt/node_exporter

hostname$ sudo vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

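Once started from this unit file, node_exporter listens on port 9100 by default and serves metrics over HTTP.

$ curl http://c1.heboan.com:9100/metrics
....
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.   # this line describes what the metric means
# TYPE node_memory_MemFree_bytes gauge  # this line gives the metric's data type, here gauge
node_memory_MemFree_bytes 1.619521536e+09   # this line is the sample itself, a name/value pair

node_exporter collects many metrics, and each one comes with these three kinds of line.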
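The three kinds of line can be picked apart with a few lines of Python. This is only a minimal sketch, not an official parser: the function name parse_metric is made up for illustration, and it assumes samples carry no trailing timestamp (node_exporter output normally does not).

```python
SAMPLE = """\
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 1.619521536e+09
"""


def parse_metric(text, name):
    """Collect the HELP text, the TYPE and the samples of one metric."""
    info = {"help": None, "type": None, "samples": []}
    help_prefix = f"# HELP {name} "
    type_prefix = f"# TYPE {name} "
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(help_prefix):
            info["help"] = line[len(help_prefix):]
        elif line.startswith(type_prefix):
            info["type"] = line[len(type_prefix):]
        elif line and not line.startswith("#"):
            # Split on the last space: everything before it is the series,
            # everything after it is the value (assumes no timestamp).
            series, _, value = line.rpartition(" ")
            # Drop an optional {label="..."} part to compare bare names.
            if series.split("{", 1)[0] == name:
                info["samples"].append((series, float(value)))
    return info


print(parse_metric(SAMPLE, "node_memory_MemFree_bytes"))
```

Metrics with labels, such as per-core CPU counters, have several sample lines under one HELP/TYPE pair, which is why the samples are kept in a list.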

Once node_exporter is set up, it needs to be connected to Prometheus: edit prometheus.yml, then restart Prometheus.

scrape_configs:
  ...
  - job_name: 'aliyun'
    static_configs:
    - targets: ['c1.heboan.com:9100']   # several node_exporter addresses can be listed here

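Then open the Prometheus web UI. c1.heboan.com now shows up as a monitored target.

Follow the same steps to add c2.heboan.com.

Viewing the monitoring data

We have deployed node_exporter on c1.heboan.com to collect system metrics and connected it to Prometheus. We can now use the query language in the Prometheus web UI to retrieve the metrics we want.

For example, get the CPU usage of a monitored host over the last 5 minutes.

Formula: (1 - idle increase over 5 minutes / total increase over 5 minutes) * 100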
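To make the arithmetic concrete, here is a small worked example in Python. The counter readings are invented for illustration, and cpu_usage_percent is a hypothetical helper rather than anything Prometheus provides. Each dictionary stands for the cumulative CPU seconds per mode at one point in time.

```python
def cpu_usage_percent(start, end):
    """Apply (1 - idle increase / total increase) * 100 to two readings.

    start and end map a CPU mode to its cumulative seconds.
    """
    idle_delta = end["idle"] - start["idle"]
    total_delta = sum(end[mode] - start[mode] for mode in end)
    if total_delta <= 0:
        raise ValueError("no CPU time elapsed between the two readings")
    return (1 - idle_delta / total_delta) * 100


# Made-up readings taken 5 minutes apart on one host.
before = {"idle": 1000.0, "user": 200.0, "system": 100.0}
after = {"idle": 1150.0, "user": 320.0, "system": 130.0}

# Idle grew by 150 s out of 300 s in total, so usage is 50%.
print(cpu_usage_percent(before, after))
```

Working from increases rather than raw values matters because the counters only ever grow. The raw numbers describe time since boot, not the last 5 minutes.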

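First, look at all the CPU time the host has accumulated. CPU time is split into modes such as system, iowait, irq, user and idle. Added together they give the total running time. Note that these counters are reported per core.

Filter by label to keep only idle (idle time)

Compute the increase over 5 minutes

Because the counters are tracked per core, we need to add them up with sum

That does add them up, but across every core of every machine. What we need is the sum over the cores of each single machine, so we use by()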
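The steps above can be sketched as a chain of PromQL expressions, each wrapping the previous one. The Python below only assembles the strings so that they can be pasted into the Prometheus expression box one at a time. The helper name idle_increase_steps is hypothetical, and the expressions are reconstructed from the queries that appear elsewhere in this article.

```python
METRIC = "node_cpu_seconds_total"  # CPU time counter exported by node_exporter


def idle_increase_steps(window="5m"):
    """Return one PromQL expression per step, from raw counter to per-host sum."""
    raw = METRIC                                # every mode, every core
    idle = f'{raw}{{mode="idle"}}'              # keep only idle time
    grown = f"increase({idle}[{window}])"       # growth over the window
    summed = f"sum({grown})"                    # add up all cores of all hosts
    per_host = f"{summed} by (instance)"        # add up cores per host instead
    return [raw, idle, grown, summed, per_host]


for step in idle_increase_steps():
    print(step)
```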

That gives the 5-minute increase in idle (CPU idle) time. The increase in total CPU time is then

# no mode filter
sum(increase(node_cpu_seconds_total[5m])) by (instance)

Then plug both values into the formula.
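Putting the two sums into the formula gives the full expression. As a rough sketch, the Python below builds that expression and shows how it could be sent to the Prometheus HTTP API endpoint /api/v1/query using only the standard library. The function names are illustrative, and any server address passed in is an assumption about your setup.

```python
import json
import urllib.parse
import urllib.request


def cpu_usage_query(window="5m"):
    """Build the usage expression from the idle and total increases."""
    idle = f'sum(increase(node_cpu_seconds_total{{mode="idle"}}[{window}])) by (instance)'
    total = f"sum(increase(node_cpu_seconds_total[{window}])) by (instance)"
    return f"(1 - {idle} / {total}) * 100"


def instant_query(base_url, expr):
    """Run an instant query against a Prometheus server and decode the JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def usage_by_instance(payload):
    """Map each instance label to its value in an instant-vector response."""
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


print(cpu_usage_query())
```

Against the server from this article, a call would look like usage_by_instance(instant_query("http://192.168.48.130:9090", cpu_usage_query())). The same expression can also be typed straight into the web UI.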

Click Graph to see the trend as a chart.

These charts are temporary. To save them and come back to them at any time, use Grafana.

Deploying and using Grafana

Installing Grafana

# Download the package from the official site, for example grafana-6.2.5-1.x86_64.rpm
# Then install it locally
yum localinstall grafana-6.2.5-1.x86_64.rpm

# Start the service
systemctl start grafana-server

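Grafana listens on port 3000. Open the web UI. The default username and password are admin/admin, and after the first login you are asked to change the password. Once logged in, click "Add data source" and choose "prometheus".

Add a dashboard: go back to the Home Dashboard and click "New dashboard", then "Add Query".

Click the gear icon to open the dashboard settings and add variables.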

General

Define a few variables

$interval

$env

$node

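Save the dashboard and check the result.

Now let's chart CPU usage. Click the add_panel icon, then select "Choose Visualization".

For the data source, select prometheus, the one we configured earlier. The query is as follows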

 (1- sum(increase(node_cpu_seconds_total{mode="idle", instance=~"$node"}[5m]))/ sum(increase(node_cpu_seconds_total{instance=~"$node"}[5m]))) * 100

# $node is the variable we configured earlier; it matches each node

Visualization

General

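Finally, check the result.

Alerting with Alertmanager

Having metrics is not enough. When a metric goes wrong, a notification needs to go out, and that is the job of Alertmanager.

We decide in Prometheus under which conditions an alert should fire. Prometheus then raises the alert and sends it to Alertmanager, which groups and throttles incoming alerts before sending notifications.

First, configure the alerting rule on the Prometheus server.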

...
rule_files:
  - "first_rules.yml"

...

/opt/prometheus/prometheus.yml

first_rules.yml (the rule file referenced above)

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr:  1 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / sum(increase(node_cpu_seconds_total[5m])) by (instance) > 0.8
    for: 1m


# The alert fires when CPU usage over the last 5 minutes is above 80% and stays there for 1 minute
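The effect of "for: 1m" is easiest to see as a small state machine: the alert is pending as soon as the expression is true, and only becomes firing once it has been true without interruption for the whole hold period. The Python below is a simplified model of that behavior, not Prometheus code. The function name alert_states and the sample values are made up, with evaluations spaced 15 seconds apart to match evaluation_interval.

```python
def alert_states(evaluations, threshold=0.8, hold_seconds=60):
    """Return the alert state after each (timestamp, value) evaluation."""
    active_since = None  # time at which the condition last became true
    states = []
    for timestamp, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical CPU usage ratios, one evaluation every 15 seconds.
samples = [(0, 0.50), (15, 0.90), (30, 0.95), (45, 0.90),
           (60, 0.90), (75, 0.92), (90, 0.30)]
print(alert_states(samples))
```

In this run the condition first holds at 15 seconds, so the alert only reaches firing at 75 seconds, and a single low reading at 90 seconds sends it back to inactive.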

Run a CPU stress test. The host has two cores, so open two terminals on c1.heboan.com and run the following command in each.

time echo "scale=50000; 4*a(1)" | bc -l -q

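Check the chart.

The Prometheus web UI shows that the alert has fired.

To actually send notifications, such as email, we need Alertmanager. Here it is installed on the same server as Prometheus, but it can run anywhere as long as the network between the two is reachable.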

hostname$ tar xf alertmanager-0.17.0.linux-amd64.tar.gz 
hostname$ mv alertmanager-0.17.0.linux-amd64 /opt/alertmanager

/opt/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: 'sellsa@qq.com'
  smtp_auth_username: 'sellsa@qq.com'
  smtp_auth_password: '<mailbox authorization code>'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'


receivers:
- name: 'email'
  email_configs:
  - to: 'heboan@qq.com'

Start Alertmanager. It listens on port 9093.

cd /opt/alertmanager
./alertmanager --config.file="alertmanager.yml"

In prometheus.yml, configure which Alertmanager the alerts are pushed to.

...
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

...

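Finally, run the CPU stress test again and the alert email arrives.

The alert message is not very detailed and carries no description. We can edit first_rules.yml to add some information.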

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr:  1 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / sum(increase(node_cpu_seconds_total[5m])) by (instance) > 0.8
    for: 1m
    annotations:
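      # The expression aggregates by (instance), so $labels.job renders empty here
      description: '{{ $labels.instance }} of job {{ $labels.job }}: CPU usage over 5 minutes has been above 80% for 1 minute'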
      summary: 'Instance {{ $labels.instance }}'