Ganglia Aggregate Monitoring: Setup and Configuration Explained

Date: 2020-07-06

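To linuxidc.com, linuxso.com, and other copy-and-paste editors:

As an open-source enthusiast and contributor, and in the spirit of open source, I do not object to anyone reposting my articles; I write them to share and exchange knowledge with you. I do hope, however, that you will honor the same spirit and credit the original author and source when you repost. Please do not label the source as Linux公社. Even the GPL that Linux is released under relies on copyright protection. I trust you would not claim that the Linux kernel source code comes from Linux公社, so why apply a double standard to other content?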

#---------------------------------

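Ganglia is an open-source project that came out of the Millennium Project at the University of California, Berkeley, and is distributed under the BSD license. It is software for aggregate monitoring of clusters. Unlike the better-known Cacti, which monitors the state of each server in a cluster in detail, Ganglia aggregates the data from the servers in a cluster and monitors the aggregate. Cluster-wide load problems that are hard to spot in Cacti or Zabbix can show up clearly in Ganglia. I personally think its cluster heatmap is a real highlight: one glance tells you the load across the cluster. The name translates as "nerve center", which sums it up neatly.

The rest of this article has three parts: building and initially configuring Ganglia, deploying the web front end, and configuring grouped monitoring.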

Part 1. Building and configuring Ganglia

1. Basic concepts of Ganglia

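Ganglia has a server side and a client side; after building, the binaries are gmetad (the server) and gmond (the client). There is only one server, and every monitored machine runs the client. Interestingly, Ganglia uses IPv4 class D (multicast) addresses to exchange data, presumably to get one-to-many delivery and save bandwidth. Each gmond announces its metrics to a multicast address, so a single packet reaches every gmond listening on that address and port, and every node in a group ends up holding the state of the whole group. (With unicast, the same data would have to be sent once per receiver, and in a large cluster that adds up to considerable bandwidth. Ganglia was designed for large-scale HPC clusters, so spending a lot of bandwidth and CPU on handling network traffic would go against its design.) gmetad then polls a gmond over TCP, receives an XML description of the group, and stores the data in RRD databases, from which the web interface draws its graphs.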
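A class D address is any IPv4 address whose first octet is between 224 and 239. The following is a minimal sketch for checking that an address you plan to put in the configuration is in that range. The function name is_multicast_ipv4 is made up for this example, and the check assumes a well-formed dotted-quad address.

```shell
#!/bin/sh
# Report whether an IPv4 address is in the class D (multicast) range
# 224.0.0.0 to 239.255.255.255, i.e. its first octet is 224..239.
# Assumes the input is a well-formed dotted-quad address.
is_multicast_ipv4() {
    first_octet=${1%%.*}
    if [ "$first_octet" -ge 224 ] && [ "$first_octet" -le 239 ]; then
        echo yes
    else
        echo no
    fi
}

is_multicast_ipv4 239.2.11.71   # the multicast address used in this article
is_multicast_ipv4 192.168.1.28  # an ordinary unicast address
```

The first call prints yes and the second prints no.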

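2. Building Ganglia

Building is somewhat involved because there are quite a few dependencies: mainly rrdtool (familiar to anyone who has used Cacti), expat, confuse, python, the apr development package, and PCRE.

Ganglia is built in one of two ways: as the server or as the client.

I won't go into the details; here are the scripts, which you can copy and paste. They assume you have already built and installed rrdtool under /opt/rrdtool-1.4.5. If yours is somewhere else, adjust the path in the script.

Server script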

#!/bin/sh
# Build dependencies available from the distribution repositories
yum install -y expat expat-devel pcre pcre-devel
# APR, built from source (installs under /usr/local/apr by default)
wget http://mirror.bit.edu.cn/apache/apr/apr-1.4.6.tar.gz
tar zxf apr-1.4.6.tar.gz
cd apr-1.4.6
./configure && make && make install
cd ..
# libconfuse, built as position-independent code
wget http://download.savannah.gnu.org/releases/confuse/confuse-2.7.tar.gz
tar zxf confuse-2.7.tar.gz
cd confuse-2.7
./configure CFLAGS=-fPIC --disable-nls && make && make install
cd ..
wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.3.1/ganglia-3.3.1.tar.gz
tar zxf ganglia-3.3.1.tar.gz
cd ganglia-3.3.1
# Server build (includes gmetad)
./configure --prefix=/opt/modules/ganglia --with-static-modules --enable-gexec --enable-status --with-gmetad --with-python=/usr --with-librrd=/opt/rrdtool-1.4.5 --with-libexpat=/usr --with-libconfuse=/usr/local --with-libpcre=/usr/local
# Client build, kept here for reference
#./configure --prefix=/opt/modules/ganglia --enable-gexec --enable-status --with-python=/usr --with-libapr=/usr/local/apr/bin/apr-1-config --with-libconfuse=/usr/local --with-libexpat=/usr --with-libpcre=/usr
make && make install
cd gmetad
cp gmetad.conf /opt/modules/ganglia/etc/
cp gmetad.init /etc/init.d/gmetad
# Point the init script at the custom install prefix
sed -i "s|^GMETAD=/usr/sbin/gmetad|GMETAD=/opt/modules/ganglia/sbin/gmetad|" /etc/init.d/gmetad
chkconfig --add gmetad
# Route the Ganglia multicast address through the internal NIC
ip route add 239.2.11.71 dev eth1
service gmetad start

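Because I installed under /opt/modules/ganglia, the script uses sed to replace the daemon path in the gmetad init script. The route is required: the ip route command sends traffic for the multicast address through the internal network interface (eth1 here). Add the same route on the gmond hosts.

Client script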

#!/bin/sh
# Build dependencies available from the distribution repositories
yum install -y expat expat-devel pcre pcre-devel
# APR, built from source (installs under /usr/local/apr by default)
wget http://mirror.bit.edu.cn/apache/apr/apr-1.4.6.tar.gz
tar zxf apr-1.4.6.tar.gz
cd apr-1.4.6
./configure && make && make install
cd ..
# libconfuse, built as position-independent code
wget http://download.savannah.gnu.org/releases/confuse/confuse-2.7.tar.gz
tar zxf confuse-2.7.tar.gz
cd confuse-2.7
./configure CFLAGS=-fPIC --disable-nls && make && make install
cd ..
wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.3.1/ganglia-3.3.1.tar.gz
tar zxf ganglia-3.3.1.tar.gz
cd ganglia-3.3.1
# Server build, kept here for reference
#./configure --prefix=/opt/modules/ganglia --with-static-modules --enable-gexec --enable-status --with-gmetad --with-python=/usr --with-librrd=/opt/rrdtool-1.4.5 --with-libexpat=/usr --with-libconfuse=/usr/local --with-libpcre=/usr/local
# Client build (gmond only, no rrdtool needed)
./configure --prefix=/opt/modules/ganglia --enable-gexec --enable-status --with-python=/usr --with-libapr=/usr/local/apr/bin/apr-1-config --with-libconfuse=/usr/local --with-libexpat=/usr --with-libpcre=/usr
make && make install
cd gmond
# Generate a default configuration file from the freshly built gmond
./gmond -t > /opt/modules/ganglia/etc/gmond.conf
cp gmond.init /etc/init.d/gmond
# Point the init script at the custom install prefix
sed -i "s|^GMOND=/usr/sbin/gmond|GMOND=/opt/modules/ganglia/sbin/gmond|" /etc/init.d/gmond
chkconfig --add gmond
# Route the Ganglia multicast address through the internal NIC
ip route add 239.2.11.71 dev eth1
service gmond start

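The client does not need rrdtool, and its configuration file has to be generated with a command (gmond -t).

Install the server on one machine only, and push the client script out to the other machines so that each one installs it.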
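One way to push the client script out is a loop over a host list. The sketch below is a dry run only: it prints the scp and ssh commands instead of running them. The file name install_gmond.sh, the function name print_deploy_commands, and root SSH access to each host are assumptions made for this example, not part of the original setup.

```shell
#!/bin/sh
# Dry run: print the commands that would copy the client script to each
# host and run it there. Nothing is executed remotely.
# Hypothetical names: install_gmond.sh (the client script saved to a file)
# and root SSH access to every host.
print_deploy_commands() {
    script=$1
    while read -r host; do
        # Skip blank lines and comment lines in the host list
        case $host in
            ''|'#'*) continue ;;
        esac
        echo "scp $script root@$host:/tmp/$script"
        echo "ssh root@$host sh /tmp/$script"
    done
}

# Example host list, using addresses that appear in this article
print_deploy_commands install_gmond.sh <<'EOF'
192.168.1.43
# database group
192.168.1.51
EOF
```

Once the printed commands look right, the output can be piped to sh to actually run them.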

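Part 2. Deploying the web front end

This is just a basic nginx and PHP environment. All the PHP files are in the web folder of the Ganglia source tree. Go into the source directory, copy the whole web folder out, and point nginx at the copy.

The first visit to the web interface may produce errors, because a few files need to be edited so that their paths point to the right places. One of them is conf_default.php; here is my diff: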

[root@portal-lc-209 html]# diff conf_default.php conf_default.php.in
29c29
< $conf['gmetad_root'] = "/opt/modules/ganglia/html";
---
> $conf['gmetad_root'] = "@varstatedir@/ganglia";
46c46
< $conf['rrdtool'] = "/opt/rrdtool-1.4.5/bin/rrdtool";
---
> $conf['rrdtool'] = "/usr/bin/rrdtool";

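I think a few other files need changes as well, but it was too long ago to remember exactly; probably eval_conf.php and header.php. Just follow the error messages. None of it is difficult.

Part 3. Grouped cluster deployment

There are many articles online about installing and configuring Ganglia, but few cover grouping. Grouping matters: with the default configuration, Ganglia puts everything into a single grid, one big cluster with no groups. Real server clusters, however, serve different functions, with each group handling its own tasks. Lumping them all together is messy and makes hosts hard to identify, so grouping is needed.

Grouping in Ganglia is actually simple: it is done by port. Give each group its own listening port and you are done.

My gmetad.conf is configured like this: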

gmetad

data_source "Namenode" 192.168.1.28:8653
data_source "Datanode" 192.168.1.27:8649
data_source "Portal" 192.168.1.43:8650
data_source "Collector" 192.168.1.35:8651
data_source "DB" 192.168.1.51:8652

gridname "Hadoop"
rrd_rootdir "/opt/modules/ganglia/html/rrds"
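# Path where the RRD data files are stored. The web interface reads them from here, so the location is fixed; it is best kept under the web folder, with the correct permissions.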
case_sensitive_hostnames 0

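There are five data sources, and each one is the lead node of its group. A lead node does not need gmetad unless you want multi-level collection. Each lead only needs to be assigned a different port. You may ask: the IPs already differ, so can't the ports be the same? No. The IP in data_source is only the unicast address that gmetad connects to in order to fetch a group's data, whereas the gmond daemons exchange their metrics on the multicast address, and there is only one multicast address here. With the same multicast address and the same port, the groups' traffic would mix. If you need multi-level gmetad, you can configure several multicast addresses in the client configuration.

The client configuration is a little more involved. I'll show only the parts that need changing; everything else can stay at the defaults.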
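Since every group shares one multicast address, a port used twice in gmetad.conf would merge two groups. The following sketch reads gmetad.conf text on standard input and prints any port that appears in more than one data_source line. The function name duplicate_ports is made up for this example, and it assumes the last field of each data_source line is a single host:port pair, as in the file above.

```shell
#!/bin/sh
# Print every port that more than one data_source line uses.
# Assumes the last field of each data_source line is host:port.
duplicate_ports() {
    awk '$1 == "data_source" {
             n = split($NF, parts, ":")   # last field is host:port
             print parts[n]
         }' | sort | uniq -d
}

# Sample input in which Portal and DB were given the same port by mistake
duplicate_ports <<'EOF'
data_source "Portal" 192.168.1.43:8650
data_source "Collector" 192.168.1.35:8651
data_source "DB" 192.168.1.51:8650
gridname "Hadoop"
EOF
```

For the sample input this prints 8650. Against the real file it could be run as duplicate_ports < /opt/modules/ganglia/etc/gmetad.conf, and no output means every group has its own port.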

gmond

cluster {
    name = "Portal"
# Matches "Portal" in gmetad.conf; the name must be spelled exactly right.
    owner = "unspecified"
    latlong = "unspecified"
    url = "unspecified"
}
/* Feel free to specify as many udp_send_channels as you like.    Gmond
     used to only support having a single channel */
udp_send_channel {
    #bind_hostname = yes # Highly recommended, soon to be default.
                                             # This option tells gmond to use a source address
                                             # that resolves to the machine's hostname.    Without
                                             # this, the metrics may appear to come from any
                                             # interface and the DNS names associated with
                                             # those IPs will be used to create the RRDs.
    mcast_join = 239.2.11.71
    port = 8650
# The port assigned to Portal in gmetad.conf.
    ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8650
    bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you like to share
     an xml description of the state of the cluster */
tcp_accept_channel {
    port = 8650
}

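The port = 8650 lines are the Portal group's port. As gmetad.conf shows, the Portal group uses port 8650, so in gmond.conf the UDP and TCP ports must be set to 8650 as well.

For a host in another group, use the port configured for that group in gmetad.conf. You can think of the port number as the group's code name, which may make it easier to understand.

The gmond configuration of a member of another group makes this clearer: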
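Because only the group name and the port change from group to group, the group-specific sections can be produced from a template. This is a minimal sketch, not the way the original configuration was written: the function name gmond_group_conf is made up, and the output still has to be merged by hand into a full gmond.conf generated with gmond -t.

```shell
#!/bin/sh
# Print the group-specific parts of gmond.conf for one group.
# Arguments: group name and port, which must match the group's
# data_source line in gmetad.conf. The multicast address is the one
# used in this article.
gmond_group_conf() {
    name=$1
    port=$2
    cat <<EOF
cluster {
    name = "$name"
    owner = "unspecified"
    latlong = "unspecified"
    url = "unspecified"
}
udp_send_channel {
    mcast_join = 239.2.11.71
    port = $port
    ttl = 1
}
udp_recv_channel {
    mcast_join = 239.2.11.71
    port = $port
    bind = 239.2.11.71
}
tcp_accept_channel {
    port = $port
}
EOF
}

gmond_group_conf Collector 8651
```

The example call prints the sections for the Collector group, with port 8651 in all three channels.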

cluster {
    name = "DB"
    owner = "unspecified"
    latlong = "unspecified"
    url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
    location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.    Gmond
     used to only support having a single channel */
udp_send_channel {
    #bind_hostname = yes # Highly recommended, soon to be default.
                                             # This option tells gmond to use a source address
                                             # that resolves to the machine's hostname.    Without
                                             # this, the metrics may appear to come from any
                                             # interface and the DNS names associated with
                                             # those IPs will be used to create the RRDs.
    mcast_join = 239.2.11.71
    port = 8652
    ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8652
    bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you like to share
     an xml description of the state of the cluster */
tcp_accept_channel {
    port = 8652
}

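Match each group's ports in gmond.conf to its port in gmetad.conf: 8650 for Portal, 8652 for DB. It is clear at a glance.

Screenshots of the monitoring results:

40 servers, 228 CPUs, 900 GB of memory in total, and peak network traffic totaling about 300 MB.

Heatmap: overall load of one cluster.

That is aggregate monitoring with Ganglia.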
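A quick way to confirm that a group is wired up correctly is to count the hosts in the XML that a gmond serves on its tcp_accept_channel port. The sketch below only does the counting, on XML read from standard input. The function name count_hosts and the host names in the sample are made up, and the sample is hand-written rather than real gmond output.

```shell
#!/bin/sh
# Count the <HOST ...> elements in a gmond XML dump read from stdin.
# Assumes each HOST element starts on its own line.
count_hosts() {
    grep -c '<HOST '
}

# Hand-written sample standing in for real gmond output
count_hosts <<'EOF'
<GANGLIA_XML VERSION="3.3.1" SOURCE="gmond">
<CLUSTER NAME="Portal" OWNER="unspecified">
<HOST NAME="portal-01" IP="192.168.1.43">
</HOST>
<HOST NAME="portal-02" IP="192.168.1.44">
</HOST>
</CLUSTER>
</GANGLIA_XML>
EOF
```

For the sample this prints 2. Against a live group, and assuming nc is installed, something like nc 192.168.1.43 8650 | count_hosts should print the number of hosts in the Portal group.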