Building a High-Availability Cluster with heartbeat v3 + pacemaker on CentOS 6.5

Date: 2020-07-24


1. Cluster environment

node1: 192.168.220.111

node2: 192.168.220.112

2. Preparation

Set up mutual SSH trust between the nodes:

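# node1: generate a key pair without a passphrase and copy the public key to node2
ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.220.112
# node2: generate a key pair without a passphrase and copy the public key to node1
ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.220.111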

Make each hostname match the output of uname -n, and resolve it through /etc/hosts:

# node1
hostname node1.wyb.com
sed -i 's/localhost.localdomain/node1.wyb.com/g' /etc/sysconfig/network
echo '192.168.220.111 node1.wyb.com   node1' >> /etc/hosts
echo '192.168.220.112 node2.wyb.com   node2' >> /etc/hosts
# node2
hostname node2.wyb.com
sed -i 's/localhost.localdomain/node2.wyb.com/g' /etc/sysconfig/network
echo '192.168.220.111 node1.wyb.com   node1' >> /etc/hosts
echo '192.168.220.112 node2.wyb.com   node2' >> /etc/hosts
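The echo commands above append a new line every time they are run, so repeating the preparation step leaves duplicate entries in /etc/hosts. A minimal sketch of an idempotent variant follows; the function name add_host_entry is hypothetical and not part of the original procedure.

```shell
# Append a hosts entry only if that exact line is not already present.
# Usage: add_host_entry 'ADDRESS NAME [ALIAS...]' [HOSTS_FILE]
# (add_host_entry is an illustrative helper, not a standard command.)
add_host_entry() {
    entry=$1
    hosts_file=${2:-/etc/hosts}
    # -x matches whole lines, -F treats the entry as a fixed string, -q stays silent.
    grep -qxF "$entry" "$hosts_file" 2>/dev/null || echo "$entry" >> "$hosts_file"
}

# Example calls for this cluster (run as root on both nodes):
#   add_host_entry '192.168.220.111 node1.wyb.com   node1'
#   add_host_entry '192.168.220.112 node2.wyb.com   node2'
```

Running the same call twice leaves a single copy of the line, which makes it safe to rerun during repeated reinstalls.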

Time synchronization:

# node1 node2
ntpdate asia.pool.ntp.org
echo '*/3 * * * * /usr/sbin/ntpdate asia.pool.ntp.org &> /dev/null' >> /var/spool/cron/root

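3. Installation

Starting with version 3, heartbeat split the original project into several subprojects (independent components). The components are now heartbeat, cluster-glue, and resource-agents. Their main roles:

heartbeat: the messaging layer of the cluster; it maintains information about all nodes in the cluster and handles communication between them.

cluster-glue: includes the LRM (Local Resource Manager) and STONITH, and connects heartbeat to the CRM (Cluster Resource Manager); it acts as a middle layer.

resource-agents: the resource scripts that the LRM calls to start, stop, and monitor each resource.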

Set up the yum repository:

rpm -ivh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm

Install heartbeat and pacemaker:

yum install heartbeat heartbeat-libs pacemaker pacemaker-libs resource-agents \
    cluster-glue cluster-glue-libs

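4. Configuration

heartbeat has three configuration files:

        authkeys: the key file, used to authenticate messages exchanged between cluster nodes; its permissions must be 600;
        ha.cf: the configuration file for the heartbeat service;
        haresources: the resource management configuration file;

These files do not exist in the default directory. You can create them by hand or modify the templates shipped with the package. Because pacemaker manages the resources here, haresources does not need to be copied. If the CRM manages resources and a haresources file is present in the configuration directory, the log reports that haresources is not used.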

cp -p /usr/share/doc/heartbeat-3.0.4/{authkeys,ha.cf} /etc/ha.d/

Configure the key file:

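# /dev/urandom does not block; awk keeps only the hash, because md5sum appends "  -" to its output
(echo -ne "auth 1\n1 md5 "; dd if=/dev/urandom bs=512 count=1 2>/dev/null | md5sum | awk '{print $1}') >> /etc/ha.d/authkeys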
chmod 600 /etc/ha.d/authkeys
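Since heartbeat requires mode 600 on authkeys and an auth directive that points at an existing key line, a quick check before starting the service can save a restart cycle. The following is only a sketch: check_authkeys is a hypothetical helper, and it assumes GNU stat (as on CentOS) for the -c option.

```shell
# Verify that an authkeys file has mode 600 and that the key index named
# by the "auth N" directive has a matching "N method [key]" line.
# (check_authkeys is an illustrative helper, not part of heartbeat.)
check_authkeys() {
    file=${1:-/etc/ha.d/authkeys}

    # GNU stat prints the octal permission bits with -c '%a'.
    mode=$(stat -c '%a' "$file") || return 1
    if [ "$mode" != "600" ]; then
        echo "authkeys mode is $mode, expected 600" >&2
        return 1
    fi

    # First uncommented "auth N" line selects the active key index.
    idx=$(awk '$1 == "auth" { print $2; exit }' "$file")
    if [ -z "$idx" ]; then
        echo "no auth directive found in $file" >&2
        return 1
    fi

    # There must be a line starting with that index and naming a method.
    if ! awk -v i="$idx" '$1 == i && NF >= 2 { ok = 1 } END { exit (ok ? 0 : 1) }' "$file"; then
        echo "auth $idx has no matching key line in $file" >&2
        return 1
    fi

    echo "authkeys OK: mode 600, auth $idx"
}

# Example: check_authkeys /etc/ha.d/authkeys
```

The same check can be run on node2 after the file has been copied there, since scp -p preserves the mode.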

Configure the main configuration file ha.cf:

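# Nodes do not join the cluster automatically
autojoin    none
 
# heartbeat writes debug messages to this file; ignored if use_logd is enabled
#debugfile   /var/log/ha-debug
 
# All non-debug messages go to this file; ignored if use_logd is enabled
logfile    /var/log/ha-log
 
# Log through syslog
#logfacility   local0
 
# Interval between two heartbeat packets
keepalive 1
 
# How long without heartbeats before a node in the cluster is declared dead
deadtime   30
 
# How late a heartbeat may be before a warning is logged; a late heartbeat only produces a warning and does not move services
warntime  10
 
# How long after heartbeat starts before a node may be declared dead; after boot the network can take a while to come up
initdead  120
 
# Port used for heartbeats; applies only if udpport appears before the bcast/ucast directives, otherwise the default port is used
udpport   694
 
# Network interface used to send UDP broadcast heartbeats; several interfaces may be listed
bcast eth0
 
# Network interface used for multicast heartbeats
#mcast   eth0 239.0.0.1 694 1 0
 
# Network interface used for UDP unicast heartbeats; each node names its peer, so on 10.1.1.3 this would be 10.1.1.2
#ucast  eth0 10.1.1.3
 
# Whether services move back to the primary node once it has recovered
auto_failback on
 
# The nodes in the cluster. Each name must match the output of uname -n. Several nodes may be listed in one node directive, or the directive may be repeated; every node in the cluster must be listed
node  node1.wyb.com
node  node2.wyb.com
 
# Ping node, used as a tie-breaker; it can point at the gateway
ping 192.168.220.2

# Start ipfail as user hacluster and restart it if it exits; ipfail monitors connectivity to the ping node
respawn hacluster /usr/lib64/heartbeat/ipfail

# Enable the Pacemaker cluster manager. For historical reasons this option defaults to off, but it should be kept at respawn. Setting it to respawn implicitly applies a set of related default directives
pacemaker  respawn
 
# The default configuration file contains many more options below; they are not needed yet and are omitted here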
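The timing directives in ha.cf only make sense in a certain order: a heartbeat interval shorter than the warning threshold, which in turn is shorter than deadtime. The sketch below checks that ordering, plus the rule of thumb that initdead should be at least twice deadtime. The helper names ha_value and check_ha_timing are hypothetical, and the sketch assumes all four values are plain integers in seconds, as in the configuration above.

```shell
# Print the last value given for directive $1 in ha.cf file $2.
# Commented-out lines such as "#keepalive 2" are skipped because their
# first field is not exactly the directive name.
ha_value() {
    awk -v d="$1" '$1 == d { v = $2 } END { print v }' "$2"
}

# Sanity-check the timing directives (illustrative helper, assumes integer seconds).
check_ha_timing() {
    cf=${1:-/etc/ha.d/ha.cf}
    keepalive=$(ha_value keepalive "$cf")
    warntime=$(ha_value warntime "$cf")
    deadtime=$(ha_value deadtime "$cf")
    initdead=$(ha_value initdead "$cf")

    # Reject missing or non-integer values before doing arithmetic.
    for v in "$keepalive" "$warntime" "$deadtime" "$initdead"; do
        case $v in
            ''|*[!0-9]*)
                echo "missing or non-integer timing value in $cf" >&2
                return 1 ;;
        esac
    done

    [ "$keepalive" -lt "$warntime" ] || { echo "keepalive should be below warntime" >&2; return 1; }
    [ "$warntime" -lt "$deadtime" ] || { echo "warntime should be below deadtime" >&2; return 1; }
    [ "$initdead" -ge $((2 * deadtime)) ] || { echo "initdead should be at least 2 x deadtime" >&2; return 1; }
}

# Example: check_ha_timing /etc/ha.d/ha.cf && echo "timing looks consistent"
```

With the values used here (keepalive 1, warntime 10, deadtime 30, initdead 120) the check passes.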

Copy the configuration files to node2:

scp -p /etc/ha.d/{authkeys,ha.cf} node2:/etc/ha.d/

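5. Installing crmsh

Starting with pacemaker 1.1.8, crmsh became an independent project and is no longer shipped with pacemaker, so installing pacemaker alone does not provide the crm command-line resource manager.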

# node1 node2
wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-6/x86_64/crmsh-2.1-1.6.x86_64.rpm
yum -y --nogpgcheck localinstall crmsh-2.1-1.6.x86_64.rpm

6. Problems encountered

Problem 1:

[root@node1 ~]# service heartbeat start
Starting High-Availability services:  Heartbeat failure [rc=6]. Failed.

heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Client child command [/usr/lib/heartbeat/ipfail] is not executable
heartbeat[12176]: 2015/09/11_13:30:47 info: Pacemaker support: respawn
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Client child command [/usr/lib64/heartbeat/cib] is not executable
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Directive respawn  hacluster /usr/lib64/heartbeat/cib failed
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Client child command [/usr/lib64/heartbeat/stonithd] is not executable
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Directive respawn root /usr/lib64/heartbeat/stonithd failed
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Client child command [/usr/lib64/heartbeat/attrd] is not executable
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Directive respawn  hacluster /usr/lib64/heartbeat/attrd failed
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Client child command [/usr/lib64/heartbeat/crmd] is not executable
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Directive respawn  hacluster /usr/lib64/heartbeat/crmd failed
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Heartbeat not started: configuration error.
heartbeat[12176]: 2015/09/11_13:30:47 ERROR: Configuration error, heartbeat not started.

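Solution: heartbeat looks for the Pacemaker daemons under /usr/lib64/heartbeat, but they are installed under /usr/libexec/pacemaker, so symlink them into place: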

ln -sv /usr/libexec/pacemaker/* /usr/lib64/heartbeat/
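To confirm the fix before restarting heartbeat, it can help to list which of the daemons named in the error log are still missing or not executable. This is a minimal sketch; missing_daemons is a hypothetical helper, and the daemon names are taken from the log lines above.

```shell
# Print the name of each Pacemaker daemon that is missing or not
# executable in the given directory (default: /usr/lib64/heartbeat).
# (missing_daemons is an illustrative helper, not part of heartbeat.)
missing_daemons() {
    dir=${1:-/usr/lib64/heartbeat}
    for d in cib stonithd attrd crmd; do
        # -x follows symlinks, so a dangling link also counts as missing.
        [ -x "$dir/$d" ] || echo "$d"
    done
}

# Example: missing_daemons /usr/lib64/heartbeat
```

Empty output means all four daemons are in place. Note that the log also complains about ipfail under /usr/lib/heartbeat, which this check does not cover.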

Problem 2: pacemaker fails to start

Sep 11 13:44:04 [12376] node1.wyb.com       crmd:     info: crm_ipc_connect:    Could not establish cib_shm connection: Connection refused (111)
Sep 11 13:44:05 [12376] node1.wyb.com       crmd:     info: crm_ipc_connect:    Could not establish cib_shm connection: Connection refused (111)
Sep 11 13:44:05 [12376] node1.wyb.com       crmd:     info: do_cib_control:     Could not connect to the CIB service: Transport endpoint is not connected
Sep 11 13:44:05 [12376] node1.wyb.com       crmd:  warning: do_cib_control:     Couldn't complete CIB registration 15 times... pause and retry
Sep 11 13:44:07 [12376] node1.wyb.com       crmd:     info: crm_timer_popped:   Wait Timer (I_NULL) just popped (2000ms)

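Solution: not yet solved. It is unclear whether this is a software bug or something else. Installing other package versions downloaded from http://rpm.pbone.net produced the same problem, and no solution to a similar problem could be found online.

Problem 3: when heartbeat's built-in haresources is used instead of pacemaker for resource management, the two nodes cannot exchange heartbeat messages, so the resources start on both nodes.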

Sep 18 18:43:38 node1.wyb.com heartbeat: [11374]: info: Configuration validated. Starting heartbeat 3.0.4
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: heartbeat: version 3.0.4
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: Heartbeat generation: 1442572552
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth0
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth0 - Status: 1
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: glib: ping heartbeat started.
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: G_main_add_TriggerHandler: Added signal manual handler
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: Local status now set to: 'up'
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: Link 192.168.220.2:192.168.220.2 up.
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: Status update for node 192.168.220.2: status ping
Sep 18 18:43:38 node1.wyb.com heartbeat: [11375]: info: Link node1.wyb.com:eth0 up.
Sep 18 18:44:08 node1.wyb.com heartbeat: [11375]: WARN: node node2.wyb.com: is dead

Solution: not yet solved. iptables and SELinux are both disabled and the two nodes can ping each other, so I am at a loss.
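In the log above, links come up for the ping node and for node1's own eth0, but no link to node2 is ever reported before node2 is declared dead. A small filter makes that pattern easier to spot in a long ha-log. This is only a diagnostic sketch; link_report is a hypothetical helper and it assumes the message wording shown in the log above.

```shell
# Summarize link-up and node-dead events from a heartbeat log.
# Output lines look like "up NODE:INTERFACE" or "dead NODE".
# (link_report is an illustrative helper, not part of heartbeat.)
link_report() {
    log=${1:-/var/log/ha-log}
    sed -n \
        -e 's/.*info: Link \(.*\) up\..*/up \1/p' \
        -e 's/.*WARN: node \(.*\): is dead.*/dead \1/p' \
        "$log"
}

# Example: link_report /var/log/ha-log
```

If a peer appears only in a "dead" line and never in an "up" line, its heartbeat packets were never seen on this node, which narrows the problem to the heartbeat transport rather than resource management.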

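7. Summary

In summary, heartbeat + pacemaker high availability was not successfully achieved. I ran into all sorts of odd problems, spent nearly a week, and reinstalled many times without success. Because time is limited, and because heartbeat is now in maintenance mode with no further updates while corosync is becoming the mainstream choice, I am leaving this to revisit when time allows.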

References:

heartbeat + pacemaker for automatic failover of PostgreSQL streaming replication: http://my.oschina.net/lianshunke/blog/200411?p=`currentPage-1`

Heartbeat3.0.5+pacemaker: http://my.oschina.net/guol/blog/90128

Pacemaker in detail for Linux high-availability (HA) clusters: http://www.linuxeye.com/Linux/1899.html