MySQL High Availability with MHA: Deployment Notes

 

1. Introduction to MHA

MHA (Master High Availability) is a relatively mature solution for MySQL high availability. It was developed by youshimaton of the Japanese company DeNA (now at Facebook) and is a management tool written in Perl. It works only on MySQL replication (master/slave) environments, and its goal is to keep the master highly available. It is a solid tool for failover and slave promotion in a MySQL HA setup. During a failover, MHA can complete the switch automatically within roughly 0 to 30 seconds, and while doing so it preserves data consistency as far as possible, achieving genuine high availability.

MHA automates master failover and slave promotion. It is based on standard MySQL replication (asynchronous or semi-synchronous) and consists of two parts: MHA Manager (the management node) and MHA Node (the data node).
1) MHA Manager can run on a dedicated machine and manage several master-slave clusters, or it can run on one of the slave nodes. The Manager periodically probes the nodes in the cluster; when it detects a master failure, it promotes the slave with the most recent data to be the new master and then points all other slaves at the new master. The whole failover is transparent to the application.
2) MHA Node runs on every MySQL server; it speeds up failover with scripts that parse and purge logs.

During an automatic failover, MHA tries to save the binary logs from the crashed master so that as little data as possible is lost, but this is not always feasible. For example, if the master has a hardware failure or cannot be reached over SSH, MHA cannot save the binary logs and has to fail over without the latest data. Combining MHA with MySQL 5.5 semi-synchronous replication greatly reduces the risk of data loss: as long as at least one slave has received the latest binary log events, MHA can apply them to all other slaves, keeping every node consistent.

MHA currently targets one-master-many-slaves topologies. To run MHA, a replication cluster needs at least three database servers: one master, one candidate master, and one additional slave. Because of this three-server minimum, Taobao modified MHA for cost reasons; their TMHA fork supports one master with a single slave.

2. MHA Architecture

展现了如何经过MHA Manager管理多组主从复制。能够将MHA工做原理总结为以下:mysql

Compared with other HA software, MHA focuses on keeping the master of a MySQL replication setup highly available. Its distinguishing feature is that it can reconcile the differences in relay logs between slaves so that all slaves end up with consistent data, then promote one of them to be the new master and re-point the remaining slaves to it. The workflow is:
1) save the binary log events from the crashed master;
2) identify the slave with the most recent data;
3) apply the differential relay logs to the other slaves;
4) apply the binary log events saved from the master;
5) promote one slave to be the new master;
6) point the other slaves at the new master and resume replication.

How MHA works

When the master fails, MHA compares the positions up to which each slave's I/O thread has read the master's binlog and picks the most up-to-date slave as the "latest slave". The other slaves generate differential relay logs by comparing themselves against the latest slave. MHA applies the binlog saved from the master on the latest slave and promotes it to master; finally it applies the corresponding differential relay logs on the remaining slaves and restarts replication from the new master.

During a master failover, MHA Node tries to reach the failed master over SSH. If it is reachable (i.e. it is not a hardware failure — for example only the InnoDB data files are corrupted), it saves the binary logs to minimize data loss. Using MHA together with semi-synchronous replication further reduces the risk of losing data.

MHA consists of two tool sets, the Manager tools and the Node tools, described below.
The Manager package mainly provides the following tools:

masterha_check_ssh              check MHA's SSH configuration
masterha_check_repl             check the MySQL replication status
masterha_manager                start MHA
masterha_check_status           check the current MHA running status
masterha_master_monitor         detect whether the master is down
masterha_master_switch          control failover (automatic or manual)
masterha_conf_host              add or remove a configured server entry

The Node package (these tools are normally invoked by MHA Manager scripts and rarely run by hand) mainly provides the following tools:

save_binary_logs                 save and copy the master's binary logs
apply_diff_relay_logs            identify differential relay log events and apply them to the other slaves
filter_mysqlbinlog               strip unnecessary ROLLBACK events (MHA no longer uses this tool)
purge_relay_logs                 purge relay logs (without blocking the SQL thread)
.....................................................................................................
How does MHA keep the data consistent? Mainly through the following MHA Node tools, which are triggered by MHA Manager:
save_binary_logs         if the master's binary logs are still accessible, save and copy them to minimize data loss
apply_diff_relay_logs    generate the relay log difference relative to the most up-to-date slave and apply all missing events to every other slave

Note:
What is compared is the relay log; the newer a slave's relay log, the closer it is to the master, which is what guarantees the data is most recent.
purge_relay_logs deletes relay logs without blocking the SQL thread.

Advantages of MHA

1) Fast failover
In a replication cluster, as long as the slaves have no replication lag, MHA can usually fail over within seconds: it detects the master failure in 9-10 seconds, can optionally shut the master down within another 7-10 seconds to avoid split-brain, and applies the differential relay logs to the new master within a few more seconds, so total downtime is normally 10-30 seconds. Once the new master is up, MHA recovers the remaining slaves in parallel; even with dozens of slaves this does not affect the master recovery time.

DeNA ran MHA on more than 150 MySQL (mostly 5.0/5.1) master-slave environments; when a master failed, MHA completed the failover within 4 seconds, which a traditional active/passive cluster cannot do.

2) A master failure does not cause data inconsistency
When the current master fails, MHA automatically identifies the relay log differences between slaves and applies them to all slaves, so all slaves stay consistent as long as they are alive. Used together with semi-synchronous replication, this (almost) guarantees no data loss.

3) No changes to the existing MySQL setup
One of MHA's design principles is to be as simple and easy to use as possible. MHA works with conventional master-slave replication on MySQL 5.0 and later. Compared with other HA solutions, it does not require changing the MySQL deployment, and it supports both asynchronous and semi-synchronous replication.

Starting, stopping, upgrading, downgrading, installing, or uninstalling MHA does not require touching (including restarting) MySQL replication. To upgrade MHA there is no need to stop MySQL — just install the new MHA version and restart MHA Manager.

MHA runs on stock MySQL from 5.0 onwards. Some other HA solutions require specific builds (MySQL Cluster, MySQL with global transaction IDs, and so on), but most people do not migrate an application just for master high availability: in most cases an older MySQL deployment is already in place, and nobody wants to spend time migrating to a different storage engine or a bleeding-edge release purely for HA. MHA works on stock MySQL 5.0/5.1/5.5, so no migration is needed.

4) No need for many extra servers
MHA consists of MHA Manager and MHA Node. MHA Node runs on the MySQL servers that need failover/recovery, so it requires no additional machines. MHA Manager runs on a separate server, so one extra machine is needed (two if the Manager itself must be highly available), but a single Manager can monitor a large number (even hundreds) of independent masters, so the number of extra servers stays small. Running MHA Manager on one of the slaves is also perfectly fine. In short, deploying MHA does not add many servers.

5) No performance penalty
MHA works with asynchronous or semi-synchronous replication. While monitoring the master, MHA only sends a ping every few seconds (3 seconds by default) and does not run heavy queries, so performance is as fast as plain MySQL replication.

6) Works with any storage engine
MHA works with any storage engine that MySQL replication supports; it is not limited to InnoDB. Even legacy MyISAM environments that are hard to migrate can use MHA.

3. MHA High-Availability Deployment

1) Machine environment

ip address        hostname        role
182.48.115.236    Node_Master     writes, data node
182.48.115.237    Node_Slave      reads, data node, candidate master
182.48.115.238    Manager_Slave   reads, data node, also the Manager server (manager node)
........................................................................................................
To save machines, the read-only slave 182.48.115.237 (which does not serve reads to applications) is used as the candidate master, or alternatively as a dedicated backup host.
Likewise, to save machines, the slave 182.48.115.238 doubles as the Manager server (in production, with enough machines, a dedicated host is normally used for MHA Manager).
........................................................................................................

Disable iptables and SELinux on all three machines.

Set up passwordless SSH trust between the nodes (every node must be able to SSH to every node, including itself, without a password):
[root@Node_Master ~]# ssh-copy-id 182.48.115.236
[root@Node_Master ~]# ssh-copy-id 182.48.115.237
[root@Node_Master ~]# ssh-copy-id 182.48.115.238

[root@Node_Slave ~]# ssh-copy-id 182.48.115.236
[root@Node_Slave ~]# ssh-copy-id 182.48.115.237
[root@Node_Slave ~]# ssh-copy-id 182.48.115.238

[root@Manager_Slave ~]# ssh-copy-id 182.48.115.236
[root@Manager_Slave ~]# ssh-copy-id 182.48.115.237
[root@Manager_Slave ~]# ssh-copy-id 182.48.115.238

All three nodes can now SSH to each other without a password. If any pair of hosts cannot log in to each other password-free, later steps will fail. The sketch below can be used to verify the full mesh.
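A quick way to confirm the full SSH trust mesh is to loop over every host pair and run a trivial remote command. This is a minimal sketch (not part of the original procedure); the host list is the three nodes above, and BatchMode makes ssh fail instead of prompting for a password.

#!/bin/bash
# Verify passwordless SSH between every pair of MHA nodes.
hosts="182.48.115.236 182.48.115.237 182.48.115.238"
for src in $hosts; do
  for dst in $hosts; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$src" \
         "ssh -o BatchMode=yes -o StrictHostKeyChecking=no $dst hostname" >/dev/null 2>&1; then
      echo "OK   $src -> $dst"
    else
      echo "FAIL $src -> $dst"
    fi
  done
done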

2) Hostname-based login (run on all three nodes; this step is optional)

Set the hostname on each of the three nodes (the hostnames are listed above) and add them to /etc/hosts.
The /etc/hosts entries on all three machines:
[root@Node_Master ~]# vim /etc/hosts
.......
182.48.115.236    Node_Master
182.48.115.237    Node_Slave
182.48.115.238    Manager_Slave

Verify that the nodes can SSH to each other by hostname, without a password.

3) Prepare the MySQL master-slave environment

The topology is one master with two slaves:
master: 182.48.115.236    slave: 182.48.115.237
master: 182.48.115.236    slave: 182.48.115.238

For setting up MySQL master-slave replication, see: http://www.cnblogs.com/kevingrace/p/6256603.html
.......................................................................................
------ master configuration ------
server-id = 1
log-bin = mysql-bin
binlog-ignore-db = mysql
sync_binlog = 1
binlog_checksum = none
binlog_format = mixed
------ slave 1 configuration ------
server-id = 2
log-bin = mysql-bin
binlog-ignore-db = mysql        # Important: the replication filter settings must be identical on master and slaves, otherwise masterha_check_repl will fail later!
slave-skip-errors = all
------ slave 2 configuration ------
server-id = 3
log-bin = mysql-bin
binlog-ignore-db = mysql
slave-skip-errors = all

Then grant the replication account on the master and, ideally, verify from the slaves that the granted account can connect to the master.
Then run CHANGE MASTER ... on each slave using the output of "show master status;" on the master.

Note:
If binlog-ignore-db or replicate-ignore-db filter rules are configured, they must be the same on master and slaves: if you use binlog-ignore-db, use it on both; if you use replicate-ignore-db, use it on both. Never mix different filter settings between master and slaves, because MHA checks the filter rules at startup and refuses to start monitoring or fail over if they differ. A quick consistency check is shown below.
.......................................................................................

4) Create the MHA management account (run on all three nodes)

mysql> GRANT SUPER,RELOAD,REPLICATION CLIENT,SELECT ON *.* TO manager@'182.48.115.%' IDENTIFIED BY 'manager_1234';
Query OK, 0 rows affected (0.06 sec)

mysql> GRANT CREATE,INSERT,UPDATE,DELETE,DROP ON *.* TO manager@'182.48.115.%';
Query OK, 0 rows affected (0.05 sec)

Create the replication account (also on all three nodes):
mysql> GRANT RELOAD, SUPER, REPLICATION SLAVE ON *.* TO 'repl'@'182.48.115.%' IDENTIFIED BY 'repl_1234';
Query OK, 0 rows affected (0.09 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.06 sec)
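Before moving on, it helps to confirm from each node that both accounts can actually log in to the other nodes; a missing grant is the most common reason masterha_check_repl fails later. A minimal sketch using the accounts just created:

#!/bin/bash
# Check that the 'manager' and 'repl' accounts work against every MySQL server.
for h in 182.48.115.236 182.48.115.237 182.48.115.238; do
  mysql -h "$h" -umanager -pmanager_1234 -e "SELECT 1" >/dev/null 2>&1 \
    && echo "manager OK on $h" || echo "manager FAILED on $h"
  mysql -h "$h" -urepl -prepl_1234 -e "SELECT 1" >/dev/null 2>&1 \
    && echo "repl    OK on $h" || echo "repl    FAILED on $h"
done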

5) Install MHA
MHA has a manager part and a node part:
data nodes: the hosts in the existing replication topology, at least three (one master, two slaves), so that a master-slave structure still exists after a failover; they only need the node package.
manager server: runs the monitoring scripts and is responsible for monitoring and auto-failover; it needs both the node package and the manager package.

5.1) Install MHA Node on all data nodes

Download mha4mysql-node-0.56.tar.gz
Download link: http://pan.baidu.com/s/1cphgLo
Extraction password: 7674

[root@Node_Master ~]# yum -y install perl-DBD-MySQL      # install the required Perl module first
[root@Node_Master ~]# tar -zvxf mha4mysql-node-0.56.tar.gz
[root@Node_Master ~]# cd mha4mysql-node-0.56
[root@Node_Master mha4mysql-node-0.56]# perl Makefile.PL
................................................................................................................
This step may report the following errors:
1) Can't locate ExtUtils/MakeMaker.pm in @INC (@INC contains: inc /usr/local/lib64/perl5 /usr/local/share/perl5 ......
Fix:
[root@Node_Master mha4mysql-node-0.56]# yum install perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker

2) Can't locate CPAN.pm in @INC (@INC contains: inc /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5 ....
Fix:
[root@Node_Master mha4mysql-node-0.56]# yum install -y perl-CPAN
................................................................................................................
[root@Node_Master mha4mysql-node-0.56]# make && make install
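A quick way to confirm the node package installed cleanly is to check that its tools are on the PATH and that its Perl module loads. A minimal sketch, assuming the default install locations and the MHA::NodeConst module shipped with the node package:

# Confirm the MHA Node tools and Perl module are installed.
which save_binary_logs apply_diff_relay_logs purge_relay_logs
perl -MMHA::NodeConst -e 'print "MHA Node version: $MHA::NodeConst::VERSION\n"'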

5.2) Install MHA Manager on the manager node (182.48.115.238); note that the manager node also needs MHA Node installed

First install the third-party yum repository (EPEL):
[root@Manager_Slave ~]# rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

Install the Perl MySQL packages:
[root@Manager_Slave ~]# yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Config-IniFiles perl-Time-HiRes

Install the MHA Manager package:
Download link: https://pan.baidu.com/s/1slyfXN3
Extraction password: 86wb
[root@Manager_Slave ~]# tar -vxf mha4mysql-manager-0.56.tar
[root@Manager_Slave ~]# cd mha4mysql-manager-0.56
[root@Manager_Slave mha4mysql-manager-0.56]# perl Makefile.PL
[root@Manager_Slave mha4mysql-manager-0.56]# make && make install

After installing MHA Manager, the following scripts appear under /usr/local/bin:
[root@Manager_Slave mha4mysql-manager-0.56]# ll /usr/local/bin/
total 84
-r-xr-xr-x. 1 root root 16367 May 31 21:37 apply_diff_relay_logs
-r-xr-xr-x. 1 root root  4807 May 31 21:37 filter_mysqlbinlog
-r-xr-xr-x. 1 root root  1995 May 31 22:23 masterha_check_repl
-r-xr-xr-x. 1 root root  1779 May 31 22:23 masterha_check_ssh
-r-xr-xr-x. 1 root root  1865 May 31 22:23 masterha_check_status
-r-xr-xr-x. 1 root root  3201 May 31 22:23 masterha_conf_host
-r-xr-xr-x. 1 root root  2517 May 31 22:23 masterha_manager
-r-xr-xr-x. 1 root root  2165 May 31 22:23 masterha_master_monitor
-r-xr-xr-x. 1 root root  2373 May 31 22:23 masterha_master_switch
-r-xr-xr-x. 1 root root  5171 May 31 22:23 masterha_secondary_check
-r-xr-xr-x. 1 root root  1739 May 31 22:23 masterha_stop
-r-xr-xr-x. 1 root root  8261 May 31 21:37 purge_relay_logs
-r-xr-xr-x. 1 root root  7525 May 31 21:37 save_binary_logs

Where:
masterha_check_repl             check the MySQL replication status
masterha_check_ssh              check MHA's SSH configuration
masterha_check_status           check the current MHA running status
masterha_conf_host              add or remove a configured server entry
masterha_manager                start MHA
masterha_stop                   stop MHA
masterha_master_monitor         detect whether the master is down
masterha_master_switch          control failover (automatic or manual)
masterha_secondary_check        check master reachability over one or more additional routes

In addition, the scripts under ../mha4mysql-manager-0.56/samples/scripts need to be copied to /usr/local/bin:
[root@Manager_Slave mha4mysql-manager-0.56]# cd samples/scripts/
[root@Manager_Slave scripts]# ll
total 32
-rwxr-xr-x. 1 4984  users   3648 Apr  1  2014 master_ip_failover             # VIP management script for automatic failover; optional. With keepalived you can write your own VIP handling, e.g. monitor mysql and stop keepalived when mysql fails so the VIP floats away
-rwxr-xr-x. 1 4984  users   9870 Apr  1  2014 master_ip_online_change        # VIP script for online switchover; optional, a simple shell script can do the same
-rwxr-xr-x. 1 4984  users  11867 Apr  1  2014 power_manager                  # script that powers off the master after a failure; optional
-rwxr-xr-x. 1 4984  users   1360 Apr  1  2014 send_report                    # script that sends an alert after a failover; optional, a simple shell script can do the same
[root@Manager_Slave scripts]# cp ./* /usr/local/bin/

5.3) Configure MHA on the manager node (182.48.115.238)

[root@Manager_Slave mha4mysql-manager-0.56]# mkdir -p /etc/masterha
[root@Manager_Slave mha4mysql-manager-0.56]# cp samples/conf/app1.cnf /etc/masterha/
[root@Manager_Slave mha4mysql-manager-0.56]# vim /etc/masterha/app1.cnf
[server default]
manager_workdir=/var/log/masterha/app1             # manager working directory
manager_log=/var/log/masterha/app1/manager.log     # manager log file

ssh_user=root                                      # account used for passwordless SSH
repl_user=repl                                     # MySQL replication account, used to sync binary logs between master and slaves
repl_password=repl_1234                            # password of the replication account created earlier
ping_interval=1                                    # interval in seconds between ping probes of the master (default 3); failover is triggered after three probes get no answer
master_ip_failover_script= /usr/local/bin/master_ip_failover                # script run during automatic failover
master_ip_online_change_script= /usr/local/bin/master_ip_online_change      # script run during a manual/online switchover

[server1]
hostname=182.48.115.236
port=3306
master_binlog_dir=/data/mysql/data/        # where the master keeps its binlogs so MHA can find them; here it is simply the MySQL data directory

[server2]
hostname=182.48.115.237
port=3306
candidate_master=1           # mark this host as the candidate master: when the master fails, this slave is promoted first, even if it is not the most up-to-date slave in the cluster
check_repl_delay=0           # by default MHA will not choose a slave that is more than 100MB of relay logs behind the master, because recovering it takes long; with check_repl_delay=0 replication delay is ignored when choosing the new master. Useful together with candidate_master=1, since it guarantees the candidate becomes the new master during a switchover
master_binlog_dir=/data/mysql/data/

[server3]
hostname=182.48.115.238
port=3306
#candidate_master=1
master_binlog_dir=/data/mysql/data/

#[server4]
#hostname=host4
#no_master=1

5.4) Configure relay log purging (on both slave nodes)

[root@Node_Slave ~]# mysql -p123456 -e 'set global relay_log_purge=0'
[root@Manager_Slave ~]# mysql -p123456 -e 'set global relay_log_purge=0'
..................................................................................................
Note:
During a failover, slave recovery relies on the relay logs, so automatic relay log purging must be turned OFF and the relay logs purged manually instead.
By default a slave deletes its relay logs once the SQL thread has executed them, but in an MHA setup those relay logs may be needed to recover other slaves, so automatic deletion has to be disabled. Periodic relay log cleanup must also take replication delay into account: on an ext3 filesystem, deleting a large file takes noticeable time and can cause severe replication lag. To avoid that, a hard link to the relay log is created first, because deleting a large file through a hard link is fast on Linux (the same trick is commonly used when dropping large tables in MySQL).

The MHA node package ships the purge_relay_logs tool, which creates hard links for the relay logs, runs SET GLOBAL relay_log_purge=1, waits a few seconds for the SQL thread to switch to a new relay log, and then runs SET GLOBAL relay_log_purge=0.

purge_relay_logs parameters:
--user mysql             user name
--password mysql         password
--port                   port
--workdir                where to create the relay log hard links; defaults to /var/tmp. Creating hard links across different partitions fails, so point it at a directory on the same filesystem; after the script succeeds, the hard-linked relay log files are removed
--disable_relay_log_purge     by default the script does nothing and exits if relay_log_purge=1; with this option it sets relay_log_purge to 0, purges the relay logs, and leaves the variable OFF afterwards

Set up a periodic relay log purge script (on both slave nodes):
[root@Node_Slave ~]# vim /root/purge_relay_log.sh
#!/bin/bash
user=root
passwd=123456
port=3306
host=localhost
log_dir='/data/masterha/log'
work_dir='/data'
purge='/usr/local/bin/purge_relay_logs'

if [ ! -d $log_dir ]
then
    mkdir $log_dir -p
fi

$purge --user=$user --host=$host --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1

[root@Node_Slave ~]# chmod 755 /root/purge_relay_log.sh

Add it to crontab for periodic execution:
[root@Node_Slave ~]# crontab -e
0 4 * * * /bin/bash /root/purge_relay_log.sh

purge_relay_logs does not block the SQL thread while deleting relay logs. Run it by hand to see what happens:
[root@Node_Slave ~]# /usr/local/bin/purge_relay_logs --user=root --host=localhost --password=123456 --disable_relay_log_purge --port=3306 --workdir=/data
2017-05-31 23:27:13: purge_relay_logs script started.
  Found relay_log.info: /data/mysql/data/relay-log.info
  Opening /data/mysql/data/mysql-relay-bin.000002 ..
  Opening /data/mysql/data/mysql-relay-bin.000003 ..
  Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
2017-05-31 23:27:17: All relay log purging operations succeeded.

[root@Node_Slave ~]# ll /data/masterha/log/
total 4
-rw-r--r--. 1 root root 905 May 31 23:26 purge_relay_logs.log

5.5) Check the SSH configuration

Check SSH connectivity from MHA Manager to all MHA Nodes:
[root@Manager_Slave ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Wed May 31 23:06:01 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed May 31 23:06:01 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Wed May 31 23:06:01 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Wed May 31 23:06:01 2017 - [info] Starting SSH connection tests..
Wed May 31 23:06:04 2017 - [debug]
Wed May 31 23:06:01 2017 - [debug]  Connecting via SSH from root@182.48.115.236(182.48.115.236:22) to root@182.48.115.237(182.48.115.237:22)..
Wed May 31 23:06:02 2017 - [debug]   ok.
Wed May 31 23:06:02 2017 - [debug]  Connecting via SSH from root@182.48.115.236(182.48.115.236:22) to root@182.48.115.238(182.48.115.238:22)..
Wed May 31 23:06:03 2017 - [debug]   ok.
Wed May 31 23:06:04 2017 - [debug]
Wed May 31 23:06:01 2017 - [debug]  Connecting via SSH from root@182.48.115.237(182.48.115.237:22) to root@182.48.115.236(182.48.115.236:22)..
Wed May 31 23:06:03 2017 - [debug]   ok.
Wed May 31 23:06:03 2017 - [debug]  Connecting via SSH from root@182.48.115.237(182.48.115.237:22) to root@182.48.115.238(182.48.115.238:22)..
Wed May 31 23:06:04 2017 - [debug]   ok.
Wed May 31 23:06:04 2017 - [debug]
Wed May 31 23:06:02 2017 - [debug]  Connecting via SSH from root@182.48.115.238(182.48.115.238:22) to root@182.48.115.236(182.48.115.236:22)..
Wed May 31 23:06:03 2017 - [debug]   ok.
Wed May 31 23:06:03 2017 - [debug]  Connecting via SSH from root@182.48.115.238(182.48.115.238:22) to root@182.48.115.237(182.48.115.237:22)..
Wed May 31 23:06:04 2017 - [debug]   ok.
Wed May 31 23:06:04 2017 - [info] All SSH connection tests passed successfully.
 
Every node-to-node SSH check passes.

5.6) Check the replication environment with the MHA tools

Check the replication status of the whole MySQL cluster with masterha_check_repl:
[root@Manager_Slave ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Wed May 31 23:43:43 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Wed May 31 23:43:43 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Wed May 31 23:43:43 2017 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Wed May 31 23:43:43 2017 - [info] MHA::MasterMonitor version 0.56.
Wed May 31 23:43:43 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln301] Got MySQL error when connecting 182.48.115.237(182.48.115.237:3306) :1045:Access denied for user 'root'@'182.48.115.238' (using password: NO), but this is not a MySQL crash. Check MySQL server settings.
  at /usr/local/share/perl5/MHA/ServerManager.pm line 297
Wed May 31 23:43:43 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln301] Got MySQL error when connecting 182.48.115.236(182.48.115.236:3306) :1045:Access denied for user 'root'@'182.48.115.238' (using password: NO), but this is not a MySQL crash. Check MySQL server settings.
  at /usr/local/share/perl5/MHA/ServerManager.pm line 297
Wed May 31 23:43:43 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln301] Got MySQL error when connecting 182.48.115.238(182.48.115.238:3306) :1045:Access denied for user 'root'@'182.48.115.238' (using password: NO), but this is not a MySQL crash. Check MySQL server settings.
  at /usr/local/share/perl5/MHA/ServerManager.pm line 297
Wed May 31 23:43:43 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln309] Got fatal error, stopping operations
Wed May 31 23:43:43 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Wed May 31 23:43:43 2017 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Wed May 31 23:43:43 2017 - [info] Got exit code 1 (Not master dead).
 
MySQL Replication Health is NOT OK!
 
The replication check fails!
The reason is that the manager cannot connect to the nodes' MySQL as root remotely.
..............................................................................................................
Fix: on all three nodes, grant MySQL access so that hosts in 182.48.115.% can log in as root without a password, i.e.
mysql> update mysql.user set password=password("") where user="root" and host="182.48.115.%";    # if this row does not exist yet, create the account/privilege with a GRANT statement instead
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
 
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
 
mysql> select user,host,password from mysql.user;
+---------+--------------+-------------------------------------------+
| user    | host         | password                                  |
+---------+--------------+-------------------------------------------+
.........
| root    | 182.48.115.% |                                           |
+---------+--------------+-------------------------------------------+
11 rows in set (0.00 sec)
..............................................................................................................
 
Then run masterha_check_repl again to check the cluster's replication status:
[root@Manager_Slave ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
..............................
Bareword "FIXME_xxx" not allowed while "strict subs" in use at /usr/local/bin/master_ip_failover line 93.
 
The check still fails. The reason is:
MHA supports two failover approaches — a virtual IP address or a global configuration file — and does not force either one; the user chooses. The virtual IP approach involves additional software such as keepalived and also requires editing the master_ip_failover script, whose sample version is incomplete (hence the FIXME error above).

Fix:
Add symlinks (on all nodes):
[root@Manager_Slave ~]# ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
[root@Manager_Slave ~]# ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql

Temporarily comment out master_ip_failover_script= /usr/local/bin/master_ip_failover in /etc/masterha/app1.cnf on the manager node.
It will be re-enabled later, after keepalived is introduced and the script is fixed.
[root@Manager_Slave ~]# cat /etc/masterha/app1.cnf
.........
#master_ip_failover_script= /usr/local/bin/master_ip_failover

Finally, run masterha_check_repl once more:
[root@Manager_Slave ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Jun  1 00:20:58 2017 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Jun  1 00:20:58 2017 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
 
Thu Jun  1 00:20:58 2017 - [info]  read_only=1 is not set on slave 182.48.115.237(182.48.115.237:3306).
Thu Jun  1 00:20:58 2017 - [warning]  relay_log_purge=0 is not set on slave 182.48.115.237(182.48.115.237:3306).
Thu Jun  1 00:20:58 2017 - [info]  read_only=1 is not set on slave 182.48.115.238(182.48.115.238:3306).
Thu Jun  1 00:20:58 2017 - [warning]  relay_log_purge=0 is not set on slave 182.48.115.238(182.48.115.238:3306).
Thu Jun  1 00:20:58 2017 - [info] Checking replication filtering settings..
Thu Jun  1 00:20:58 2017 - [info]  binlog_do_db= , binlog_ignore_db= mysql
Thu Jun  1 00:20:58 2017 - [info]  Replication filtering check ok.
Thu Jun  1 00:20:58 2017 - [info] GTID (with auto-pos) is not supported
Thu Jun  1 00:20:58 2017 - [info] Starting SSH connection tests..
Thu Jun  1 00:21:02 2017 - [info] All SSH connection tests passed successfully.
...........

Thu Jun  1 00:21:07 2017 - [info] Checking replication health on 182.48.115.237..
Thu Jun  1 00:21:07 2017 - [info]  ok.
Thu Jun  1 00:21:07 2017 - [info] Checking replication health on 182.48.115.238..
Thu Jun  1 00:21:07 2017 - [info]  ok.
Thu Jun  1 00:21:07 2017 - [warning] master_ip_failover_script is not defined.
Thu Jun  1 00:21:07 2017 - [warning] shutdown_script is not defined.
Thu Jun  1 00:21:07 2017 - [info] Got exit code 0 (Not master dead).
 
MySQL Replication Health is OK.
 
This time the whole replication environment checks out OK.

6) Managing MHA
6.1) Check MHA Manager status

Check the Manager status with masterha_check_status:
[root@Manager_Slave ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).

Note: when the manager is running normally this shows "PING_OK"; otherwise it shows "NOT_RUNNING", meaning MHA monitoring is not started.

6.2) Start MHA Manager monitoring

Start the manager in the background with:
[root@Manager_Slave ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

Startup options:
--remove_dead_master_conf      after a master switchover, remove the old master's entry from the configuration file.
--manager_log                  log file location.
--ignore_last_failover         by default, MHA refuses to fail over again if the previous failover happened less than 8 hours ago, to avoid ping-pong failovers. After a failover, MHA writes an app1.failover.complete file into the manager working directory configured above; while that file exists a new failover is refused unless the file is deleted first. This option tells MHA to ignore that file and is set here for convenience.

Check the MHA Manager status again:
[root@Manager_Slave ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:2542) is running(0:PING_OK), master:182.48.115.236

Monitoring is active and the master is 182.48.115.236.

Check the startup log:
[root@Manager_Slave ~]# tail -n20 /var/log/masterha/app1/manager.log
   Checking slave recovery environment settings..
     Opening /data/mysql/data/relay-log.info ... ok.
     Relay log found at /data/mysql/data, up to mysql-relay-bin.000006
     Temporary relay log file is /data/mysql/data/mysql-relay-bin.000006
     Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
     Testing mysqlbinlog output.. done.
     Cleaning up test file(s).. done.
Thu Jun  1 00:37:29 2017 - [info] Slaves settings check done.
Thu Jun  1 00:37:29 2017 - [info]
182.48.115.236(182.48.115.236:3306) (current master)
  +--182.48.115.237(182.48.115.237:3306)
  +--182.48.115.238(182.48.115.238:3306)

Thu Jun  1 00:37:29 2017 - [warning] master_ip_failover_script is not defined.
Thu Jun  1 00:37:29 2017 - [warning] shutdown_script is not defined.
Thu Jun  1 00:37:29 2017 - [info] Set master ping interval 1 seconds.
Thu Jun  1 00:37:29 2017 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Thu Jun  1 00:37:29 2017 - [info] Starting ping health check on 182.48.115.236(182.48.115.236:3306)..
Thu Jun  1 00:37:29 2017 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

The line "Ping(SELECT) succeeded, waiting until MySQL doesn't respond.." shows that monitoring has started.

6.3) Stop MHA Manager monitoring

Stopping is simple — use masterha_stop:
[root@Manager_Slave ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+  Exit 1                   nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1
[root@Manager_Slave ~]#

Check the MHA Manager status — it is now stopped:
[root@Manager_Slave ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).

7) Configure the VIP
The VIP can be managed in two ways: with keepalived floating the virtual IP, or with a plain script that brings the virtual IP up and down (no keepalived/heartbeat-style software needed).

Option 1: manage the VIP with keepalived

7.1) Download and install keepalived (on both "master" machines — strictly one master, 182.48.115.236, and one candidate master, 182.48.115.237, which remains a slave until a switchover happens)

[root@Node_Master ~] # yum install -y openssl-devel
[root@Node_Master ~] # wget http://www.keepalived.org/software/keepalived-1.3.5.tar.gz
[root@Node_Master ~] # tar -zvxf keepalived-1.3.5.tar.gz
[root@Node_Master ~] # cd keepalived-1.3.5
[root@Node_Master keepalived-1.3.5] # ./configure --prefix=/usr/local/keepalived
[root@Node_Master keepalived-1.3.5] # make && make install
 
[root@Node_Master keepalived-1.3.5] # cp keepalived/etc/init.d/keepalived /etc/init.d/
[root@Node_Master keepalived-1.3.5] # cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
[root@Node_Master keepalived-1.3.5] # mkdir /etc/keepalived
[root@Node_Master keepalived-1.3.5] # cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
[root@Node_Master keepalived-1.3.5] # cp /usr/local/keepalived/sbin/keepalived /usr/sbin/

7.2) Configure keepalived

------------ keepalived configuration on the master (node 182.48.115.236) ------------------
[root@Node_Master ~] # cp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.bak
[root@Node_Master ~] # vim /etc/keepalived/keepalived.conf
! Configuration File  for  keepalived
 
global_defs {
      notification_email {
      wangshibo@huanqiu.cn
    }
    notification_email_from ops@huanqiu.cn
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id MySQL-HA
}
 
vrrp_instance VI_1 {
     state BACKUP
     interface eth1
     virtual_router_id 51
     priority 150
     advert_int 1
     nopreempt
 
     authentication {
     auth_type PASS
     auth_pass 1111
     }
 
     virtual_ipaddress {
         182.48.115.239
     }
}
 
Here router_id MySQL-HA names the keepalived group, the virtual IP 182.48.115.239 is bound to the host's eth1 interface, the state is set to BACKUP, keepalived runs in non-preemptive mode (nopreempt), and priority 150 is this node's priority.
 
------------ keepalived configuration on the candidate master (node 182.48.115.237) ------------------
[root@Node_Slave ~] # vim /etc/keepalived/keepalived.conf
! Configuration File  for  keepalived
 
global_defs {
      notification_email {
      wangshibo@huanqiu.cn
    }
    notification_email_from ops@huanqiu.cn
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id MySQL-HA
}
 
vrrp_instance VI_1 {
     state BACKUP
     interface eth1
     virtual_router_id 51
     priority 120
     advert_int 1
     nopreempt
 
     authentication {
     auth_type PASS
     auth_pass 1111
     }
 
     virtual_ipaddress {
         182.48.115.239
     }
}

7.3) Start the keepalived service

-------------- start keepalived on the master and check the log ----------------
[root@Node_Master ~]# /etc/init.d/keepalived start
Starting keepalived:                                      [OK]
[root@Node_Master ~] # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:5f:58:dc brd ff:ff:ff:ff:ff:ff
    inet 182.48.115.236/27 brd 182.48.115.255 scope global eth0
    inet 182.48.115.239/32 scope global eth0
    inet6 fe80::5054:ff:fe5f:58dc/64 scope link
       valid_lft forever preferred_lft forever
 
[root@Node_Master ~]# tail -100 /var/log/messages
..........
Jun  1 02:12:10 percona1 Keepalived_vrrp[10329]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on eth0 for 182.48.115.239
Jun  1 02:12:10 percona1 Keepalived_vrrp[10329]: Sending gratuitous ARP on eth0 for 182.48.115.239
Jun  1 02:12:10 percona1 Keepalived_vrrp[10329]: Sending gratuitous ARP on eth0 for 182.48.115.239
Jun  1 02:12:10 percona1 Keepalived_vrrp[10329]: Sending gratuitous ARP on eth0 for 182.48.115.239
Jun  1 02:12:10 percona1 Keepalived_vrrp[10329]: Sending gratuitous ARP on eth0 for 182.48.115.239

The VIP is now bound to the master node 182.48.115.236.

-------------- start keepalived on the candidate master ----------------
[root@Node_Slave ~]# /etc/init.d/keepalived start
Starting keepalived:                                      [OK]
[root@Node_Slave ~] # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:1b:6e:53 brd ff:ff:ff:ff:ff:ff
    inet 182.48.115.237/27 brd 182.48.115.255 scope global eth0
    inet6 fe80::5054:ff:fe1b:6e53/64 scope link
       valid_lft forever preferred_lft forever
 
.....................................................................
The output above shows keepalived is configured correctly.

Note:
Both servers run keepalived in BACKUP state. keepalived supports two modes, master->backup and backup->backup, and they behave quite differently.
In master->backup mode, when the primary fails the VIP floats to the backup, but when the primary recovers and keepalived starts again it takes the VIP back, even with non-preemptive mode (nopreempt) configured. In backup->backup mode, when the primary fails the VIP floats to the backup, and when the old primary and its keepalived come back it does not take the VIP away from the new primary, even if its priority is higher. To reduce the number of VIP moves, the repaired old master is normally treated as the new standby.
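After a switchover it is useful to confirm quickly which node currently holds the VIP. A small sketch using the VIP and hosts of this setup:

# Show which node currently holds the virtual IP 182.48.115.239.
for h in 182.48.115.236 182.48.115.237; do
  printf '%s: ' "$h"
  ssh "$h" "ip -o addr | grep -q 182.48.115.239 && echo 'holds the VIP' || echo 'no VIP'"
done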

7.4) Hook keepalived into MHA (have MHA stop keepalived when the MySQL process on the master dies)

To bring keepalived under MHA's control, only the script triggered during a switchover, master_ip_failover, needs to be modified, adding keepalived handling for when the master goes down.
Edit /usr/local/bin/master_ip_failover so that it looks like this:
 
[root@Manager_Slave ~] # vim /usr/local/bin/master_ip_failover
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '182.48.115.239';
my $ssh_start_vip = "/etc/init.d/keepalived start";
my $ssh_stop_vip = "/etc/init.d/keepalived stop";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        #`ssh $ssh_user\@cluster1 \" $ssh_start_vip \"`;
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

# A simple system call that enable the VIP on the new master
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
    return 0 unless ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
 
 
Now that the script has been fixed, re-enable the option that was commented out earlier and check the cluster status again:
[root@Manager_Slave ~]# grep 'master_ip_failover_script' /etc/masterha/app1.cnf
master_ip_failover_script= /usr/local/bin/master_ip_failover
 
[root@Manager_Slave ~] # masterha_check_repl --conf=/etc/masterha/app1.cnf
.......
Checking the Status of the script.. OK
Thu Jun  1 03:31:57 2017 - [info]  OK.
Thu Jun  1 03:31:57 2017 - [warning] shutdown_script is not defined.
Thu Jun  1 03:31:57 2017 - [info] Got  exit  code 0 (Not master dead).
 
MySQL Replication Health is OK.
 
Replication is healthy again.
The logic added to /usr/local/bin/master_ip_failover means that when the master database fails, MHA triggers a failover, MHA Manager stops keepalived on the old master, the VIP floats to the candidate slave, and the switchover completes. Alternatively, keepalived itself can run a script that checks whether mysql is alive and kills the keepalived process when it is not; such a check script is sketched below.
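The article does not include that keepalived-side check script. The following is a minimal sketch, assuming the local root password used elsewhere in this article and that stopping keepalived is the desired way to release the VIP; it could be run from cron or wrapped in a keepalived vrrp_script.

#!/bin/bash
# Stop keepalived (releasing the VIP) when the local MySQL instance stops responding.
# The password is the one used throughout this article; adjust as needed.
if ! mysqladmin -uroot -p123456 ping >/dev/null 2>&1; then
    echo "$(date '+%F %T') mysql is down, stopping keepalived" >> /var/log/mysql_vip_check.log
    /etc/init.d/keepalived stop
fi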

Option 2: manage the VIP with a script
Here /usr/local/bin/master_ip_failover is modified as shown below; in addition, the VIP has to be bound to the master manually once.

1) Bind the VIP on the current master node

[root@Master_node ~]# ifconfig eth0:0 182.48.115.239/27            # the subnet mask here is /27; /24 is more typical
[root@Master_node ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 52:54:00:5F:58:DC
          inet addr:182.48.115.236  Bcast:182.48.115.255  Mask:255.255.255.224
          inet6 addr: fe80::5054:ff:fe5f:58dc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:25505 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3358 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3254957 (3.1 MiB)  TX bytes:482420 (471.1 KiB)

eth0:0    Link encap:Ethernet  HWaddr 52:54:00:5F:58:DC
          inet addr:182.48.115.239  Bcast:182.48.115.255  Mask:255.255.255.224
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
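On newer systems where ifconfig may not be installed, the same binding can be done with iproute2. A sketch assuming the same interface and VIP as above:

# Equivalent VIP binding with iproute2 instead of ifconfig:
ip addr add 182.48.115.239/27 dev eth0 label eth0:0
# and to remove it again:
ip addr del 182.48.115.239/27 dev eth0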

2) Edit /usr/local/bin/master_ip_failover on the manager node

[root@Manager_Slave ~] # cat /usr/local/bin/master_ip_failover
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '182.48.115.239/27';
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
    return 0 unless ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
 
Remember to uncomment master_ip_failover_script in /etc/masterha/app1.cnf.
To reduce the chance of split-brain, the script-based VIP management is recommended for production rather than keepalived. At this point the basic MHA cluster is configured.
Next come the actual tests, to see how MHA behaves in practice.

8) Failover
1) Automatic failover (MHA Manager must be running first, otherwise no automatic failover happens; manual failover, by contrast, does not need MHA Manager monitoring to be running)

1.1) Generate test data on the master with sysbench
[root@Master_node ~]# yum install sysbench -y

On the master (182.48.115.236), run a sysbench prepare that creates the sbtest table with 1,000,000 rows in the sbtest database:
[root@Master_node ~]# sysbench --test=oltp --oltp-table-size=1000000 --oltp-read-only=off --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=1800 --mysql-user=root --mysql-socket=/local/mysql/var/mysql.sock --mysql-password=123456 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex prepare

1.2) Stop the slave I/O thread on the candidate master (182.48.115.237) to simulate replication lag.

mysql> stop slave io_thread;
Query OK, 0 rows affected (0.08 sec)

Note: the other slave keeps its I/O thread running and continues to receive the binlog.

1.3) Run a sysbench load test

Run a load test against the master (182.48.115.236) for 3 minutes to generate a large amount of binlog:
[root@Master_node ~]# sysbench --test=oltp --oltp-table-size=1000000 --oltp-read-only=off --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=180 --mysql-user=root --mysql-socket=/local/mysql/var/mysql.sock --mysql-password=123456 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex run

1.4) Restart the I/O thread on the candidate master (182.48.115.237) so it catches up with the master's binlog.

mysql> start slave io_thread;    
Query OK, 0 rows affected (0.00 sec)

1.5) Kill the mysql process on the master (182.48.115.236) to simulate a master crash and trigger the automatic failover.

[root@Master_node ~] # pkill -9 mysqld

1.6) Check the MHA switchover log on the manager node (182.48.115.238) to follow the whole process.

[root@Manager_Slave ~] # cat /var/log/masterha/app1/manager.log
........
........
----- Failover Report -----
 
app1: MySQL Master failover 182.48.115.236 to 182.48.115.237 succeeded
 
Master 182.48.115.236 is down!
 
Check MHA Manager logs at server01: /var/log/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 182.48.115.236.
The latest slave 182.48.115.237(182.48.115.237:3306) has all relay logs for recovery.
Selected 182.48.115.237 as a new master.
182.48.115.237: OK: Applying all logs succeeded.
182.48.115.237: OK: Activated master IP address.
192.168.0.70: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.0.70: OK: Applying all logs succeeded. Slave started, replicating from 182.48.115.237.
182.48.115.237: Resetting slave info succeeded.
Master failover to 182.48.115.237(182.48.115.237:3306) completed successfully.
 
The final line "Master failover to 182.48.115.237(182.48.115.237:3306) completed successfully." shows that the candidate master has taken over.

The output above shows the whole MHA switchover, which consists of the following steps:
1) configuration check phase, validating the cluster configuration file;
2) handling of the dead master, including removing the VIP and, optionally, powering the host off (not implemented here yet);
3) copying the relay log difference between the dead master and the most up-to-date slave into MHA Manager's working directory;
4) identifying the slave with the latest data;
5) applying the binary log events saved from the master;
6) promoting one slave to be the new master;
7) pointing the other slaves at the new master and resuming replication.

Finally, start MHA Manager monitoring again and check which node is now the master:
[root@Manager_Slave ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:13301) is running(0:PING_OK), master:182.48.115.237

2) Manual failover (MHA Manager must not be running)

Manual failover means MHA automatic switching is not enabled for the service; when the master fails, an operator invokes MHA by hand to perform the failover:

Make sure MHA Manager is stopped:
[root@Manager_Slave ~]# masterha_stop --conf=/etc/masterha/app1.cnf

Note: if MHA Manager detects that no server is actually dead, it reports an error and aborts the failover.
[root@Manager_Slave ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=182.48.115.236 --dead_master_port=3306 --new_master_host=182.48.115.237 --new_master_port=3306 --ignore_last_failover
The output asks interactively whether to proceed with the switch:
........
----- Failover Report -----

app1: MySQL Master failover 182.48.115.236 to 182.48.115.237 succeeded

Master 182.48.115.236 is down!

Check MHA Manager logs at server01 for details.

Started manual(interactive) failover.
Invalidated master IP address on 182.48.115.236.
The latest slave 182.48.115.237(182.48.115.237:3306) has all relay logs for recovery.
Selected 182.48.115.237 as a new master.
182.48.115.237: OK: Applying all logs succeeded.
182.48.115.237: OK: Activated master IP address.
192.168.0.70: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.0.70: OK: Applying all logs succeeded. Slave started, replicating from 182.48.115.237.
182.48.115.237: Resetting slave info succeeded.
Master failover to 182.48.115.237(182.48.115.237:3306) completed successfully.

This simulates manually promoting the candidate (182.48.115.237) to master after the original master has gone down.

9) Online master switchover

In many situations the master needs to be migrated to another machine: the master's hardware is failing, the RAID controller needs rebuilding, the master should move to better hardware, and so on. Maintaining the master degrades performance and at minimum causes downtime for writes, and blocking or killing running sessions can leave data inconsistent between the two masters. MHA provides a fast switchover with graceful write blocking: the switch typically blocks writes for only 0.5-2 seconds, which is acceptable in many cases, so switching the master does not require scheduling a maintenance window.

The rough steps of an MHA online switchover:
1) verify the replication settings and identify the current master;
2) determine the new master;
3) block writes on the current master;
4) wait for all slaves to catch up with replication;
5) enable writes on the new master;
6) re-point the slaves.

When switching online, the application architecture needs to consider two issues:
1) how master and slaves are identified after the switch (the master machine changes); using a VIP largely solves this;
2) load balancing (roughly defined read/write ratios and per-machine load shares need to be reconsidered when a machine leaves the cluster).

To guarantee full data consistency and complete the switch as quickly as possible, MHA's online switchover only proceeds when all of the following hold, otherwise it fails (the lag can be pre-checked with the sketch below):
1) the I/O thread is running on every slave;
2) the SQL thread is running on every slave;
3) Seconds_Behind_Master in SHOW SLAVE STATUS is less than or equal to running_updates_limit seconds on every slave (if running_updates_limit is not given during the switch, it defaults to 1 second);
4) on the master, SHOW PROCESSLIST shows no update taking longer than running_updates_limit seconds.
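A quick pre-check of these conditions can save a failed switchover attempt. A minimal sketch, assuming the 'manager' account created earlier can query the slaves (adjust the host list to whatever the current slaves are):

#!/bin/bash
# Check slave thread state and lag before an online switchover.
for h in 182.48.115.237 182.48.115.238; do
  echo "== $h =="
  mysql -h "$h" -umanager -pmanager_1234 -e "SHOW SLAVE STATUS\G" \
    | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
done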
 
Online switchover steps:
First, stop MHA monitoring on the manager node:
[root@Manager_Slave ~]# masterha_stop --conf=/etc/masterha/app1.cnf

Then perform the online switch (simulating switching the master online: the original master 182.48.115.236 becomes a slave and 182.48.115.237 is promoted to the new master):
[root@Manager_Slave ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=182.48.115.237 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
.........
Thu Jun  1 00:28:02 2014 - [info]  Executed CHANGE MASTER.
Thu Jun  1 00:28:02 2014 - [info]  Slave started.
Thu Jun  1 00:28:02 2014 - [info] All new slave servers switched successfully.
Thu Jun  1 00:28:02 2014 - [info]
Thu Jun  1 00:28:02 2014 - [info] * Phase 5: New master cleanup phease..
Thu Jun  1 00:28:02 2014 - [info]
Thu Jun  1 00:28:02 2014 - [info]  192.168.0.60: Resetting slave info succeeded.
Thu Jun  1 00:28:02 2014 - [info] Switching master to 192.168.0.60(192.168.0.60:3306) completed successfully.

Meaning of the options:
--orig_master_is_new_slave       turn the original master into a slave of the new master after the switch; without this option the original master is simply not restarted
--running_updates_limit=10000    without this option the switch fails if the candidate master has replication delay; with it, the switch is allowed as long as the delay is within this many seconds. How long the switch itself takes is still determined by the amount of relay log that has to be recovered.
Note:
The online switch calls the master_ip_online_change script, but the sample script is incomplete and needs to be adjusted as follows:
[root@Manager_Slave ~] # vim /usr/local/bin/master_ip_online_change
#!/usr/bin/env perl
 
use strict;
use warnings FATAL =>  'all' ;
 
use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw(  sleep  gettimeofday tv_interval );
use Data::Dumper;
 
my $_tstart;
my $_running_interval = 0.1;
my (
  $command,          $orig_master_host, $orig_master_ip,
  $orig_master_port, $orig_master_user,
  $new_master_host,  $new_master_ip,    $new_master_port,
  $new_master_user,
);
 
 
my $vip =  '182.48.115.239/27' ;   # Virtual IP
my $key =  "1" ;
my $ssh_start_vip =  "/sbin/ifconfig eth1:$key $vip" ;
my $ssh_stop_vip =  "/sbin/ifconfig eth1:$key down" ;
my $ssh_user =  "root" ;
my $new_master_password= '123456' ;
my $orig_master_password= '123456' ;
GetOptions(
  'command=s'               => \$command,
   #'ssh_user=s'             => \$ssh_user, 
   'orig_master_host=s'      => \$orig_master_host,
   'orig_master_ip=s'        => \$orig_master_ip,
   'orig_master_port=i'      => \$orig_master_port,
   'orig_master_user=s'      => \$orig_master_user,
   #'orig_master_password=s' => \$orig_master_password,
   'new_master_host=s'       => \$new_master_host,
   'new_master_ip=s'         => \$new_master_ip,
   'new_master_port=i'       => \$new_master_port,
   'new_master_user=s'       => \$new_master_user,
   #'new_master_password=s'  => \$new_master_password,
);
 
exit  &main();
 
sub current_time_us {
   my ( $sec, $microsec ) = gettimeofday();
   my $curdate = localtime($sec);
   return  $curdate .  " "  . sprintf(  "%06d" , $microsec );
}
 
sub sleep_until {
   my $elapsed = tv_interval($_tstart);
   if  ( $_running_interval > $elapsed ) {
     sleep ( $_running_interval - $elapsed );
   }
}
 
sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;
 
   my $sth = $dbh->prepare( "SHOW PROCESSLIST" );
   $sth->execute();
 
  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );

    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }

    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }

    push @threads, $ref;
  }
  return @threads;
}
 
sub main {
  if ( $command eq "stop" ) {
     ## Gracefully killing connections on the current master
     # 1. Set read_only= 1 on the new master
     # 2. DROP USER so that no app user can establish new connections
     # 3. Set read_only= 1 on the current master
     # 4. Kill current queries
     # * Any database access failure will result in script die.
     my $exit_code = 1;
     eval  {
       ## Setting read_only=1 on the new master (to avoid accident)
       my $new_master_handler = new MHA::DBHelper();
 
       # args: hostname, port, user, password, raise_error(die_on_error)_or_not
       $new_master_handler->connect( $new_master_ip, $new_master_port,
         $new_master_user, $new_master_password, 1 );
       print current_time_us() .  " Set read_only on the new master.. " ;
       $new_master_handler->enable_read_only();
       if  ( $new_master_handler->is_read_only() ) {
         print  "ok.\n" ;
       }
       else  {
         die  "Failed!\n" ;
       }
       $new_master_handler->disconnect();
 
       # Connecting to the orig master, die if any database error happens
       my $orig_master_handler = new MHA::DBHelper();
       $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
         $orig_master_user, $orig_master_password, 1 );
 
       ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
       #$orig_master_handler->disable_log_bin_local();
       #print current_time_us() . " Drpping app user on the orig master..\n";
       #FIXME_xxx_drop_app_user($orig_master_handler);
 
       ## Waiting for N * 100 milliseconds so that current connections can exit
       my $time_until_read_only = 15;
       $_tstart = [gettimeofday];
       my @threads = get_threads_util( $orig_master_handler->{dbh},
         $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
             print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump .  "\n"
               foreach (@threads);
           }
         }
         sleep_until();
         $_tstart = [gettimeofday];
         $time_until_read_only--;
         @threads = get_threads_util( $orig_master_handler->{dbh},
           $orig_master_handler->{connection_id} );
       }
 
       ## Setting read_only=1 on the current master so that nobody(except SUPER) can write
       print current_time_us() .  " Set read_only=1 on the orig master.. " ;
       $orig_master_handler->enable_read_only();
       if  ( $orig_master_handler->is_read_only() ) {
         print  "ok.\n" ;
       }
       else  {
         die  "Failed!\n" ;
       }
 
       ## Waiting for M * 100 milliseconds so that current update queries can complete
       my $time_until_kill_threads = 5;
       @threads = get_threads_util( $orig_master_handler->{dbh},
         $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
             print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump .  "\n"
               foreach (@threads);
           }
         }
         sleep_until();
         $_tstart = [gettimeofday];
         $time_until_kill_threads--;
         @threads = get_threads_util( $orig_master_handler->{dbh},
           $orig_master_handler->{connection_id} );
       }
 
 
 
                 print  "Disabling the VIP on old master: $orig_master_host \n" ;
                 &stop_vip();    
 
 
       ## Terminating all threads
       print current_time_us() .  " Killing all application threads..\n" ;
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
       print current_time_us() .  " done.\n" ;
       #$orig_master_handler->enable_log_bin_local();
       $orig_master_handler->disconnect();
 
       ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
       $exit_code = 0;
     };
     if  ($@) {
       warn  "Got Error: $@\n" ;
       exit  $exit_code;
     }
     exit  $exit_code;
   }
  elsif ( $command eq "start" ) {
     ## Activating master ip on the new master
     # 1. Create app user with write privileges
     # 2. Moving backup script if needed
     # 3. Register new master's ip to the catalog database
 
# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
     my $exit_code = 10;
     eval  {
       my $new_master_handler = new MHA::DBHelper();
 
       # args: hostname, port, user, password, raise_error_or_not
       $new_master_handler->connect( $new_master_ip, $new_master_port,
         $new_master_user, $new_master_password, 1 );
 
       ## Set read_only=0 on the new master
       #$new_master_handler->disable_log_bin_local();
       print current_time_us() .  " Set read_only=0 on the new master.\n" ;
       $new_master_handler->disable_read_only();
 
       ## Creating an app user on the new master
       #print current_time_us() . " Creating app user on the new master..\n";
       #FIXME_xxx_create_app_user($new_master_handler);
       #$new_master_handler->enable_log_bin_local();
       $new_master_handler->disconnect();
 
       ## Update master ip on the catalog database, etc
                 print  "Enabling the VIP - $vip on the new master - $new_master_host \n" ;
                 &start_vip();
                 $exit_code = 0;
     };
     if  ($@) {
       warn  "Got Error: $@\n" ;
       exit  $exit_code;
     }
     exit  $exit_code;
   }
  elsif ( $command eq "status" ) {
 
     # do nothing
     exit  0;
   }
   else  {
     &usage();
     exit  1;
   }
}
 
# A simple system call that enable the VIP on the new master
sub start_vip() {
     ` ssh  $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
     ` ssh  $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}
 
sub usage {
   print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n" ;
   die;
}

10) Repairing the failed master node
(The Perl script below is a mail-alert script in the style of MHA's send_report sample — it sends a notification after a failover; the xxxx placeholders are the SMTP account details to fill in. Re-attaching the repaired master itself is sketched after the script.)

#!/usr/bin/perl
 
#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
 
## Note: This is a sample script and is not complete. Modify the script based on your environment.
 
use strict;
use warnings FATAL =>  'all' ;
use Mail::Sender;
use Getopt::Long;
 
#new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
my $smtp= 'smtp.163.com' ;
my $mail_from= 'xxxx' ;
my $mail_user= 'xxxxx' ;
my $mail_pass= 'xxxxx' ;
my $mail_to=[ 'xxxx' , 'xxxx' ];
GetOptions(
   'orig_master_host=s'  => \$dead_master_host,
   'new_master_host=s'   => \$new_master_host,
   'new_slave_hosts=s'   => \$new_slave_hosts,
   'subject=s'           => \$subject,
   'body=s'              => \$body,
);
 
mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body);
 
sub mailToContacts {
    my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_;
    open my $DEBUG, "> /tmp/monitormail.log"
        or die "Can't open the debug file:$!\n";
     my $sender = new Mail::Sender {
         ctype       =>  'text/plain; charset=utf-8' ,
         encoding    =>  'utf-8' ,
         smtp        => $smtp,
         from        => $mail_from,
         auth        =>  'LOGIN' ,
         TLS_allowed =>  '0' ,
         authid      => $user,
        authpwd     => $passwd,
         to          => $mail_to,
         subject     => $subject,
         debug       => $DEBUG
     };
 
     $sender->MailMsg(
         {   msg   => $msg,
             debug => $DEBUG
         }
     ) or print $Mail::Sender::Error;
     return  1;
}
 
 
 
# Do whatever you want here
 
exit  0;

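The article does not show how to bring the repaired master back into the cluster. A minimal sketch, assuming the new master is 182.48.115.237 and the repl account created earlier; the MASTER_LOG_FILE/MASTER_LOG_POS values are placeholders and must be taken from the CHANGE MASTER coordinates MHA printed in manager.log during the failover:

# On the repaired old master (182.48.115.236), after mysqld is running again.
# MASTER_LOG_FILE / MASTER_LOG_POS below are placeholders; use the coordinates
# that MHA printed in manager.log during the failover.
mysql -uroot -p123456 -e "
  CHANGE MASTER TO
    MASTER_HOST='182.48.115.237',
    MASTER_PORT=3306,
    MASTER_USER='repl',
    MASTER_PASSWORD='repl_1234',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=4;
  START SLAVE;"
mysql -uroot -p123456 -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running'

Once replication is healthy, the host can be added back to /etc/masterha/app1.cnf (the --remove_dead_master_conf option removed its entry) and MHA Manager restarted.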
The HA approach above provides a reasonable degree of database high availability; given the combined requirements of availability and data consistency, MHA is the recommended architecture.
