MMM结合Semisync机制实现Mysql Master-Master高可用

架构:node

两个Master(主备模式),一个或多个Slave(也能够没有Slave,只有主备Master):mysql

一、Monitor运行MMM Daemon程序,实现全部Mysql服务器的监控和故障切换工做;sql

二、Master1和Master2互为主备,同时只有一个主可用于写操做(也可同时分担读操做),另外一个做为备用,能够分担读操做,读写分离须要应用程序实现;数据库

三、Slave机器与当前active Master同步,如当前active Master故障后,Master将切换到passive Master,同时MMM修改Slave与新的Master同步;服务器

四、Application经过write和read ip进行读写操做;网络



环境:session

主机名 服务器IP地址 Write IP Read IP 备注
mysql01 10.0.60.100 10.0.60.160 10.0.60.161 默认为active Master,运行mmm agent
mysql02 10.0.60.101   10.0.60.162 默认为passive  Master,运行mmm agent
mysql03 10.0.60.102   10.0.60.163 Slave,由MMM维护,运行mmm agent
mysql04 10.0.60.103     监控机,运行MMM  Daemon程序


软件信息:架构

Mysql:5.6.17-log MySQL Community Server (GPL)并发

MMM:mysql-mmm-2.2.1app

OS:CentOS release 6.4 (Final),kernel 2.6.32-358.el6.x86_64


1、配置复制环境

这里是全新配置,若是是已经存在了单master和slave环境,将配置不同,能够结合xtrabackup工具实现数据的备份和恢复,配置主备master环境。

前提要求:

一、全部mysql实例开启read_only=1;

二、主备master须要开启log_bin;

三、全部mysql实例配置不一样的server_id以及不一样的二进制日志、relay日志文件名;


参考配置参数:

mysql01的特殊配置参数:

mysql01>\! grep -E "log_bin|server_id|read_only" my.cnf
log_bin = mysql01-bin
server_id = 1
read_only
mysql01>


mysql02的特殊配置参数:

mysql02>\! grep -E "log_bin|server_id|read_only" my.cnf
log_bin = mysql02-bin
server_id = 2
read_only
mysql02>


mysql03的特殊配置参数:

mysql03>\! grep -E "log_bin|server_id|read_only" my.cnf
log_bin = mysql_bin
server_id = 3
read_only


配置主从:

一、配置mysql01和mysql02互为主从:

在mysql01和mysql02上建立一样的复制帐号:

grant replication slave on *.* to 'repl'@'10.0.60.%' identified by 'repl';


查看master状态:

show master status;



在每一个节点执行CHANGE MASTER TO语句:

mysql01change master to master_host = '10.0.60.101'
master_user='repl'
master_password='repl'
master_log_file='mysql02-bin.000001'
master_log_pos=545;


mysql02> change master to master_host = '10.0.60.100'
master_user='repl'
master_password='repl'
master_log_file='mysql01-bin.000001'
master_log_pos=545;


在两个节点开启slave:

start slave;


查看slave状态是否正常:

mysql02>show slave status\G;
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes


二、配置mysql03为mysql01的从服务器

mysql03> change master to master_host = '10.0.60.100'
master_user='repl'
master_password='repl'
master_log_file='mysql01-bin.000001'
master_log_pos=545;


验证slave状态正常后,开始下面的步骤。


2、配置半同步

使用半同步机制,能够确保至少一台slave收到master的二进制日志,在必定程度上保证了数据的一致性,减小了当master当机时,形成数据丢失。

半同步机制由google贡献,从mysql 5.5开始原生支持该特性。


前提要求:

一、主备master都要安装并开启semisync master和slave,因mmm不能进行semisync配置和管理;

二、slave须要安装并开启semisync slave;


配置步骤:

一、mysql01和mysql02安装semisync master和slave插件:

mysql01>INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
Query OK, 0 rows affected (0.01 sec)

mysql01>INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
Query OK, 0 rows affected (0.00 sec)


mysql02>INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
Query OK, 0 rows affected (0.05 sec)

mysql02>INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
Query OK, 0 rows affected (0.01 sec)


二、mysql01和mysql02开启semisync master和slave:

mysql01>SET GLOBAL rpl_semi_sync_master_enabled = 1;
Query OK, 0 rows affected (0.00 sec)

mysql01>SET GLOBAL rpl_semi_sync_slave_enabled = 1;
Query OK, 0 rows affected (0.00 sec)


mysql02>SET GLOBAL rpl_semi_sync_master_enabled = 1;
Query OK, 0 rows affected (0.00 sec)

mysql02>SET GLOBAL rpl_semi_sync_slave_enabled = 1;
Query OK, 0 rows affected (0.00 sec)


同时将参数写入到配置文件,以mysql实例开启时自动开启半同步:

mysql02>\! cat my.cnf|grep semi
rpl_semi_sync_master_enabled = 1
rpl_semi_sync_slave_enabled = 1


三、mysql03安装并开启semisync slave插件:

mysql03>INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
Query OK, 0 rows affected (0.01 sec)

mysql03>SET GLOBAL rpl_semi_sync_slave_enabled = 1;
Query OK, 0 rows affected (0.00 sec)


同时将参数写入到配置文件,以mysql实例开启时自动开启半同步:

mysql03>\! cat my.cnf|grep semi
rpl_semi_sync_slave_enabled = 1
mysql03>


四、全部mysql实例中止slave并开启slave,使半同步机制生效:

stop slave;start slave;


五、查看semisync状态

mysql01>show status like '%emi%';
+--------------------------------------------+-------+
| Variable_name                              | Value |
+--------------------------------------------+-------+
| Rpl_semi_sync_master_clients               | 2     |
| Rpl_semi_sync_master_net_avg_wait_time     | 0     |
| Rpl_semi_sync_master_net_wait_time         | 0     |
| Rpl_semi_sync_master_net_waits             | 0     |
| Rpl_semi_sync_master_no_times              | 0     |
| Rpl_semi_sync_master_no_tx                 | 0     |
| Rpl_semi_sync_master_status                | ON    |
| Rpl_semi_sync_master_timefunc_failures     | 0     |
| Rpl_semi_sync_master_tx_avg_wait_time      | 0     |
| Rpl_semi_sync_master_tx_wait_time          | 0     |
| Rpl_semi_sync_master_tx_waits              | 0     |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0     |
| Rpl_semi_sync_master_wait_sessions         | 0     |
| Rpl_semi_sync_master_yes_tx                | 0     |
| Rpl_semi_sync_slave_status                 | ON    |
+--------------------------------------------+-------+
15 rows in set (0.00 sec)


3、配置MMM

Multi Master Replication Manager for Mysql(MMM)是一套开源的perl脚本,对Mysql Master-Master复制环境(在任什么时候刻只有一个节点可写)进行监控、故障恢复以及管理。同时能根据复制的延时状况管理读负载均衡,经过迁移read虚拟IP地址。同时也能用于数据备份,以及节点之间重同步。

主要由三个脚本组成:

一、mmm_mod:监控daemon程序,进行监控工做,并决定读、写角色的迁移;最好运行在专用的监控服务器上,能够管理多套Master-Slave集群。

二、mmm_agentd:客户端daemon程序,运行在全部mysql实例服务器,与监控节点进行简单的远程通讯。

三、mmm_control:用于管理mmm_mond进程的命令行脚本。


前提需求:

一、支持环境:

两个节点的Master-Master环境,MMM须要5个IP地址(每一个节点一个固定IP地址,一个write IP地址,两个read IP地址,write和read IP依据节点的可用性进行自动的迁移),正常状况下,active master有一个write IP和一个read IP地址,standby master有一个read IP地址,若是当前active master故障,write和read IP地址将迁移到standby master;

两个节点的Master-Master,以及一个或多个slave的环境,同时也是大多数企业使用的方案(能够更好的扩展读,同时有冗余的Slave可用于备份等工做,防止阻塞正常的事务)。

二、n+1个主机:n个运行mysql实例的服务器,一个机器用于运行MMM监控daemon程序;

三、2*(n+1) IP地址:每一个主机一个固定IP地址,同时每台mysql实例一个read IP地址以及一个write IP地址;

四、monitor数据库用户:须要REPLICATION CLIENT权限,用于MMM监控(mmm_mond);

五、agent数据库用户:须要SUPER、REPLICATION CLIENT、PROCESS权限,用于MMM 客户端(mmm_agentd),能够只针对本机IP进行受权;

六、relication数据库用户:须要REPLICATION SLAVE权限,用于mysql复制;

七、tools数据库用户:须要SUPER、REPLICATION CLIENT、RELOAD权限,用于MMM tools(如mmm_backup、mmm_clone、mmm_restore)


一、在mysql实例服务器安装依赖包和mmm

安装依赖包:

yum -y install perl iproute perl-Algorithm-Diff perl-DBI perl-Class-Singleton perl-DBD-MySQL perl-Log-Log4perl perl-Log-Dispatch perl-Proc-Daemon perl-MailTools perl-Time-HiRes perl-Mail-Sendmail perl-Mail-Sender perl-Email-Date-Format perl-MIME-Lite perl-Net-ARP


若是在标准软件仓库和EPEL软件仓库没有,须要单独下载,能够去如下网址下载:

http://rpm.pbone.net

http://search.cpan.org/

http://www.rpmfind.net/


安装mmm:

tar xvf mysql-mmm-2.2.1.tar.gz
cd mysql-mmm-2.2.1
make install


二、在mysql实例服务器配置mmm agent

mmm_agentd使用mmm_agent.conf配置文件:

# cat /etc/mysql-mmm/mmm_agent.conf   
include mmm_common.conf #包含这个公用配置文件
#Description: name of this host,能够不是主机名,每台mysql实例的host不一样(如mysql01设置为db1,mysql02设置为db2,mysql03设置为db3)
this db1
#Description: Enable debug mode,设置1,打印日志到前台,按ctrl+c将结束进程
debug 0
#Description: Maximum number of retries when killing threads to prevent further
#writes during the removal of the active_master_role.
max_kill_retries 10


公用配置文件:mmm_common.conf,每一个实例同样,并要拷贝到监控服务器供mmm_mond使用,进行网卡接口的定义,每一个主机的描述,复制和mmm agent的用户名和密码配置,以及读写规则等

# cat /etc/mysql-mmm/mmm_common.conf 
#Description: name of the role for which identifies the active master,定义活动master为可写
active_master_role                              writer

<host default> #默认段
        #Description: network interface on which the IPs of the roles should be configured,用于绑定ip的网络接口
        cluster_interface                       eth0

        pid_path                                /var/run/mmm_agentd.pid
        bin_path                                /usr/lib/mysql-mmm/
        #Description: Port on which mmm agentd listens
        agent_port                              9989
        #Description: Port on which mysqld is listening
        mysql_port                              3306

        #Description: mysql user used for replication
        replication_user                        repl
        #Description: mysql password used for replication
        replication_password                    repl
        #Description: mysql user for MMM Agent
        agent_user                              mmm_agent
        #Description: mysql password for MMM Agent
        agent_password                          mmm_agent
</host>

<host db1> #命名段,指定每一个mysql实例主机
        #Description: IP of host
        ip                                      10.0.60.100
        #Description: Mode of host. Either master or slave.
        mode                                    master
        #Description: Name of peer host (if mode is master)
        peer                                    db2
</host>

<host db2>
        ip                                      10.0.60.101
        mode                                    master
        peer                                    db1
</host>

<host db3>
        ip                                      10.0.60.102
        mode                                    slave
</host>


<role writer> #定义write角色
        #Description: Hosts which may take over the role
        hosts                                   db1, db2
        #Description: One or multiple IPs associated with the role,指定浮动write IP地址
        ips                                     10.0.60.160
        #Description: Mode of role. Either balanced or exclusive
        mode                                    exclusive
        #Description: The preferred host for this role. Only allowed for exclusive roles.
        #prefer                                 -
</role>

<role reader>  #定义read角色
        hosts                                   db1, db2, db3
        ips                                     10.0.60.161,10.0.60.162,10.0.60.163 #浮动read IP地址
        mode                                    balanced
</role>


三、启动mmm agent服务

/etc/init.d/mysql-mmm-agent start
chkconfig --level 2345 mysql-mmm-agent on


四、在监控服务器(mysql04)安装依赖包和mmm

安装依赖包:

yum -y install perl iproute perl-Algorithm-Diff perl-DBI perl-Class-Singleton perl-DBD-MySQL perl-Log-Log4perl perl-Log-Dispatch perl-Proc-Daemon perl-MailTools perl-Time-HiRes perl-Mail-Sendmail perl-Mail-Sender perl-Email-Date-Format perl-MIME-Lite perl-Net-Ping


安装mmm:

tar xvf mysql-mmm-2.2.1.tar.gz
cd mysql-mmm-2.2.1
make install


五、配置mmm 监控配置文件

mmm_mond和mmm_control使用mmm_mon.conf或mmm_mon_CLUSTER.conf配置文件

mmm_mon.conf配置文件参考:

# cat /etc/mysql-mmm/mmm_mon.conf 
include mmm_common.conf

#The monitor section is required by mmm_mond and mmm_control
<monitor>
        #Description: IP on which mmm_mond listens
        ip                                              127.0.0.1
        #Description: Port on which mmm mond listens
        port                                    9988
        #Description: Location of pid-file
        pid_path                                /var/run/mmm_mond.pid
        #Description: Path to directory containing MMM binaries
        bin_path                                /usr/lib/mysql-mmm/
        #Description: Location of of status file
        status_path                             /var/lib/misc/mmm_mond.status
        #Description: Break between network checks
        ping_interval                           1
        #Description: IPs used for network checks,指定全部mysql服务器IP,write和read IP地址,用于进行ping检查
        ping_ips                                10.0.60.100, 10.0.60.101, 10.0.60.102, 10.0.60.160, 10.0.60.161, 10.0.60.162, 10.0.60.163
        #Description: Duration in seconds for flap detection. See flap_count
        flap_duration                           3600
        #Description: Maximum number of downtimes within flap_duration seconds after
        #which a host is considered to be flapping.
        flap_count                              3
        #Description: How many seconds to wait before switching node status from
        #AWAITING_RECOVERY to ONLINE. 0 = disabled.
        auto_set_online                         0
        #Description: Binary used to kill hosts if roles couldn’t be removed because the agent
        #was not reachable. You have to provide a custom binary for this which
        #takes the hostname as first argument and the state of check ping (1 -ok; 0 - not ok) as second argument.
        kill_host_bin                           /usr/lib/mysql-mmm/monitor/kill_host
        #Description: Startup carefully i.e. switch into passive mode when writer role is
        #configured on multiple hosts
        careful_startup                         0
        #Description: Default mode of monitor.
        mode                                    active
        #Description: How many seconds to wait for other master to become ONLINE before
        #switching from mode WAIT to mode ACTIVE. 0 = infinite.
        wait_for_other_master                   120
</monitor>

<host default>
        #Description: mysql user for MMM Monitor
        monitor_user                    mmm_agent
        #Description: mysql password for MMM Monitor
        monitor_password                mmm_agent
</host>

<check  mysql> #check段,mmm执行ping、mysql、rep_threads、rep_backlog四种检查,能够分别进行检查间隔等参数配置。
        #Description: Perform check every 5 seconds
        check_period                    5
        #Description: Check is considered as failed if it doesn’t succeed for at least
        #trap period seconds.
        trap_period                     10
        #Description: Check times out after timeout seconds
        timeout                         2
        #Description: Restart checker process after restart after checks
        restart_after                   10000
        #Description: Maximum backlog for check rep_backlog.
        max_backlog                     60
</check>

#设置为1,开启调试模式,打印日志到前台,ctrl+c将结束进程,对于调试有帮助
debug 0


六、开启mmm monitor监控

/etc/init.d/mysql-mmm-monitor start
chkconfig --level 2345 mysql-mmm-monitor on


七、使用mmm_control查看状态

# mmm_control show
  db1(10.0.60.100) master/ONLINE. Roles: reader(10.0.60.163), writer(10.0.60.160)
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.161)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.162)


注:

当节点第一次开启,状态为等待恢复。

设置节点online:

# mmm_control set_online db1


MMM如何工做:

当故障发生时,mmm迅速的迁移故障节点的IP地址从一个节点到另外一个节点,并使用Net::ARP Perl模块更新ARP表

处理过程:

在故障active master节点:

一、mysql 设置为read_only(set global read_only=1),防止写事务;

二、中断活动链接;

三、移除写ip;


在新master节点:

一、运行在passive master的mmm进程被通知即将成为active write;

二、slave将尝试从master的二进制日志抓取任何剩余事务;

三、关闭read_only(set global read_only=0);

四、绑定write ip,并发生arp通告;


4、测试

一、测试mysql01 mysql实例故障

手动关闭mysql01服务器上的mysql实例,指望master将迁移到mysql02

中止mysql01的mysqld进程:也可使用"killall -15 mysqld"结束mysqld进程

mysql01>\! sh stop.sh


查看mmm_mond的日志:总共通过10s时间完成迁移

# tail -f /var/log/mysql-mmm/mmm_mond.log 
2014/05/27 14:19:13  WARN Check 'rep_backlog' on 'db1' is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.60.100:3306, user = mmm_agent)! Lost connection to MySQL server at 'reading initial communication packet', system error: 111
2014/05/27 14:19:13  WARN Check 'rep_threads' on 'db1' is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.60.100:3306, user = mmm_agent)! Lost connection to MySQL server at 'reading initial communication packet', system error: 111
2014/05/27 14:19:22 ERROR Check 'mysql' on 'db1' has failed for 10 seconds! Message: ERROR: Connect error (host = 10.0.60.100:3306, user = mmm_agent)! Lost connection to MySQL server at 'reading initial communication packet', system error: 111
2014/05/27 14:19:23 FATAL State of host 'db1' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK)
2014/05/27 14:19:23  INFO Removing all roles from host 'db1':
2014/05/27 14:19:23  INFO     Removed role 'reader(10.0.60.163)' from host 'db1'
2014/05/27 14:19:23  INFO     Removed role 'writer(10.0.60.160)' from host 'db1'
2014/05/27 14:19:23  INFO Orphaned role 'writer(10.0.60.160)' has been assigned to 'db2' #能够看到写IP已经迁移到mysql02
2014/05/27 14:19:23  INFO Orphaned role 'reader(10.0.60.163)' has been assigned to 'db3'


使用mmm_control命令查看状态:

[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/HARD_OFFLINE. Roles: #mysql01状态已经变为HARD_OFFLINE,意外着ping错误或mysql故障
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.161), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.162), reader(10.0.60.163)


检查mysql02的read_only变量是否改变:

mysql02>show global variables like 'read_only'; #默认在passive master时,read_only为ON
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | OFF   |
+---------------+-------+
1 row in set (0.00 sec)

mysql02>


检查mysql03是否已经将mysql02做为主:

mysql03>show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.60.101 #已经从Mysql02同步
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql02-bin.000014
          Read_Master_Log_Pos: 120
               Relay_Log_File: mysql03-relay-bin.000002
                Relay_Log_Pos: 285
        Relay_Master_Log_File: mysql02-bin.000014
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes


当再次启动mysql01的mysql实例,db1的状态将由HARD_OFFLINE改变为AWAITING_RECOVERY:

[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/AWAITING_RECOVERY. Roles: 
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.161), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.162), reader(10.0.60.163)


须要手动设置为online,mmm才会分配read ip给mysql01,并与mysql02同步:

[root@mysql04 ~]# mmm_control set_online db1
OK: State of 'db1' changed to ONLINE. Now you can wait some time and check its new roles!


mysql01>show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.60.101 #mysql01已经从Mysql02同步
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql02-bin.000015
          Read_Master_Log_Pos: 120
               Relay_Log_File: mysql01-relay.000027
                Relay_Log_Pos: 285
        Relay_Master_Log_File: mysql02-bin.000015
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes


查看mysql02的semisync状态:

mysql02>show status like 'Rpl_semi%';
+--------------------------------------------+-------+
| Variable_name                              | Value |
+--------------------------------------------+-------+
| Rpl_semi_sync_master_clients               | 2     |
| Rpl_semi_sync_master_net_avg_wait_time     | 1053  |
| Rpl_semi_sync_master_net_wait_time         | 2106  |
| Rpl_semi_sync_master_net_waits             | 2     |
| Rpl_semi_sync_master_no_times              | 0     |
| Rpl_semi_sync_master_no_tx                 | 0     |
| Rpl_semi_sync_master_status                | ON    |
| Rpl_semi_sync_master_timefunc_failures     | 0     |
| Rpl_semi_sync_master_tx_avg_wait_time      | 1015  |
| Rpl_semi_sync_master_tx_wait_time          | 1015  |
| Rpl_semi_sync_master_tx_waits              | 1     |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0     |
| Rpl_semi_sync_master_wait_sessions         | 0     |
| Rpl_semi_sync_master_yes_tx                | 1     |
| Rpl_semi_sync_slave_status                 | ON    |
+--------------------------------------------+-------+
15 rows in set (0.00 sec)


二、模拟mysql02(Active Master服务器) kernel panic,指望进行迁移

执行上面的测试后,当前active master为mysql02,使用下面命令模拟kernel panic:

mysql02>\! echo "c" > /proc/sysrq-trigger 


查看mmm_mond日志:总共通过了20s的时间完成迁移

# tail -f /var/log/mysql-mmm/mmm_mond.log
2014/05/27 14:44:42  WARN Check 'rep_threads' on 'db2' is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.60.101:3306, user = mmm_agent)! Can't connect to MySQL server on '10.0.60.101' (4)
2014/05/27 14:44:42  WARN Check '
rep_backlog' on 'db2' is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.60.101:3306, user = mmm_agent)! Can't connect to MySQL server on '10.0.60.101' (4)
2014/05/27 14:44:46 FATAL Can't reach agent on host 'db2'
2014/05/27 14:44:49 ERROR Check '
ping' on 'db2' has failed for 11 seconds! Message: ERROR: Could not ping 10.0.60.101 #ping检查错误
2014/05/27 14:44:55 ERROR Check '
mysql' on 'db2' has failed for 14 seconds! Message: ERROR: Connect error (host = 10.0.60.101:3306, user = mmm_agent)! Can't connect to MySQL server on '10.0.60.101' (4) #mysql检查错误,不能链接
2014/05/27 14:44:59  INFO Check 'ping' on 'db2' is ok!
2014/05/27 14:45:02 FATAL State of host 'db2' changed from ONLINE to HARD_OFFLINE (ping: OK, mysql: not OK) #改变mysql02的状态为HARD_OFFLINE
2014/05/27 14:45:02  INFO Removing all roles from host 'db2':  #移除mysql02的角色
2014/05/27 14:45:02  INFO     Removed role 'reader(10.0.60.161)' from host 'db2'
2014/05/27 14:45:02  INFO     Removed role 'writer(10.0.60.160)' from host 'db2'
2014/05/27 14:45:02 FATAL Agent on host 'db2' is reachable again
2014/05/27 14:45:02  INFO Orphaned role 'writer(10.0.60.160)' has been assigned to 'db1' #分配角色到其余机器,write IP分配到mysql01,永远不会分配到mysql03
2014/05/27 14:45:02  INFO Orphaned role 'reader(10.0.60.161)' has been assigned to 'db3'


使用mmm_control查看状态:

[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/ONLINE. Roles: reader(10.0.60.163)
  db2(10.0.60.101) master/HARD_OFFLINE. Roles: 
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.162)


三、模拟active master服务器网络不通,指望进行迁移,可是网络恢复后,将不会重启slave;

当前active master为mysql01,在mysql01上禁用网卡:

mysql01>\! cat down_net.sh
ifdown eth0
sleep 600
ifup eth0
mysql01>\! sh down_net.sh


查看mmm_mond日志:总共通过了9s完成迁移

2014/05/27 15:02:35  WARN Check 'rep_threads' on 'db1' is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.60.100:3306, user = mmm_agent)! Can't connect to MySQL server on '10.0.60.100' (4)
2014/05/27 15:02:35  WARN Check '
rep_backlog' on 'db1' is in unknown state! Message: UNKNOWN: Connect error (host = 10.0.60.100:3306, user = mmm_agent)! Can't connect to MySQL server on '10.0.60.100' (4)
2014/05/27 15:02:41 FATAL Can't reach agent on host 'db1'
2014/05/27 15:02:41 ERROR Check '
ping' on 'db1' has failed for 11 seconds! Message: ERROR: Could not ping 10.0.60.100  #ping检查错误
2014/05/27 15:02:44 FATAL State of host '
db1' changed from ONLINE to HARD_OFFLINE (ping: not OK, mysql: OK) #改变状态
2014/05/27 15:02:44  INFO Removing all roles from host '
db1': #移除角色
2014/05/27 15:02:44  INFO     Removed role '
reader(10.0.60.163)' from host 'db1'  
2014/05/27 15:02:44  INFO     Removed role '
writer(10.0.60.160)' from host 'db1'
2014/05/27 15:02:44 ERROR Can'
t send offline status notification to 'db1' - killing it!
2014/05/27 15:02:44 FATAL Could not kill host 'db1' - there may be some duplicate ips now! (There's no binary configured for killing hosts.)
2014/05/27 15:02:44  INFO Orphaned role '
writer(10.0.60.160)' has been assigned to 'db2'
2014/05/27 15:02:44  INFO Orphaned role '
reader(10.0.60.163)' has been assigned to 'db3'
2014/05/27 15:02:48 ERROR Check '
mysql' on 'db1' has failed for 14 seconds! Message: ERROR: Connect error (host = 10.0.60.100:3306, user = mmm_agent)! Can't connect to MySQL server on '10.0.60.100' (4)


使用mmm_control查看状态:

[root@mysql04 ~]# mmm_control show
# Warning: agent on host db1 is not reachable
  db1(10.0.60.100) master/HARD_OFFLINE. Roles: 
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.161), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.162), reader(10.0.60.163)


当网络恢复后,mmm会修改mysql01的slave配置,修改主为mysql02,可是没有重启slave,形成不能进行数据同步,须要手工从新开启slave。

使用mmm_control检查状态:

[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/AWAITING_RECOVERY. Roles: 
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.161), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.162), reader(10.0.60.163)

[root@mysql04 ~]# mmm_control set_online db1  #网络恢复后,手动设置为online
OK: State of 'db1' changed to ONLINE. Now you can wait some time and check its new roles!
[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/ONLINE. Roles: reader(10.0.60.163)
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.161), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.162)


检查mysql01的slave状态:看上去正常的

mysql01>show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.60.101
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql02-bin.000015
          Read_Master_Log_Pos: 328
               Relay_Log_File: mysql01-relay.000027
                Relay_Log_Pos: 493
        Relay_Master_Log_File: mysql02-bin.000015
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes


可是在mysql02(当前active master)插入数据,mysql01不能从mysql02同步:

mysql02> insert into t1 values(2);
Query OK, 1 row affected (0.02 sec)
mysql02>select * from t1;
+----+
| id |
+----+
|  1 |
|  2 |
+----+
2 rows in set (0.00 sec)


mysql03已经同步了数据:

mysql03>select * from t1;
+----+
| id |
+----+
|  1 |
|  2 |
+----+
2 rows in set (0.00 sec)


而mysql01没有同步数据:

mysql01>select * from t1;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.00 sec)


解决方法:先中止slave,而后启动slave;

mysql01>stop slave;start slave; 
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

mysql01>select * from t1;
+----+
| id |
+----+
|  1 |
|  2 |
+----+
2 rows in set (0.00 sec)


截图:


四、模拟slave线程故障,不论是io或sql线程故障,指望进行迁移,恢复时若是在flap_duration时间内超过了flap_count次数的故障,将不会自动恢复,状态由REPLICATION_FAIL改成 AWAITING_RECOVERY (because it's flapping)

当前active master为mysql02。

中止active master(mysql02)的slave,不会形成迁移:

mmm_mond的日志:已经检测到db2(mysql02)复制线程错误

2014/05/27 15:39:02 ERROR Check 'rep_threads' on 'db2' has failed for 10 seconds! Message: ERROR: Replication is broken


使用mmm_control查看状态:

[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/ONLINE. Roles: reader(10.0.60.161)
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.162), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.163)


若是当其余slave(mysql0一、mysql03)的线程(不论是io仍是sql线程)故障将会发生迁移:

手工中止io线程:

mysql01>stop slave io_thread;
Query OK, 0 rows affected (0.01 sec)


查看mmm_mond日志:

2014/05/27 15:43:28 ERROR Check 'rep_threads' on 'db1' has failed for 10 seconds! Message: ERROR: Replication is broken
2014/05/27 15:43:31 FATAL State of host 'db1' changed from ONLINE to REPLICATION_FAIL
2014/05/27 15:43:31  INFO Removing all roles from host 'db1':
2014/05/27 15:43:31  INFO     Removed role 'reader(10.0.60.161)' from host 'db1'  #移除角色
2014/05/27 15:43:31  INFO Orphaned role 'reader(10.0.60.161)' has been assigned to 'db3'


使用mmm_control查看状态:

[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/REPLICATION_FAIL. Roles: 
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.162), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.161), reader(10.0.60.163)


当从新开启io线程后,mmm将自动恢复db1,并从新迁移read IP到db1(mysql01)上,若是故障超过:

从新开启线程:

mysql01>start slave io_thread;  
Query OK, 0 rows affected (0.00 sec)


查看mmm_mond日志:

2014/05/27 15:45:23  INFO Check 'rep_threads' on 'db1' is ok!
2014/05/27 15:45:25 FATAL State of host 'db1' changed from REPLICATION_FAIL to ONLINE
2014/05/27 15:45:25  INFO Moving role 'reader(10.0.60.163)' from host 'db3' to host 'db1'


使用mmm_control查看状态:

[root@mysql04 ~]# mmm_control show
  db1(10.0.60.100) master/ONLINE. Roles: reader(10.0.60.163)
  db2(10.0.60.101) master/ONLINE. Roles: reader(10.0.60.162), writer(10.0.60.160)
  db3(10.0.60.102) slave/ONLINE. Roles: reader(10.0.60.161)


五、复制延时

延时检查有max_backlog控制,默认为60;

复制延时或错误,若是故障时间少于60s,状态为ONLINE,单会迁移,故障恢复后,mmm自动恢复read IP。若是rep_backlog和rel_threads同时错误,状态将为REPLICATION_FAIL。


六、mmm agent或monitor故障

不会迁移角色,若是此时有master或slave故障,也将不能迁移角色


参考:

MMM官网:http://mysql-mmm.org/

MMM博客:http://blog.mysql-mmm.org/





来自为知笔记(Wiz)

相关文章
相关标签/搜索