MHA Failover (GTID, Auto_Position=0)

Date: 2020-06-03


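A recent case from a colleague: a database went down unexpectedly in the early hours, and the requirement was to set up MHA for failover on top of a one-master, two-slave topology. They hit some problems during deployment testing and came to me, and our discussion dug up several pitfalls I had overlooked before. Thanks to this colleague for sharing so generously!
• GTID environment: after killing the master, the new master and the slaves lose data (already known)
• Test whether MHA fails over correctly when the database process dies, the database server is shut down or rebooted, a firewall rule is enabled, or the network service is stopped (split-brain had not been considered before)
• Some production environments run GTID with Auto_Position=0; a failover turns them into GTID with Auto_Position=1 (not considered before)
• Review the failover flow (already reviewed before)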

1. GTID environment: killing the master loses data on the new master and the slaves

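The master or a Binlog Server must be configured in a [binlogN] section of the configuration file; only then can MHA recover the differential data from the dead master. Otherwise recovery only goes as far as the latest slave.
Aside: if [binlogN] points to a Binlog Server and the master is killed (kill -9 master_mysqld), does MHA fetch the differential binlog from the Binlog Server or from the dead master?
If [binlogN] points to the Binlog Server, MHA fetches from the Binlog Server; if it points to the dead master, MHA fetches from the dead master; if nothing is configured, the differential data is not recovered.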

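2. MHA failover tests

Test whether MHA fails over correctly when the database process dies, the database server is shut down or rebooted, a firewall rule is enabled, or the network service is stopped.
MySQL 5.7.21, one master and two slaves built on Row + GTID replication: Master132->{Slave133, Slave134}; the VIP is on 132 and mha-manager 0.56 runs on 134.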

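Test scenario	XX.132	XX.133	XX.134	Notes
132: kill -9 mysqld	unavailable	master	slave	MHA fails over normally; no data is lost
132: shut down or reboot the 132 server	unavailable	master	slave	MHA fails over normally; data may be lost
134: iptables -I INPUT -s XX.132 -j DROP	available	master	slave	MHA fails over normally; the old master is still reachable, 133 becomes the new master, and the VIP exists on both 132 and 133
132: service network stop / ifconfig eth0 down	unavailable	master	slave	MHA fails over normally; data may be lost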

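Note: the table shows the results with [binlogN] pointing to a Binlog Server and without secondary_check_script configured.
Why data may be lost when the database server is shut down: the Binlog Server receives binlogs asynchronously, so some binlog lag under high concurrency is to be expected.
Enabling the firewall rule simulates the master and mha-manager being unable to communicate, which produces split-brain. Add "secondary_check_script=masterha_secondary_check -s remote_host1 -s remote_host2" to the configuration file; remote_host1 and remote_host2 should, as far as possible, sit in different network segments from mha-manager and the MySQL servers.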

3. GTID with Auto_Position=0 becomes GTID with Auto_Position=1 after failover

3.1 Auto_Position

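Some production environments run GTID with Auto_Position=0; a failover turns them into GTID with Auto_Position=1.
• What is the risk
Suppose slave S1 has gaps in its GTIDs while slave S2's GTIDs are intact, and over time S2 purges the binlogs that correspond to S1's gaps. If a failover then selects S2 as the new master, running change master to S2 with master_auto_position=1 on S1 fails with: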

Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'


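GTID gaps on a slave can make the failover abnormal: the VIP moves correctly and the new master is usable, but replication from the new slave to the new master reports an error, and the subsequent steps (new master cleanup, Failover Report, send mail) only run once that replication error has been fixed.
• Why not simply switch to GTID with Auto_Position=1
Slave GTIDs < Master GTIDs: if the binlogs for the differing GTIDs have already been purged on the master, switching to Auto_Position=1 still fails.
Slave GTIDs > Master GTIDs: on 5.7, replication errors out immediately.
• How to fix it
Modify the source code: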

shell> vim /usr/share/perl5/vendor_perl/MHA/ServerManager.pm
1550     return 1 if ( $_->{use_gtid_auto_pos} );
--> change to
1550     #return 1 if ( $_->{use_gtid_auto_pos} );

shell> vim /usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm
367     if ( !$_server_manager->is_gtid_auto_pos_enabled() ) {
368       $log->info("GTID (with auto-pos) is not supported");
--> change to
367     if ( $_server_manager->is_gtid_auto_pos_enabled() !=1 ) {
368       $log->info("GTID (with auto-pos) is not supported");


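I do not understand why this modification works; thanks to Shunzi (another colleague) for sharing it.
Note: with traditional replication (gtid_mode=off), MHA does not use the Binlog Server to recover differential data (yet another pitfall).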

Binlog Server
Starting from MHA version 0.56, MHA supports new section [binlogN]. In binlog section, you can define mysqlbinlog streaming servers. When MHA does GTID based failover, MHA checks binlog servers, and if binlog servers are ahead of other slaves, MHA applies differential binlog events to the new master before recovery. When MHA does non-GTID based (traditional) failover, MHA ignores binlog servers.

3.2 When do GTID gaps appear

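1. Stop the slave threads on the slave -> write data on the master -> flush logs and purge logs on the master -> start the slave threads on the slave -> error: missing binary log
Manually execute change master_auto_position=0; change binlog file & pos;
2. Set up replication with change master_auto_position=0; -> stop the slave threads during replication -> change new_file & new_pos -> start the slave threads on the slave
With master_auto_position=1, when the slave connects to the master it sends the GTIDs in its Executed_Gtid_Set; the master skips those and sends the slave the GTIDs it has not yet executed.
Case 1: after change master_auto_position=1 again, the slave still looks for the purged binlogs and then raises an error.
Case 2: after change master_auto_position=1 again, as long as the corresponding binlogs on the master have not been purged, the GTID gaps are filled automatically.
Precondition: the master has performed no DML on the rows covered by the GTID gaps; otherwise replication would have failed long ago and this pitfall might never have surfaced. On reflection, though, the slave already had gaps and replication had not failed, which indirectly shows that the master had not performed DML on those rows.
Further reading: [MySQL FAQ] series: a case of handling a GTID replication exception in 5.6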
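
The two cases above come down to whether the GTIDs a slave is missing are still available in the candidate master's binlogs. As a rough illustration only, the Python sketch below parses GTID set strings (such as the values of Executed_Gtid_Set and gtid_purged), lists the gaps in a slave's set, and reports which missing GTIDs the master has already purged. The function names are hypothetical, the purged value is made up for the example, and the parser expands ranges into in-memory sets, so it suits small examples rather than production-sized sets.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-6:8-24,uuid2:5' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def find_gaps(executed):
    """Return {uuid: missing numbers} between the lowest and highest executed number."""
    gaps = {}
    for uuid, numbers in executed.items():
        if not numbers:
            continue
        missing = set(range(min(numbers), max(numbers) + 1)) - numbers
        if missing:
            gaps[uuid] = sorted(missing)
    return gaps


def purged_but_needed(slave_executed, master_purged):
    """Return GTIDs the master has purged that the slave has not executed."""
    needed = {}
    for uuid, purged in master_purged.items():
        missing = purged - slave_executed.get(uuid, set())
        if missing:
            needed[uuid] = sorted(missing)
    return needed


# Illustrative values: S1 has a gap at transaction 7, and the candidate
# master S2 is assumed to have purged transactions 1-10.
UUID = "90b30799-9215-11e7-8645-000c29c1025c"
s1_executed = parse_gtid_set(UUID + ":1-6:8-24")
s2_purged = parse_gtid_set(UUID + ":1-10")

print(find_gaps(s1_executed))
print(purged_but_needed(s1_executed, s2_purged))
```

A non-empty result from purged_but_needed corresponds to case 1 (the error 1236 "master has purged binary logs" situation); an empty result corresponds to case 2, where the gap can still be filled from the master's binlogs.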

3.3 How relay logs are fetched and applied

Slave GTIDs > Master GTIDs: how relay logs are fetched and applied

• GTID, auto_position=0
Master
Executed_Gtid_Set: 90b30799-9215-11e7-8645-000c29c1025c:1-14
Slave
set global Gtid_Purged='90b30799-9215-11e7-8645-000c29c1025c:1-6:8-24';
--> the master writes one row
Master
Executed_Gtid_Set: 90b30799-9215-11e7-8645-000c29c1025c:1-15
Slave
Retrieved_Gtid_Set: 90b30799-9215-11e7-8645-000c29c1025c:15
Executed_Gtid_Set: 90b30799-9215-11e7-8645-000c29c1025c:1-6:8-24


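The newly written binlog events are written to the slave's relay log but are not applied (this can be confirmed by checking the data and parsing the logs)!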

• GTID, auto_position=1
change master to master_auto_position=1;
Starting replication fails:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replica


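Fetching the relay log
Auto_Position=0: if automatic relay-log recovery is enabled, after a crash the slave re-fetches events from the master into the relay log, starting from the executed binlog position recorded in relay_log_info.
Auto_Position=1: when the slave connects to the master it sends the GTIDs in its Executed_Gtid_Set; the master skips those and sends the slave the GTIDs it has not yet executed. If the slave has more GTIDs than the master, 5.7 reports an error immediately, while 5.6 does not (if you have an environment, verify this yourself and also check whether anything is written to the relay log).
Applying the relay log
If the GTIDs in the relay log are already contained in Executed_Gtid_Set, they are not applied.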
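
To make the two rules above concrete, here is a small Python sketch that replays the example from this section. It is only an approximation of the described behavior, not the server's implementation: slave_is_ahead mimics the 5.7 check by comparing GTIDs under the master's server UUID, and gtids_to_apply returns the relay-log GTIDs that are not yet in Executed_Gtid_Set. Both function names are hypothetical, and ranges are expanded into in-memory sets for simplicity.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-6:8-24,uuid2:5' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def slave_is_ahead(slave_executed, master_executed, master_uuid):
    """True if the slave holds GTIDs under the master's UUID that the master lacks."""
    key = master_uuid.lower()
    return bool(slave_executed.get(key, set()) - master_executed.get(key, set()))


def gtids_to_apply(retrieved, executed):
    """Return relay-log GTIDs that are not already in Executed_Gtid_Set."""
    pending = {}
    for uuid, numbers in retrieved.items():
        new = numbers - executed.get(uuid, set())
        if new:
            pending[uuid] = sorted(new)
    return pending


# Values taken from the auto_position=0 example above.
UUID = "90b30799-9215-11e7-8645-000c29c1025c"
master_executed = parse_gtid_set(UUID + ":1-15")
slave_executed = parse_gtid_set(UUID + ":1-6:8-24")
slave_retrieved = parse_gtid_set(UUID + ":15")

print(slave_is_ahead(slave_executed, master_executed, UUID))  # True: the slave has 16-24
print(gtids_to_apply(slave_retrieved, slave_executed))        # {}: 15 is already executed
```

The empty result matches the observation that transaction 15 reaches the relay log but is never applied, and the True result matches the condition under which 5.7 rejects Auto_Position=1.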

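4. Failover flow

How MHA elects the new master and repairs the differential data when the master fails, under both traditional replication and GTID replication.
For the detailed flow, see: MHA manual failover flow (traditional replication & GTID replication)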