Failure caused by physically deleting binlog files on MySQL 5.5

Date: 2019-11-10

Symptoms:

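Shortly after noon, the master of a master-slave cluster ran out of memory during a release because huge pages had not been configured. The MySQL instance restarted and MHA failed over; the failover itself went normally. Afterwards the old master had to be reconfigured as a slave of the new master. I took the change master to .... statement from manager.log and ran it, but replication stayed stuck in the Connecting state. Naming: M1 is the server that hit the OOM, and S1 is the server that took over.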

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting to reconnect after a failed master event read
                  Master_Host: 10.3.171.40
                  Master_User: rep_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: centos-bin.000002
          Read_Master_Log_Pos: 107
               Relay_Log_File: relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: centos-bin.000002
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 107
              Relay_Log_Space: 107
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2017140

The error log, shown below, reports that the binlog file requested from the master cannot be found on S1:

160408 12:25:40 [Note] Slave I/O thread: connected to master 'rep_user@10.3.171.40:3306',replication started in log 'centos-bin.000002' at position 107
160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
160408 12:25:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
160408 12:25:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
160408 12:26:40 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'centos-bin.000002' at postion 107
160408 12:26:40 [ERROR] Error reading packet from server: File '/data2/mysql/centos-bin.000002' not found (Errcode: 2) ( server_errno=29)
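
When the error log is long, it can help to pull out just the paths the master reported as missing. The following is a minimal sketch, assuming the message format shown above; `missing_binlogs` is a helper name made up for this example.

```shell
#!/bin/sh
# Print each distinct path that appears in a "File '...' not found" message.
# Reads error-log lines from standard input.
missing_binlogs() {
    sed -n "s/.*File '\([^']*\)' not found.*/\1/p" | sort -u
}
```

It would be called as `missing_binlogs < error.log`, where `error.log` stands for wherever the slave's error log lives. Errcode 2 in these messages is the operating system's "no such file or directory" error.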

On S1, show master status; and show master logs; showed that business data was being written and that the position kept advancing. The odd part was that the 000001 file had a size of 0:

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 568661746 |
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 568941034 |
+-------------------+-----------+
2 rows in set (0.00 sec)

mysql> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| centos-bin.000001 |         0 |
| centos-bin.000002 | 569017617 |
+-------------------+-----------+
2 rows in set (0.00 sec)

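In the data directory, however, neither of these two files could be found. The replication error also says the file cannot be found.

The strange part at this point: the application writes to the database normally and show master status shows the position changing, yet there is no file on disk and replication cannot be set up.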

[root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]# find / -name centos-bin.000002
[root@GZ_NS_M5_SYNC_mysql_sync1-standby_171.40 ~]#

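#Reproducing the failure

1) Start the instance normally, enable binlog, and set up replication

2) Use rm to delete the master's binlog.index and binlog.0000X files

3) Keep writing data; the position keeps changing

4) The slave reports an error: binlog file not found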
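
The post does not spell out why writes keep succeeding in steps 2 and 3. A likely explanation, assuming mysqld already holds the binlog open when rm runs, is ordinary POSIX behaviour: rm only removes the directory entry, and the file stays writable through any descriptor that is still open. The sketch below shows this without MySQL; `demo-bin.000001` is a made-up file name, not a real binlog.

```shell
#!/bin/sh
# Sketch: writes through an open descriptor keep succeeding after rm.
demo_dir=$(mktemp -d)
demo_file="$demo_dir/demo-bin.000001"   # hypothetical name for the demo

exec 3>"$demo_file"     # hold the file open on fd 3, as a server holds its log
echo "event 1" >&3
rm "$demo_file"         # removes the directory entry; the inode lives on while fd 3 is open

# The write still succeeds even though no path leads to the file any more.
if echo "event 2" >&3; then write_ok=yes; else write_ok=no; fi
if [ -e "$demo_file" ]; then visible=yes; else visible=no; fi

exec 3>&-               # closing the last descriptor releases the data
rmdir "$demo_dir"

echo "write after rm succeeded: $write_ok"
echo "file visible in directory: $visible"
```

On Linux, `ls -l /proc/<pid>/fd` marks such descriptors with `(deleted)`, which would be one way to confirm the same situation on a live mysqld process.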

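#Why does this happen

Looking back, this failure most likely followed the same path as the reproduction above. The cluster was built 3 or 4 months ago. After replication was working, the binlog-related files on the standby were deleted; in fact the whole directory was deleted, a directory dedicated to binlogs and relay logs. When replication was set up again afterwards, change master to recreated the relay logs, but the binlogs were not recreated. Today MHA failed over, and the standby became the master and started accepting writes. The filename and pos that MHA reports come from running show master status on the standby, but those files had already been deleted, so replication failed.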
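
A check along the following lines, run on the standby before relying on it as a failover candidate, might have caught the problem earlier. This is only a sketch: `check_binlogs_on_disk` is a helper name invented here, and it simply tests whether each log name has a regular file in the given directory.

```shell
#!/bin/sh
# Usage: check_binlogs_on_disk DIR NAME...
# Prints every NAME that has no regular file in DIR.
# Returns 1 if at least one is missing, 0 otherwise.
check_binlogs_on_disk() {
    dir=$1
    shift
    status=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name"
            status=1
        fi
    done
    return $status
}
```

The names could be taken from the first column of show master logs, for example `check_binlogs_on_disk /data2/mysql $(mysql -N -e 'SHOW MASTER LOGS' | awk '{print $1}')`. The directory here is the one from the error log above; credentials and connection options are left out and would need to be supplied.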

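#Further experiments

1) After binlog.0001 has been created, rm both binlog.index and binlog.00001, then keep writing so that the position grows. What happens at the file switch once the size exceeds 1G?

Answer: when file 1 is full and a switch happens, there is no binlog.index, so the largest file ID cannot be determined and numbering starts from 1 again. Conclusion: it keeps writing to the 00001 file.

2) Keep the index file, delete 00001, and keep writing. What happens once the size exceeds 1G?

Answer: a 00002 file is created, and it is a normal binlog file that does land on disk.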

#How can the events be retrieved after today's failure?

In testing, with statement format the statements in the binlog can be retrieved with show binlog events in 'xxxx'. With row format, the actual SQL statements cannot be retrieved.
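
The MySQL statement for listing the events in a binlog is SHOW BINLOG EVENTS. As a small sketch, the helper below (`binlog_events_sql`, a name made up here) only builds the statement text. The file name and position 107 are taken from the output earlier in this post; the row limit of 20 is arbitrary.

```shell
#!/bin/sh
# Build a SHOW BINLOG EVENTS statement.
# Arguments: binlog file name, start position, maximum number of rows.
binlog_events_sql() {
    printf "SHOW BINLOG EVENTS IN '%s' FROM %s LIMIT %s;\n" "$1" "$2" "$3"
}

# Prints the statement; pipe it into the mysql client to run it.
binlog_events_sql centos-bin.000002 107 20
```

Running it for real would look like `binlog_events_sql centos-bin.000002 107 20 | mysql -u root -p`, with the user being an assumption. With statement format the Info column carries the SQL text, while with row format it only describes the row events, which matches the result above.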