sql_slave_skip_counter,你真的用对了吗?

写在前面:html

    最近一段时间都在作传统主从复制相关的测试,思考了不少上线主从复制架构后,可能会发生的问题,而后针对性设置了这些故障,再而后思考如何在保证业务可用,或者对业务冲击尽量的小的前提下,进行故障的恢复,也算是小有所得,如今呢,笔者就主从复制故障时,须要跳过故障点时所用到的 sql_slave_skip_counte 变量进行记录描述mysql


sql_slave_skip_counter 介绍:sql

摘自MySQL官方的解释(强烈建议阅读英文原文。中文版,是笔者本身的理解,只能说仁者见仁)session

SET GLOBAL sql_slave_skip_counter Syntax:
        SET GLOBAL sql_slave_skip_counter = N
This statement skips the next N events from the master. This is useful for recovering from replication stops caused by a statement.
        跳过N个events。注意:以event为单位,而不是以事务为单位,只有在由单条语句组成的事务时,二者才等价。
        如:一个事务由多个EVENT组成,BEGIN;INSERT;UPDATE;DELETE;COMMOIT; 这种状况下,二者毫不相等
This statement is valid only when the slave threads are not running. Otherwise, it produces an error.

When using this statement, it is important to understand that the binary log is actually organized as a sequence of groups known as event groups. Each event group consists of a sequence of events.

For transactional tables, an event group corresponds to a transaction.
        对于事务表,一个event group对应一个事务
or nontransactional tables, an event group corresponds to a single SQL statement.
        对于非事务表,一个event group对应一条SQL
When you use SET GLOBAL sql_slave_skip_counter to skip events and the result is in the middle of a group, the slave continues to skip events until it reaches the end of the group. Execution then starts with the next event group
        当你跳过event的时候,若是N的值,处于event group之中,那么slave会继续跳过event,直至跳过这个event group,从下一个event group开始
架构


对于事务表使用sql_slave_skip_counter的状况:app


一、跳过1032复制错误(update/delete error)
ide

跳过由单条SQL组成的事务:工具


在Slave主机上人为的删除两条数据:测试

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '9';
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '11';
this


而Master在变动上述两条记录的时候会报错,致使复制中断:

INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('孙权', '吴国', '妹妹');
UPDATE `edusoho_e`.`t1` SET xname='游戏' WHERE id=7;
UPDATE `edusoho_e`.`t1` SET age=40 WHERE id=11;    #报错
DELETE FROM `edusoho_e`.`t1` WHERE age=40;            #报错
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('曹丕', '魏国', '甄姬');
DELETE FROM `edusoho_e`.`t1` WHERE id=1;
UPDATE `edusoho_e`.`t1` SET hobby='Games' WHERE id=3; 


在Slave查看主从复制状态时,就会发现报错信息:

mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 2176
Exec_Master_Log_Pos: 874
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 1127
Slave_IO_Running: Yes
Slave_SQL_Running: No


在Master主机上查看position作了什么操做:

mysql> show binlog events in 'mysql-bin.000002' from 874;
+------------------+------+-------------+-----------+-------------+---------------------------------+
| Log_name         | Pos  | Event_type  | Server_id | End_log_pos | Info                            |
+------------------+------+-------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 |  874 | Query       |         2 |         956 | BEGIN                           |
| mysql-bin.000002 |  956 | Table_map   |         2 |        1017 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1017 | Update_rows |         2 |        1127 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1127 | Xid         |         2 |        1158 | COMMIT /* xid=437 */            |
| mysql-bin.000002 | 1158 | Query       |         2 |        1240 | BEGIN                           |
| mysql-bin.000002 | 1240 | Table_map   |         2 |        1301 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1301 | Delete_rows |         2 |        1407 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1407 | Xid         |         2 |        1438 | COMMIT /* xid=446 */            |
| mysql-bin.000002 | 1438 | Query       |         2 |        1520 | BEGIN                           |
| mysql-bin.000002 | 1520 | Table_map   |         2 |        1581 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1581 | Write_rows  |         2 |        1644 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1644 | Xid         |         2 |        1675 | COMMIT /* xid=455 */            |
| mysql-bin.000002 | 1675 | Query       |         2 |        1757 | BEGIN                           |
| mysql-bin.000002 | 1757 | Table_map   |         2 |        1818 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 1818 | Delete_rows |         2 |        1880 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 1880 | Xid         |         2 |        1911 | COMMIT /* xid=464 */            |
| mysql-bin.000002 | 1911 | Query       |         2 |        1993 | BEGIN                           |
| mysql-bin.000002 | 1993 | Table_map   |         2 |        2054 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 2054 | Update_rows |         2 |        2145 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 2145 | Xid         |         2 |        2176 | COMMIT /* xid=473 */            |
+------------------+------+-------------+-----------+-------------+---------------------------------+


在Slave跳过第一个Update_rows event复制报错:

mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
Slave_IO_Running: Yes
Slave_SQL_Running: No
Exec_Master_Log_Pos: 1158
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 1407


成功跳过第一个events group


在Slave继续跳过第二个Delete_rows event复制报错:

mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_SQL_Errno: 0
Last_SQL_Error:


成功跳过第二个events group


注意:

虽然主从复制出现的故障成功跳过了,但只是暂时恢复了正常的主从复制状态,须要尽快的对Slave缺失的数据进行补齐,否则Master对Slave不存在的数据作的变动,仍然会重复致使主从复制故障,笔者以为若是你的数据量差别不是太大的话,能够考虑使用pt-table-checksumpt-table-sync工具进行恢复,若是你的数据量很大且数据差别不少,仍是建议重作Slave较好,由于使用工具会锁表,会对线上业务形成必定的影响,具体状况,请自行考量。


跳过由多条SQL(event)组成的事务:

在Slave主机上人为的删除一条数据:

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '7';


在Master主机上产生一个由多条SQL组成的事务:

BEGIN;
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '7';
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('懒死', '不知道', '吃了睡睡了吃');
COMMIT;


由于Slave主机上已经删除id=7的数据,在Slave查看主从复制状态时,就会发现报错信息:

mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 7219
Exec_Master_Log_Pos: 6840
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000002, end_log_pos 7049


在Master主机上查看position作了什么操做:

mysql> show binlog events in 'mysql-bin.000002' from 6840;
+------------------+------+-------------+-----------+-------------+---------------------------------+
| Log_name         | Pos  | Event_type  | Server_id | End_log_pos | Info                            |
+------------------+------+-------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 | 6840 | Query       |         2 |        6922 | BEGIN                           |
| mysql-bin.000002 | 6922 | Table_map   |         2 |        6983 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 6983 | Delete_rows |         2 |        7049 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7049 | Table_map   |         2 |        7110 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 7110 | Write_rows  |         2 |        7188 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7188 | Xid         |         2 |        7219 | COMMIT /* xid=825 */            |
+------------------+------+-------------+-----------+-------------+---------------------------------+


能够看到,这个事务是由两个SQL(event)组成的


若是使用 sql_slave_skip_counter=N 跳过由多条SQL组成的事务会怎样呢?

mysql> set global sql_slave_skip_counter=1;
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************
Read_Master_Log_Pos: 7219
Exec_Master_Log_Pos: 7219
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:


发现问题没有,在使用sql_slave_skip_counter跳过由多条SQL(event)组成的事务时,从在Master上执行的 show binlog events 能够看到,若是只是跳过出报错SQL语句,那么 Exec_Master_Log_Pos 值应该为7110,可是如今为7219,说明将整个event group跳过了,可是7110的SQL数据是咱们须要的,因此,和单条SQL组成的事务同样,主从复制状态虽然恢复,可是数据仍处于不一致状态,要抓紧时间补齐数据或重作Slave


二、由多条SQL(event)组成的事务时,仅跳过一个event,而不是一个event group:

在Slave主机上人为的删除一条数据:

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '17';


在Master主机上产生一个由多条SQL组成的事务:

BEGIN;
DELETE FROM `edusoho_e`.`t1` WHERE `id` = '17';
INSERT INTO `edusoho_e`.`t1` (`xname`, `address`, `hobby`) VALUES ('我是谁', '不知道', '吃了睡睡了吃');
COMMIT;


由于Slave主机上已经删除id=17的数据,在Slave查看主从复制状态时,就会发现报错信息:

Exec_Master_Log_Pos: 120

Slave_IO_Running: Yes
Slave_SQL_Running: No

Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table edusoho_e.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 341


在Master主机上查看position作了什么操做:

mysqlbinlog -v --base64-output=decode --start-position=120 mysql-bin.000004
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 120
#190507 13:52:05 server id 2  end_log_pos 202 CRC32 0x0ca0c280     Query    thread_id=3    exec_time=0    error_code=0
SET TIMESTAMP=1557208325/*!*/;
SET @@session.pseudo_thread_id=3/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 202
#190507 13:52:05 server id 2  end_log_pos 263 CRC32 0x20d2e89d     Table_map: `edusoho_e`.`t1` mapped to number 216
# at 263
#190507 13:52:05 server id 2  end_log_pos 341 CRC32 0xbec6fd45     Delete_rows: table id 216 flags: STMT_END_F
### DELETE FROM `edusoho_e`.`t1`
### WHERE
###   @1=17
###   @2='懒死'
###   @3='不知道'
###   @4=1
###   @5='吃了睡睡了吃'
###   @6=18

# at 341
#190507 13:52:05 server id 2  end_log_pos 402 CRC32 0xa37bc5c9     Table_map: `edusoho_e`.`t1` mapped to number 216
# at 402
#190507 13:52:05 server id 2  end_log_pos 483 CRC32 0x0d774707     Write_rows: table id 216 flags: STMT_END_F
### INSERT INTO `edusoho_e`.`t1`
### SET
###   @1=21
###   @2='我是谁'
###   @3='不知道'
###   @4=1
###   @5='吃了睡睡了吃'
###   @6=18
# at 483
#190507 13:52:05 server id 2  end_log_pos 514 CRC32 0x8c333b30     Xid = 411
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

能够看到,绿色的部分就是咱们须要跳过的,而第二个event是须要咱们保留的

这个时候,就须要用到slave_exec_mode这个变量了,至于slave_exec_mode详细介绍,仍是请参考MySQL官网资料


mysql> set global slave_exec_mode='IDEMPOTENT';
mysql> start slave sql_thread;
mysql> show slave status\G;
*************************** 1. row ***************************

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

Exec_Master_Log_Pos: 514


去Slave上edusoho_e.t1表上查看,数据id=21的数据已通过去了,此时,数据处于一致性状态


三、跳过主键冲突1062错误(Duplicate entry):

在Slave主键上先插入一条id值:

INSERT INTO `edusoho_e`.`t1` (`id`,`xname`, `address`, `hobby`, `age`) VALUES (19,'小玩子', '明朝', '皇后', '25');


由于Slave已经占用了Master要自动产生的主键值id=19,因此Slave主机会报错:

INSERT INTO `edusoho_e`.`t1` (`id`,`xname`, `address`, `hobby`, `age`) VALUES (19,'朱棣', '明朝', '皇帝', '36'); 


查看Slave主从复制状态发现已经发生了主从复制报错:

mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1062
Last_Error: Could not execute Write_rows event on table edusoho_e.t1; Duplicate entry '19' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000002, end_log_pos 7425
Exec_Master_Log_Pos: 7219


查看Master binlog:

mysql> show binlog events in 'mysql-bin.000002' from 7219;
+------------------+------+------------+-----------+-------------+---------------------------------+
| Log_name         | Pos  | Event_type | Server_id | End_log_pos | Info                            |
+------------------+------+------------+-----------+-------------+---------------------------------+
| mysql-bin.000002 | 7219 | Query      |         2 |        7301 | BEGIN                           |
| mysql-bin.000002 | 7301 | Table_map  |         2 |        7362 | table_id: 213 (edusoho_e.t1)    |
| mysql-bin.000002 | 7362 | Write_rows |         2 |        7425 | table_id: 213 flags: STMT_END_F |
| mysql-bin.000002 | 7425 | Xid        |         2 |        7456 | COMMIT /* xid=893 */            |
+------------------+------+------------+-----------+-------------+---------------------------------+


思考:

由于Slave这条数据已经存在,若是在Slave主机上把这条数据删除了,Slave会不会直接同步过来?(答案是:不会。须要重启Slave thread):

DELETE FROM `edusoho_e`.`t1` WHERE `id` = '19';
mysql> stop slave;
mysql> start slave user='repliter' password='123456';

验证的时候,发现数据已经同步过去了


题外:

以上是笔者对于单条SQL组成的事务、多条SQL组成的事务,及在这些单/多条SQL组成的事务下,人为设置的1032和1062复制错误和解决方法,还有sql_slave_skip_counter和slave_exec_mode各自的用法和跳过的范围,固然了,笔者呢,作的只是线上应用前的部署测试,并无通过任何的实战检测。一方面,仅为广大同行作个参考;另外一方面,记录笔者本身的心得和针对问题解决的思路作个总结,当问题真正发生的时候,有个方向能够进行参考,而不至于手忙脚乱,不知所措,因此,对其中有误之处和理解不到位的地方,望请下方留言指正,不胜感激!

还有,笔者作的,只是针对事务表,作的sql_slave_skip_counter和slave_exec_mode测试,对于非事务表,sql_slave_skip_counter和slave_exec_mode用途会稍有不一样,请自行百度吧

相关文章
相关标签/搜索