[MySQL] Think enabling parallel replication is enough to cut replication lag? You won't see this one coming!

Date: 2019-12-01

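With the official MySQL releases, high availability is usually achieved through master-slave replication, although there are plenty of other approaches. The case we deal with today is a replica that still lags even though replication is GTID-based and multi-threaded (parallel) replication is enabled. Worse, the lag keeps growing!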

Environment overview

No.	Item	Description
1	OS	Redhat 6.x (4 cores, 32 GB RAM)
2	Database	MySQL-5.7.25
3	Replication mode	GTID-based master-slave replication

Checking the environment

1) Important parameters already configured:

# relay for slave
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 6
master_info_repository = TABLE
relay_log_info_repository = TABLE
relay_log_recovery = on
sync_relay_log = 10000

Note: slave_preserve_commit_order was not set at this point.

2) Checking the replica's lag status

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: xxx.xxx.xxx.xxx
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.008978
          Read_Master_Log_Pos: 696914605
               Relay_Log_File: DB41-relay-bin.001259
                Relay_Log_Pos: 207377582
        Relay_Master_Log_File: mysql-bin.008970
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: neteagle3
          Replicate_Ignore_DB: mysql
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1068770059
              Relay_Log_Space: 8425484286
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 187358
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 42
                  Master_UUID: eab7fcac-3cda-11e6-ada8-fa163e648db2
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:58031191-59927276
            Executed_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:1-58080239:58080241
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

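A brief explanation of a few of these fields:

Master_Log_File
Read_Master_Log_Pos
Seconds_Behind_Master
Relay_Log_File
Relay_Log_Pos
Relay_Master_Log_File
Exec_Master_Log_Pos

Master_Log_File, Read_Master_Log_Pos: these two form a pair. They give the master binlog file and the position within it that the replica's I/O thread has read up to.

Relay_Log_File, Relay_Log_Pos: these two also form a pair. They give the relay log file and the position within it that the replica's SQL thread has applied up to.

Relay_Master_Log_File, Exec_Master_Log_Pos: these two also form a pair. They give the master binlog file and position that correspond to the relay log position in the previous item (a little convoluted).

Seconds_Behind_Master: this can be read simply as the replication lag, in seconds.

In the replica status above, Seconds_Behind_Master: 187358 means that the replica's SQL apply is 187358 seconds behind the master. Converted to days (187358 / 86400), that is a little over two days. In other words, the data on the replica is what the master had two days ago.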
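The arithmetic behind that conclusion can be sketched in a few lines of Python. This is only an illustration: the values are copied from the status output above, and the helper names (`lag_in_days`, `binlog_index`, `last_gtid_number`) are made up for this example. The GTID calculation assumes a single-UUID set, which is what the output shows.

```python
SECONDS_PER_DAY = 86_400


def lag_in_days(seconds_behind_master: int) -> float:
    """Convert Seconds_Behind_Master into days."""
    return seconds_behind_master / SECONDS_PER_DAY


def binlog_index(file_name: str) -> int:
    """Return the numeric suffix of a binlog name, e.g. 8978 for mysql-bin.008978."""
    return int(file_name.rsplit(".", 1)[1])


def last_gtid_number(gtid_set: str) -> int:
    """Return the highest transaction number in a single-UUID GTID set."""
    # The set looks like "uuid:1-58080239:58080241": the UUID first,
    # then one or more intervals, all separated by ':'.
    intervals = gtid_set.split(":")[1:]
    return max(int(part.split("-")[-1]) for part in intervals)


# Values copied from the SHOW SLAVE STATUS output above.
seconds_behind = 187358
master_log_file = "mysql-bin.008978"        # Master_Log_File (read by the I/O thread)
relay_master_log_file = "mysql-bin.008970"  # Relay_Master_Log_File (being applied)
retrieved = "eab7fcac-3cda-11e6-ada8-fa163e648db2:58031191-59927276"
executed = "eab7fcac-3cda-11e6-ada8-fa163e648db2:1-58080239:58080241"

files_behind = binlog_index(master_log_file) - binlog_index(relay_master_log_file)
txn_backlog = last_gtid_number(retrieved) - last_gtid_number(executed)

print(f"lag: {lag_in_days(seconds_behind):.2f} days")
print(f"binlog files behind: {files_behind}")
print(f"transactions received but not yet applied (approx.): {txn_backlog}")
```

For these values the lag comes to about 2.17 days, the SQL thread is 8 binlog files behind the I/O thread, and roughly 1.85 million transactions are waiting in the relay logs. The hole in Executed_Gtid_Set (58080240 is missing) means transactions are being committed out of order, which is possible when slave_preserve_commit_order is not set.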

3) Verifying that parallel replication is running

mysql> show full processlist;
+----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+
| Id | User        | Host         | db        | Command          | Time   | State                                                         | Info                  |
+----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+
|  1 | system user |              | NULL      | Connect          |  18204 | Waiting for master to send event                              | NULL                  |
|  2 | system user |              | NULL      | Connect          |      0 | Waiting for dependent transaction to commit                   | NULL                  |
|  3 | system user |              | NULL      | Connect          | 154914 | System lock                                                   | NULL                  |
|  4 | system user |              | NULL      | Connect          | 154914 | Waiting for an event from Coordinator                         | NULL                  |
|  5 | system user |              | NULL      | Connect          | 154918 | Waiting for an event from Coordinator                         | NULL                  |
|  6 | system user |              | NULL      | Connect          | 155525 | Waiting for an event from Coordinator                         | NULL                  |
|  7 | system user |              | NULL      | Connect          | 180427 | Waiting for an event from Coordinator                         | NULL                  |
|  8 | system user |              | NULL      | Connect          |  18204 | Waiting for an event from Coordinator                         | NULL                  |
| 10 | root        | localhost    | neteagle3 | Query            |      0 | starting                                                      | show full processlist |
| 11 | repl        | DBSlave:9683 | NULL      | Binlog Dump GTID |  18156 | Master has sent all binlog to slave; waiting for more updates | NULL                  |
| 13 | root        | localhost    | neteagle3 | Sleep            |   4962 |                                                               | NULL                  |
+----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+

mysql> select * from performance_schema.replication_applier_status_by_worker;
+--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+
| CHANNEL_NAME | WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION                         | LAST_ERROR_NUMBER | LAST_ERROR_MESSAGE | LAST_ERROR_TIMESTAMP |
+--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+
|              |         1 |        51 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240805 |                 0 |                    | 0000-00-00 00:00:00  |
|              |         2 |        52 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240210 |                 0 |                    | 0000-00-00 00:00:00  |
|              |         3 |        53 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80235089 |                 0 |                    | 0000-00-00 00:00:00  |
|              |         4 |        54 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:80191268 |                 0 |                    | 0000-00-00 00:00:00  |
|              |         5 |        55 | ON            | eab7fcac-3cda-11e6-ada8-fa163e648db2:75296683 |                 0 |                    | 0000-00-00 00:00:00  |
|              |         6 |        56 | ON            |                                               |                 0 |                    | 0000-00-00 00:00:00  |
+--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+
6 rows in set (0.00 sec)

This query shows that six parallel workers have been started for replication.
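Six workers being started does not by itself tell us how many of them are doing work at a given moment. As a rough check, the sketch below counts idle and busy workers from the State column of the processlist output above. The thread list is copied from that output; `summarize_workers` is a made-up helper, and classifying threads purely by their state string is a heuristic, not an official rule.

```python
# (Id, State) pairs for the "system user" threads in the processlist above.
replication_threads = [
    (1, "Waiting for master to send event"),
    (2, "Waiting for dependent transaction to commit"),
    (3, "System lock"),
    (4, "Waiting for an event from Coordinator"),
    (5, "Waiting for an event from Coordinator"),
    (6, "Waiting for an event from Coordinator"),
    (7, "Waiting for an event from Coordinator"),
    (8, "Waiting for an event from Coordinator"),
]

# A worker with nothing assigned to it reports this state.
IDLE_WORKER_STATE = "Waiting for an event from Coordinator"

# States shown here by the I/O thread and the coordinator, not by workers.
NON_WORKER_STATES = {
    "Waiting for master to send event",
    "Waiting for dependent transaction to commit",
}


def summarize_workers(threads):
    """Count worker threads and how many of them are idle or busy."""
    workers = [state for _, state in threads if state not in NON_WORKER_STATES]
    idle = sum(1 for state in workers if state == IDLE_WORKER_STATE)
    return {"workers": len(workers), "idle": idle, "busy": len(workers) - idle}


summary = summarize_workers(replication_threads)
print(summary)
```

In this snapshot only one of the six workers is busy (thread 3, in "System lock"), while the coordinator is waiting for a dependent transaction to commit. A single snapshot proves little, but a pattern like this suggests the workers are being fed one transaction at a time.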

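Hunting for the bottleneck

Judging from the above, everything seems normal: parallel replication is enabled, and CPU, I/O, and memory are nowhere near a bottleneck. The master writes binlog at about 2 MB/s, which is not a particularly high log volume.

We also checked the replica for locks and found none.

So we kept re-checking the slave status to see whether any detail stood out, and something abnormal did show up. When the slave status was refreshed frequently, Relay_Log_Pos would often stay stuck at the same value (by this point we had already confirmed that no locks were visible). Perhaps the real cause of the problem lies here!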
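Refreshing the status by hand works, but the same observation can be made mechanical. The sketch below assumes you have already collected (timestamp, Relay_Log_File, Relay_Log_Pos) samples by some means, for example by polling SHOW SLAVE STATUS once per second. `find_stalls` is a hypothetical helper, and the sample data is invented for illustration; only the file name and the position 420090430 come from this case.

```python
from typing import List, Tuple

Sample = Tuple[float, str, int]  # (unix time, Relay_Log_File, Relay_Log_Pos)


def find_stalls(samples: List[Sample], min_seconds: float) -> List[Tuple[str, int, float]]:
    """Return (file, pos, duration) for each run of identical positions
    that lasted at least min_seconds."""
    stalls = []
    run_start = 0
    for i in range(1, len(samples) + 1):
        # A run ends when the (file, pos) pair changes or the samples run out.
        if i == len(samples) or samples[i][1:] != samples[run_start][1:]:
            duration = samples[i - 1][0] - samples[run_start][0]
            if duration >= min_seconds:
                _, file_name, pos = samples[run_start]
                stalls.append((file_name, pos, duration))
            run_start = i
    return stalls


# Invented samples, one per second: the position sits still for three seconds.
samples = [
    (0.0, "relay-bin.001384", 420080000),
    (1.0, "relay-bin.001384", 420090430),
    (2.0, "relay-bin.001384", 420090430),
    (3.0, "relay-bin.001384", 420090430),
    (4.0, "relay-bin.001384", 420090430),
    (5.0, "relay-bin.001384", 420150000),
]

print(find_stalls(samples, min_seconds=3.0))
```

The position reported for a stall is the one to hand to mysqlbinlog as the start position when inspecting the relay log.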

Following the trail

Analyze the binlog or relay log to look for clues:

[mysql@xxx data]$ mysqlbinlog --no-defaults -v -v --base64-output=DECODE-ROWS  relay-bin.001384 --start-position=420090430|more

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 420090430
#190923  9:24:28 server id 42  end_log_pos 420090282 CRC32 0xd9097eaf  GTID  last_committed=57148  sequence_number=57149  rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= 'eab7fcac-3cda-11e6-ada8-fa163e648db2:69415610'/*!*/;
# at 420090495
#190923  9:24:28 server id 42  end_log_pos 420090364 CRC32 0x82b57dfd  Query  thread_id=95  exec_time=0  error_code=0
SET TIMESTAMP=1569201868/*!*/;
SET @@session.pseudo_thread_id=95/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
SET @@session.auto_increment_increment=2, @@session.auto_increment_offset=1/*!*/;
/*!\C gbk *//*!*/;
SET @@session.character_set_client=28,@@session.collation_connection=28,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 420090577
#190923  9:24:28 server id 42  end_log_pos 420090585 CRC32 0x752e27cf  Table_map: `net`.`f_event` mapped to number 108
# at 420090798
#190923  9:24:28 server id 42  end_log_pos 420090812 CRC32 0x72b8e10d  Table_map: `net`.`f_eventstorage` mapped to number 245
# at 420091025
#190923  9:24:28 server id 42  end_log_pos 420091039 CRC32 0x1797f9d8  Table_map: `net`.`f_eventstorage` mapped to number 245
# at 420091252
#190923  9:24:28 server id 42  end_log_pos 420091106 CRC32 0x8af14ad2  Table_map: `net`.`f_eventdetail` mapped to number 243
# at 420091319
#190923  9:24:28 server id 42  end_log_pos 420091177 CRC32 0xf1ce87c8  Table_map: `net`.`f_eventoperation` mapped to number 244
# at 420091390
#190923  9:24:28 server id 42  end_log_pos 420091244 CRC32 0x586c0b9d  Table_map: `net`.`f_eventaudit` mapped to number 242
# at 420091457
#190923  9:24:28 server id 42  end_log_pos 420093382 CRC32 0x505e5408  Update_rows: table id 108
# at 420093595
#190923  9:24:28 server id 42  end_log_pos 420098858 CRC32 0x0f404509  Update_rows: table id 245
# at 420099071
#190923  9:24:28 server id 42  end_log_pos 420098910 CRC32 0xb8d9ed15  Write_rows: table id 243
# at 420099123
#190923  9:24:28 server id 42  end_log_pos 420098966 CRC32 0x3c489a7f  Write_rows: table id 244 flags: STMT_END_F

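We looked at position 420090430 in relay log relay-bin.001384, where the apply was stuck. It is just the event that sets GTID_NEXT, so it tells us nothing useful.

Next question: which tables did the database have open at the moment of the stall?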

mysql> show open tables where In_use=1;
+-----------+---------------------+--------+-------------+
| Database  | Table               | In_use | Name_locked |
+-----------+---------------------+--------+-------------+
| net       | f_currentxxx        |      1 |           0 |
+-----------+---------------------+--------+-------------+

Is there anything special about this table? Let's look at its definition:

mysql> show create table net.f_currentxxx\G
*************************** 1. row ***************************
       Table: f_currentxxx
Create Table: CREATE TABLE `f_currentxxx` (
  `serial` int(20) NOT NULL COMMENT 'xxx',
  `audittime` bigint(20) NOT NULL COMMENT 'xxx',
  `type` int(11) DEFAULT NULL COMMENT 'xxx',
  `severity` int(11) DEFAULT NULL COMMENT 'xxx',
  KEY `audittime` (`audittime`)
) ENGINE=MEMORY DEFAULT CHARSET=gbk COMMENT='xxx'
1 row in set (0.00 sec)

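Notice anything unusual?

That's right: the table's storage engine is ENGINE=MEMORY. Replicating a MEMORY table is, first of all, pointless if nothing on the replica ever queries it. Beyond that, replication performance for MEMORY tables is very poor. One likely contributor is that MEMORY tables only support table-level locking, so parallel workers that touch this table cannot apply their changes concurrently. If the table really must be replicated, consider changing its storage engine to InnoDB.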

mysql> select table_name from information_schema.tables where TABLE_SCHEMA='net' and ENGINE='memory';
+----------------------+
| table_name           |
+----------------------+
| f_currentxxx         |
+----------------------+
1 row in set, 6 warnings (0.01 sec)

To be thorough, we checked the entire replicated database, and this is indeed the only table using the MEMORY storage engine.

The fix

Stop replication, add replicate-ignore-table=net.f_currentxxx to the options, restart replication, and watch the slave status.

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: xxx.xxx.xxx.xxx
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.009194
          Read_Master_Log_Pos: 939698255
               Relay_Log_File: relay-bin.001964
                Relay_Log_Pos: 444060572
        Relay_Master_Log_File: mysql-bin.009027
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: net
          Replicate_Ignore_DB: mysql
           Replicate_Do_Table:
       Replicate_Ignore_Table: net.f_currentxxx
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 444060359
              Relay_Log_Space: 180287882098
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 179221
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 42
                  Master_UUID: eab7fcac-3cda-11e6-ada8-fa163e648db2
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:69497322-107886661
            Executed_Gtid_Set: 1264a536-da12-11e9-81ea-005056856ba5:1,
eab7fcac-3cda-11e6-ada8-fa163e648db2:1-71980857
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

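We can see that the table net.f_currentxxx is now excluded from replication. After watching for a while, Seconds_Behind_Master is steadily shrinking. The replica applies roughly one relay log every 5 minutes (each relay log is 1 GB), while the master produces roughly one binlog every 10 minutes (each binlog is 1 GB).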
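Those two rates are enough for a back-of-the-envelope estimate of how long the replica needs to catch up. The sketch below is only that: it assumes both rates stay constant, treats every binlog as a full 1 GB file, ignores the positions inside the current files, and takes the backlog from Master_Log_File and Relay_Master_Log_File in the status output above. `catch_up_minutes` is a made-up helper name.

```python
def catch_up_minutes(backlog_gb: float, apply_min_per_gb: float, produce_min_per_gb: float) -> float:
    """Estimate minutes until the backlog is gone, given constant rates."""
    apply_rate = 1.0 / apply_min_per_gb      # GB applied per minute on the replica
    produce_rate = 1.0 / produce_min_per_gb  # GB written per minute on the master
    net_rate = apply_rate - produce_rate     # GB of backlog removed per minute
    if net_rate <= 0:
        raise ValueError("the replica is not catching up at these rates")
    return backlog_gb / net_rate


# Backlog in binlog files: Master_Log_File minus Relay_Master_Log_File.
backlog_files = 9194 - 9027

# One file is about 1 GB; apply takes 5 min per GB, the master writes 1 GB per 10 min.
minutes = catch_up_minutes(backlog_files * 1.0, apply_min_per_gb=5, produce_min_per_gb=10)

print(f"backlog: {backlog_files} files, estimated catch-up: {minutes / 60:.1f} hours")
```

With 167 files of backlog and a net gain of one file every 10 minutes, the estimate comes out at roughly 28 hours, provided the load on the master does not change.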

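Summary

Looking back over the whole process, the problem was not actually hard. What matters is being careful and checking every point you can think of. In databases that are not lightweight, the likelihood of problems grows with scale, and that is exactly what pushes a person to grow.

At the same time, you need a solid store of knowledge. That is a prerequisite for becoming an expert!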