Redis Master-Slave Replication (Read/Write Splitting) and Sentinel (Master-Slave Failover) Configuration

Date: 2019-11-05

Tags: redis, master-slave replication, read/write splitting, sentinel, failover, configuration. Category: Redis

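Redis replication is very powerful: one master can have multiple slaves, and a slave can in turn have slaves of its own, which makes it possible to build a multi-level (cascaded) server cluster.
Official site: https://redis.io/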

Environment:
Master:

[root@Master ~]# uname -a
Linux Master.Redis 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@Master ~]#

Slave:

[root@Slave ~]# uname -a
Linux Slave.Redis 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@Slave ~]#

In this experiment, two instances run on the Slave host, on ports 6979 and 6980.

Redis version:
redis-4.0.10

Redis Master-Slave Replication (Read/Write Splitting) Configuration

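Redis master-slave replication comes in two forms, depending on whether the whole data set is transferred: full synchronization and incremental synchronization.
1. Full synchronization
A full synchronization usually happens when a slave is initialized, because the slave needs a copy of all the data on the master. The steps are:
  1) The slave connects to the master and sends the SYNC command (Redis 2.8 and later send PSYNC, which falls back to this same procedure when a partial resynchronization is not possible);
  2) On receiving the SYNC command, the master runs BGSAVE to generate an RDB file and uses a buffer to record every write command executed from then on;
  3) When BGSAVE finishes, the master sends the snapshot file to all slaves and keeps recording the write commands executed while it is being sent;
  4) On receiving the snapshot file, the slave discards all its old data and loads the snapshot;
  5) After the snapshot has been sent, the master starts sending the buffered write commands to the slave;
  6) The slave finishes loading the snapshot, starts accepting command requests, and executes the write commands from the master's buffer.

Once these steps are complete, the slave's data is fully initialized and it can serve read requests from clients.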

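2. Incremental synchronization
Incremental replication is the process by which write operations executed on the master are propagated to a slave after the slave has been initialized and is running normally. Each time the master executes a write command, it sends the same command to its slaves, which receive and execute it.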

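3. Redis master-slave synchronization strategy
When a slave first connects to the master, a full synchronization is performed; after it finishes, incremental synchronization takes over. A slave can also initiate a full synchronization at any time if needed. The Redis strategy is to always try an incremental (partial) synchronization first and, if that fails, to require the slave to perform a full synchronization.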
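
The "try incremental first, fall back to full" strategy can be sketched as a small decision function. This is a simplified model rather than Redis source code: MasterBacklog and choose_sync are hypothetical names, the field names are borrowed from the INFO replication output, and the secondary replication ID (master_replid2) is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class MasterBacklog:
    replid: str             # master_replid reported by INFO replication
    first_byte_offset: int  # repl_backlog_first_byte_offset
    histlen: int            # repl_backlog_histlen


def choose_sync(backlog: MasterBacklog, slave_replid: str, slave_offset: int) -> str:
    """Return "partial" or "full" for a slave that wants to resume at slave_offset."""
    # A slave that followed a different replication history needs a full copy.
    if slave_replid != backlog.replid:
        return "full"
    # The requested offset must still be covered by the master's backlog.
    lowest = backlog.first_byte_offset
    highest = backlog.first_byte_offset + backlog.histlen
    if lowest <= slave_offset <= highest:
        return "partial"
    return "full"
```

Under these assumptions, a slave that reconnects with a matching replication ID and an offset still inside the backlog gets a partial resynchronization; a brand-new slave with no cached master gets a full one.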

Installation

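1. The Redis master needs no special configuration; a normal configuration is enough.
2. The Redis slave must name its master in its configuration file.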

Master

[root@Master ~]# mkdir /opt/soft
[root@Master ~]# cd /opt/soft/
[root@Master soft]# wget http://download.redis.io/releases/redis-4.0.10.tar.gz
[root@Master soft]# tar -zxvf redis-4.0.10.tar.gz
[root@Master soft]# cd redis-4.0.10
[root@Master redis-4.0.10]# make
......
Hint: It's a good idea to run 'make test' ;)

make[1]: Leaving directory `/opt/soft/redis-4.0.10/src'
[root@Master redis-4.0.10]# cd src/
[root@Master src]# make test
......

\o/ All tests passed without errors!

Cleanup: may take some time... OK
[root@Master src]# mkdir -p /opt/redis/{logs,etc,data,bin}
[root@Master src]# vim /opt/redis/etc/redis.conf
[root@Master src]# cat /opt/redis/etc/redis.conf |grep -v "#"|sed '/^[[:space:]]*$/d'
bind 0.0.0.0
protected-mode yes
port 6979
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize yes
supervised no
pidfile /var/run/redis_6979.pid
loglevel notice
logfile "./logs/redis.log"
databases 16
always-show-logo yes
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir ./data/
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
requirepass 51cto
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
slave-lazy-flush no
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
[root@Master src]# touch /opt/redis/logs/redis.log
[root@Master src]# echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
[root@Master src]#  sysctl -p
[root@Master src]# echo '511' > /proc/sys/net/core/somaxconn
[root@Master src]# echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled 
[root@Master src]# cd /opt/redis/bin/
[root@Master bin]# ./redis-server ../etc/redis.conf 
[root@Master bin]# tail -500f ../logs/redis.log 
9084:C 30 Jun 23:16:40.766 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
9084:C 30 Jun 23:16:40.766 # Redis version=4.0.10, bits=64, commit=00000000, modified=0, pid=9084, just started
9084:C 30 Jun 23:16:40.766 # Configuration loaded
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.10 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6979
 |    `-._   `._    /     _.-'    |     PID: 9085
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

9085:M 30 Jun 23:16:40.770 # Server initialized
9085:M 30 Jun 23:16:40.771 * Ready to accept connections

After Redis is installed on the Slave, copy the whole /opt/redis directory from the Master to it:

[root@Master bin]# scp -r /opt/redis root@10.15.43.16:/opt/

Edit the Redis configuration on the Slave:

[root@Slave redis]# cat /opt/redis/etc/redis_6979.conf |grep -v "#"|sed '/^[[:space:]]*$/d'
bind 0.0.0.0
protected-mode yes
port 6979
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize yes
supervised no
pidfile /var/run/redis_6979.pid
loglevel notice
logfile "../logs/redis_6979.log"
databases 16
always-show-logo yes
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump_6979.rdb
dir ../data/
slaveof 10.15.43.15 6979   // master address and port
masterauth 51cto      // master password; required when the master sets requirepass
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
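slave-priority 100        # Slave priority: an integer shown in the Redis INFO output. If the master stops working properly, Sentinel uses it to choose which slave to promote to master.
# Slaves with a lower priority number are preferred for promotion: given three slaves with priorities 10, 100 and 25, Sentinel picks the one with priority 10.
# 0 is a special priority that marks the slave as unable to become master, so a slave with priority 0 is never promoted by Sentinel.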
requirepass 51cto
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
slave-lazy-flush no
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
[root@Slave redis]# echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
[root@Slave redis]# sysctl -p
[root@Slave redis]# echo '511' > /proc/sys/net/core/somaxconn
[root@Slave redis]# echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
[root@Slave redis]# cd bin/
[root@Slave bin]# ./redis-server ../etc/redis_6979.conf
[root@Slave bin]# tail -500f ../logs/redis_6979.log
9025:C 30 Jun 23:31:06.487 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
9025:C 30 Jun 23:31:06.487 # Redis version=4.0.10, bits=64, commit=00000000, modified=0, pid=9025, just started
9025:C 30 Jun 23:31:06.487 # Configuration loaded
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.10 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6979
 |    `-._   `._    /     _.-'    |     PID: 9026
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

9026:S 30 Jun 23:31:06.491 # Server initialized
9026:S 30 Jun 23:31:06.491 * DB loaded from disk: 0.000 seconds
9026:S 30 Jun 23:31:06.491 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
9026:S 30 Jun 23:31:06.491 * Ready to accept connections
9026:S 30 Jun 23:31:06.491 * Connecting to MASTER 10.15.43.15:6979
9026:S 30 Jun 23:31:06.492 * MASTER <-> SLAVE sync started
9026:S 30 Jun 23:31:06.492 * Non blocking connect for SYNC fired the event.
9026:S 30 Jun 23:31:06.492 * Master replied to PING, replication can continue...
9026:S 30 Jun 23:31:06.494 * Trying a partial resynchronization (request 88f6e9e624a3f8af365809c46e8a3ee2f16945b7:1).
9026:S 30 Jun 23:31:06.494 * Successful partial resynchronization with master.
9026:S 30 Jun 23:31:06.494 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

Start a second instance on the Slave host: copy the slave configuration file above and change the port to 6980.

[root@Slave bin]# cp ../etc/redis_6979.conf ../etc/redis_6980.conf 
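# Change only this instance's own port and file names; the slaveof line must keep the master's port 6979.
[root@Slave bin]# sed -i -e 's/^port 6979/port 6980/' -e 's/redis_6979/redis_6980/g' -e 's/dump_6979/dump_6980/' ../etc/redis_6980.conf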
[root@Slave bin]# touch ../logs/redis_6980.log
[root@Slave bin]# ./redis-server ../etc/redis_6980.conf
[root@Slave bin]# tail -500f ../logs/redis_6980.log
1974:C 01 Jul 11:12:09.351 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1974:C 01 Jul 11:12:09.351 # Redis version=4.0.10, bits=64, commit=00000000, modified=0, pid=1974, just started
1974:C 01 Jul 11:12:09.351 # Configuration loaded
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.10 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6980
 |    `-._   `._    /     _.-'    |     PID: 1975
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

1975:S 01 Jul 11:12:09.355 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1975:S 01 Jul 11:12:09.355 # Server initialized
1975:S 01 Jul 11:12:09.355 * Ready to accept connections
1975:S 01 Jul 11:12:09.355 * Connecting to MASTER 10.15.43.15:6979
1975:S 01 Jul 11:12:09.355 * MASTER <-> SLAVE sync started
1975:S 01 Jul 11:12:09.356 * Non blocking connect for SYNC fired the event.
1975:S 01 Jul 11:12:09.356 * Master replied to PING, replication can continue...
1975:S 01 Jul 11:12:09.358 * Partial resynchronization not possible (no cached master)
1975:S 01 Jul 11:12:09.359 * Full resync from master: 88f6e9e624a3f8af365809c46e8a3ee2f16945b7:57196
1975:S 01 Jul 11:12:09.365 * MASTER <-> SLAVE sync: receiving 214 bytes from master
1975:S 01 Jul 11:12:09.366 * MASTER <-> SLAVE sync: Flushing old data
1975:S 01 Jul 11:12:09.366 * MASTER <-> SLAVE sync: Loading DB in memory
1975:S 01 Jul 11:12:09.366 * MASTER <-> SLAVE sync: Finished with success

Verifying master-slave replication

[root@Slave bin]# ./redis-cli -h 10.15.43.15 -p 6979 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:master
connected_slaves:2
slave0:ip=10.15.43.16,port=6979,state=online,offset=187508,lag=0
slave1:ip=10.15.43.16,port=6980,state=online,offset=187508,lag=1
master_replid:88f6e9e624a3f8af365809c46e8a3ee2f16945b7
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:187508
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:187508
[root@Slave bin]# ./redis-cli -h 10.15.43.16 -p 6979 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:slave
master_host:10.15.43.15
master_port:6979
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:187550
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:88f6e9e624a3f8af365809c46e8a3ee2f16945b7
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:187550
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:56987
repl_backlog_histlen:130564
[root@Slave bin]# ./redis-cli -h 10.15.43.16 -p 6980 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:slave
master_host:10.15.43.15
master_port:6979
master_link_status:up
master_last_io_seconds_ago:3
master_sync_in_progress:0
slave_repl_offset:187564
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:88f6e9e624a3f8af365809c46e8a3ee2f16945b7
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:187564
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:57197
repl_backlog_histlen:130368
[root@Slave bin]# ./redis-cli -h 10.15.43.15 -p 6979 -a 51cto
Warning: Using a password with '-a' option on the command line interface may not be safe.
10.15.43.15:6979> set justin "51cto id:ityunwei2017"
OK
10.15.43.15:6979> get justin
"51cto id:ityunwei2017"
10.15.43.15:6979> quit
[root@Slave bin]# ./redis-cli -h 10.15.43.16 -p 6979 -a 51cto
Warning: Using a password with '-a' option on the command line interface may not be safe.
10.15.43.16:6979> get justin
"51cto id:ityunwei2017"
10.15.43.16:6979> del justin
(error) READONLY You can't write against a read only slave.
10.15.43.16:6979> quit
[root@Slave bin]# ./redis-cli -h 10.15.43.16 -p 6980 -a 51cto
Warning: Using a password with '-a' option on the command line interface may not be safe.
10.15.43.16:6980> get justin
"51cto id:ityunwei2017"
10.15.43.16:6980> del justin
(error) READONLY You can't write against a read only slave.
10.15.43.16:6980> quit
[root@Slave bin]#

A key written on the master can be read on both slaves, so replication is working; the slaves are read-only and reject write operations.
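
The same check can be automated by parsing the text that redis-cli prints for info replication. The sketch below is only an illustration: parse_info and online_slaves are hypothetical helper names, and the parser assumes the key:value layout shown in the output above.

```python
def parse_info(text: str) -> dict:
    """Parse `redis-cli info replication` output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers and the redis-cli password warning.
        if not line or line.startswith("#") or line.startswith("Warning:"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def online_slaves(info: dict) -> list:
    """Return (ip, port) for every slaveN entry whose state is online."""
    result = []
    for key, value in info.items():
        # slave0, slave1, ... but not slave_priority or slave_repl_offset
        if key.startswith("slave") and key[5:].isdigit():
            fields = dict(item.split("=", 1) for item in value.split(","))
            if fields.get("state") == "online":
                result.append((fields["ip"], int(fields["port"])))
    return result
```

Fed the master's output from this experiment, online_slaves would list 10.15.43.16:6979 and 10.15.43.16:6980.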

Sentinel (Master-Slave Failover) Configuration

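Redis officially provides a tool called Sentinel, which is included in the Redis source distribution.

The Sentinel system performs the following three tasks:
1. Monitoring: it continuously checks whether the master and slave servers are running normally;
2. Notification: when a monitored Redis server has a problem, Sentinel can notify an administrator or other applications through an API;
3. Automatic failover: when a master stops working properly, Sentinel starts an automatic failover. It promotes one of the failed master's slaves to be the new master and makes the other slaves replicate from the new master. When clients try to connect to the failed master, the cluster also returns the address of the new master to them, so that the new master takes over from the failed one.

Sentinel configuration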

[root@Slave bin]# cp /opt/soft/redis-4.0.10/sentinel.conf ../etc/
[root@Slave bin]# mkdir /opt/redis/tmp
[root@Slave bin]# grep -v "#" ../etc/sentinel.conf |sed '/^[[:space:]]*$/d'
port 26979
dir ../tmp
sentinel monitor master15 10.15.43.15 6979 1
sentinel auth-pass master15 51cto
sentinel down-after-milliseconds master15 30000
sentinel parallel-syncs master15 1
sentinel failover-timeout master15 180000
logfile ../logs/sentinel.log
[root@Slave bin]#

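master15 is the name of the master to monitor; you can choose it freely. The name may contain only upper- and lower-case letters, digits, and the three characters ".-_". The next two parameters are the master's IP address and port. The final 1 is the quorum, the minimum number of Sentinels that must agree that the master is down. Only one Sentinel is used here for the experiment; in a real environment an odd number (2n+1) of Sentinels is recommended.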
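
To make the fields of the sentinel monitor line concrete, here is a small validation sketch. It is not part of Redis: parse_monitor is a hypothetical helper that applies only the rules stated above (allowed characters in the name, a port number, a quorum of at least 1).

```python
import re

# Letters, digits and the three characters ".-_" described above.
NAME_RE = re.compile(r"[A-Za-z0-9._-]+")


def parse_monitor(line: str):
    """Split a `sentinel monitor` directive into (name, host, port, quorum)."""
    parts = line.split()
    if len(parts) != 6 or parts[0] != "sentinel" or parts[1] != "monitor":
        raise ValueError("not a sentinel monitor directive")
    name, host = parts[2], parts[3]
    port, quorum = int(parts[4]), int(parts[5])
    if not NAME_RE.fullmatch(name):
        raise ValueError("invalid master name: %r" % name)
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    return name, host, port, quorum
```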

[root@Slave bin]# nohup ./redis-sentinel ../etc/sentinel.conf --sentinel &
[1] 3193
[root@Slave bin]# nohup: ignoring input and appending output to ‘nohup.out’
[root@Slave bin]# ./redis-cli -p 26979 info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=master15,status=ok,address=10.15.43.15:6979,slaves=2,sentinels=1
[root@Slave bin]# tail -500f ../logs/sentinel.log 
3193:X 02 Jul 14:04:46.745 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
3193:X 02 Jul 14:04:46.745 # Redis version=4.0.10, bits=64, commit=00000000, modified=0, pid=3193, just started
3193:X 02 Jul 14:04:46.745 # Configuration loaded
3193:X 02 Jul 14:04:46.747 * Running mode=sentinel, port=26979.
3193:X 02 Jul 14:04:46.747 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
3193:X 02 Jul 14:04:46.749 # Sentinel ID is 2d38409d6a2c8ffadd899180eb874afb17486c2c
3193:X 02 Jul 14:04:46.749 # +monitor master master15 10.15.43.15 6979 quorum 1
3193:X 02 Jul 14:04:46.750 * +slave slave 10.15.43.16:6979 10.15.43.16 6979 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:04:46.751 * +slave slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.15 6979

Here "+slave" means that a new slave was discovered; Sentinel has found both slaves, 10.15.43.16:6979 and 10.15.43.16:6980.

At this point Sentinel has rewritten sentinel.conf:

[root@Slave bin]# grep -v "#" ../etc/sentinel.conf |sed '/^[[:space:]]*$/d'
port 26979
dir "/opt/redis/tmp"
sentinel myid 2d38409d6a2c8ffadd899180eb874afb17486c2c
sentinel monitor master15 10.15.43.15 6979 1
sentinel auth-pass master15 51cto
sentinel config-epoch master15 0
sentinel leader-epoch master15 0
logfile "../logs/sentinel.log"
sentinel known-slave master15 10.15.43.16 6980
sentinel known-slave master15 10.15.43.16 6979
sentinel current-epoch 0
[root@Slave bin]#

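Sentinel is now monitoring these three Redis instances. Shut down the master (kill the process or use the shutdown command); after the configured time (down-after-milliseconds, 30 seconds by default), Sentinel logs the following: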

[root@Slave bin]# tail -500f ../logs/sentinel.log 
3193:X 02 Jul 14:20:20.439 # +sdown master master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.439 # +odown master master15 10.15.43.15 6979 #quorum 1/1
3193:X 02 Jul 14:20:20.439 # +new-epoch 1
3193:X 02 Jul 14:20:20.439 # +try-failover master master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.441 # +vote-for-leader 2d38409d6a2c8ffadd899180eb874afb17486c2c 1
3193:X 02 Jul 14:20:20.441 # +elected-leader master master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.441 # +failover-state-select-slave master master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.531 # +selected-slave slave 10.15.43.16:6979 10.15.43.16 6979 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.532 * +failover-state-send-slaveof-noone slave 10.15.43.16:6979 10.15.43.16 6979 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.615 * +failover-state-wait-promotion slave 10.15.43.16:6979 10.15.43.16 6979 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.622 # +promoted-slave slave 10.15.43.16:6979 10.15.43.16 6979 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.622 # +failover-state-reconf-slaves master master15 10.15.43.15 6979
3193:X 02 Jul 14:20:20.692 * +slave-reconf-sent slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:20:21.283 * +slave-reconf-inprog slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:20:22.331 * +slave-reconf-done slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.15 6979
3193:X 02 Jul 14:20:22.394 # +failover-end master master15 10.15.43.15 6979
3193:X 02 Jul 14:20:22.394 # +switch-master master15 10.15.43.15 6979 10.15.43.16 6979
3193:X 02 Jul 14:20:22.394 * +slave slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.16 6979
3193:X 02 Jul 14:20:22.394 * +slave slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979
3193:X 02 Jul 14:20:52.440 # +sdown slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979

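+sdown: this Sentinel subjectively considers the master down (Subjectively Down, SDOWN), that is, the offline judgment made by a single Sentinel instance.
+odown: the master is objectively down (Objectively Down, ODOWN), that is, the judgment reached when multiple Sentinel instances have marked the same server SDOWN and confirmed it with each other through the SENTINEL is-master-down-by-addr command. (A Sentinel can send SENTINEL is-master-down-by-addr to another Sentinel to ask whether it also considers the given server down.)
+try-failover: Sentinel starts the failover.
+failover-end: Sentinel has completed the failover, which included electing the leader Sentinel and selecting the slave to promote.
+switch-master: the master switch takes effect; the master moves from 10.15.43.15 6979 to 10.15.43.16 6979.
+slave: lists the two slaves of the new master, 10.15.43.16:6980 and 10.15.43.15:6979.
+sdown (last line): the old master, which was killed and is now recorded as a slave, is still unreachable, so Sentinel marks it subjectively down.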
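
A monitoring script could watch the Sentinel log for these events. As a minimal sketch, assuming the log line layout shown above, the hypothetical helper parse_switch_master below extracts the old and new master addresses from a +switch-master line.

```python
def parse_switch_master(line: str):
    """Return (name, old_addr, new_addr) for a +switch-master line, else None."""
    marker = "+switch-master "
    idx = line.find(marker)
    if idx == -1:
        return None
    # Expected payload: <master name> <old ip> <old port> <new ip> <new port>
    fields = line[idx + len(marker):].split()
    if len(fields) != 5:
        return None
    name, old_ip, old_port, new_ip, new_port = fields
    return name, (old_ip, int(old_port)), (new_ip, int(new_port))
```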

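Sentinel does not completely discard the information about a stopped instance, because the instance may come back later, at which point Sentinel lets it rejoin. So when an instance stops, Sentinel updates its record of that instance so that it can continue serving in its new role once it returns. In this example the master 10.15.43.15:6979 stopped and the slave 10.15.43.16:6979 was promoted to master; when 10.15.43.15:6979 comes back it will run as a slave of 10.15.43.16:6979, so Sentinel has re-recorded it as a slave of that instance.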

[root@Slave bin]# ./redis-cli -p 26979 info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=master15,status=ok,address=10.15.43.16:6979,slaves=2,sentinels=1
[root@Slave bin]# ./redis-cli -p 6979 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:master
connected_slaves:1
slave0:ip=10.15.43.16,port=6980,state=online,offset=371290,lag=1
master_replid:9f381205761bc87469d017aa8cdf927154861d4a
master_replid2:88f6e9e624a3f8af365809c46e8a3ee2f16945b7
master_repl_offset:371290
second_repl_offset:262772
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:56987
repl_backlog_histlen:314304
[root@Slave bin]# ./redis-cli -p 6980 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:slave
master_host:10.15.43.16
master_port:6979
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:371852
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:9f381205761bc87469d017aa8cdf927154861d4a
master_replid2:88f6e9e624a3f8af365809c46e8a3ee2f16945b7
master_repl_offset:371852
second_repl_offset:262772
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:57197
repl_backlog_histlen:314656
[root@Slave bin]#

Now restart the 10.15.43.15 6979 instance and check the Sentinel log:

[root@Slave bin]# tail -500f ../logs/sentinel.log 
3193:X 02 Jul 14:55:28.440 # -sdown slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979
3193:X 02 Jul 14:55:38.398 * +convert-to-slave slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979

"-sdown" means that the instance 10.15.43.15 6979 is back in service (the opposite of +sdown).
"+convert-to-slave" means that the instance 10.15.43.15 6979 has been set as a slave of 10.15.43.16 6979.

Check the replication info on 10.15.43.15:

[root@Slave bin]# ./redis-cli -h 10.15.43.15 -p 6979 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:slave
master_host:10.15.43.16
master_port:6979
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1530514926
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:312b860d03bc08ac37976005df1d8c6fc1378038
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
[root@Slave bin]#

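10.15.43.15 6979 is now a slave, but its link to the master is down. This is because the current master requires a password, so masterauth "51cto" must be added to the configuration file of 10.15.43.15. More generally, for failover to work in both directions when a password is set, the master's configuration file should also contain masterauth "51cto". After making the change, restart the service: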

[root@Slave bin]# ./redis-cli -h 10.15.43.15 -p 6979 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:slave
master_host:10.15.43.16
master_port:6979
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:454348
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:9f381205761bc87469d017aa8cdf927154861d4a
master_replid2:88f6e9e624a3f8af365809c46e8a3ee2f16945b7
master_repl_offset:454348
second_repl_offset:57197
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:57197
repl_backlog_histlen:397152
[root@Slave bin]# ./redis-cli -p 6979 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:master
connected_slaves:2
slave0:ip=10.15.43.16,port=6980,state=online,offset=457432,lag=1
slave1:ip=10.15.43.15,port=6979,state=online,offset=457432,lag=0
master_replid:9f381205761bc87469d017aa8cdf927154861d4a
master_replid2:88f6e9e624a3f8af365809c46e8a3ee2f16945b7
master_repl_offset:457432
second_repl_offset:262772
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:56987
repl_backlog_histlen:400446
[root@Slave bin]#

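Under normal conditions the lag value hovers between 0 and 1 second; a value above 1 second suggests a problem with the connection between master and slave. Now suppose the master 10.15.43.16 6979 goes down; the Sentinel log shows: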

[root@Slave bin]# tail -500f ../logs/sentinel.log
3193:X 02 Jul 15:09:35.979 # +sdown master master15 10.15.43.16 6979
3193:X 02 Jul 15:09:35.980 # +odown master master15 10.15.43.16 6979 #quorum 1/1
3193:X 02 Jul 15:09:35.980 # +new-epoch 2
3193:X 02 Jul 15:09:35.980 # +try-failover master master15 10.15.43.16 6979
3193:X 02 Jul 15:09:35.983 # +vote-for-leader 2d38409d6a2c8ffadd899180eb874afb17486c2c 2
3193:X 02 Jul 15:09:35.983 # +elected-leader master master15 10.15.43.16 6979
3193:X 02 Jul 15:09:35.983 # +failover-state-select-slave master master15 10.15.43.16 6979
3193:X 02 Jul 15:09:36.075 # +selected-slave slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979
3193:X 02 Jul 15:09:36.075 * +failover-state-send-slaveof-noone slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979
3193:X 02 Jul 15:09:36.131 * +failover-state-wait-promotion slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979
3193:X 02 Jul 15:09:36.858 # +promoted-slave slave 10.15.43.15:6979 10.15.43.15 6979 @ master15 10.15.43.16 6979
3193:X 02 Jul 15:09:36.858 # +failover-state-reconf-slaves master master15 10.15.43.16 6979
3193:X 02 Jul 15:09:36.916 * +slave-reconf-sent slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.16 6979
3193:X 02 Jul 15:09:37.879 * +slave-reconf-inprog slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.16 6979
3193:X 02 Jul 15:09:37.879 * +slave-reconf-done slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.16 6979
3193:X 02 Jul 15:09:37.961 # +failover-end master master15 10.15.43.16 6979
3193:X 02 Jul 15:09:37.962 # +switch-master master15 10.15.43.16 6979 10.15.43.15 6979
3193:X 02 Jul 15:09:37.962 * +slave slave 10.15.43.16:6980 10.15.43.16 6980 @ master15 10.15.43.15 6979
3193:X 02 Jul 15:09:37.962 * +slave slave 10.15.43.16:6979 10.15.43.16 6979 @ master15 10.15.43.15 6979
3193:X 02 Jul 15:10:07.982 # +sdown slave 10.15.43.16:6979 10.15.43.16 6979 @ master15 10.15.43.15 6979

The master has automatically switched to 10.15.43.15 6979:

[root@Slave bin]# ./redis-cli -p 26979 info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=master15,status=ok,address=10.15.43.15:6979,slaves=2,sentinels=1
[root@Slave bin]# ./redis-cli -p 6980 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:slave
master_host:10.15.43.15
master_port:6979
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:472682
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:d92482767f5f83abc3aacb8057081c01a192c122
master_replid2:9f381205761bc87469d017aa8cdf927154861d4a
master_repl_offset:472682
second_repl_offset:463436
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:57197
repl_backlog_histlen:415486
[root@Slave bin]# ./redis-cli -h 10.15.43.15 -p 6979 -a 51cto info replication
Warning: Using a password with '-a' option on the command line interface may not be safe.
# Replication
role:master
connected_slaves:1
slave0:ip=10.15.43.16,port=6980,state=online,offset=473381,lag=1
master_replid:d92482767f5f83abc3aacb8057081c01a192c122
master_replid2:9f381205761bc87469d017aa8cdf927154861d4a
master_repl_offset:473518
second_repl_offset:463436
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:57197
repl_backlog_histlen:416322
[root@Slave bin]#

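When the Sentinel process starts, it reads its configuration file and finds the masters to monitor from the "sentinel monitor master-name ip redis-port quorum" entries. master-name is the name of the master. Because the master's address and port can change after a failover, Sentinel provides a command that returns the current master's IP address and port given its name.
A single Sentinel node can monitor several Redis master-slave systems at once; just provide one "sentinel monitor" entry, with its own master name, for each of them. Sentinel's state is persisted in its configuration file: every time a new configuration is received or created, it is written to disk together with its configuration version, so Sentinel processes can be stopped and restarted safely.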
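
The command that resolves a master name to its current address is SENTINEL get-master-addr-by-name. As a rough illustration of what travels over the wire, the sketch below encodes that command in the Redis serialization protocol (RESP) and decodes a two-element reply. encode_command and parse_addr_reply are hypothetical helpers, no network connection is made, and a real client library would normally handle all of this.

```python
def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def parse_addr_reply(reply: bytes):
    """Decode a two-element RESP array of bulk strings into (host, port)."""
    lines = reply.split(b"\r\n")
    if lines[0] != b"*2":
        # Anything else (for example a nil reply for an unknown name) is
        # treated as "no address" in this sketch.
        return None
    host = lines[2].decode("utf-8")
    port = int(lines[4])
    return host, port


request = encode_command("SENTINEL", "get-master-addr-by-name", "master15")
```

From the shell, the equivalent query against the Sentinel in this experiment would be redis-cli -p 26979 sentinel get-master-addr-by-name master15.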

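After starting, Sentinel opens two connections to each master it monitors:

1. One subscribes to the master's "__sentinel__:hello" channel, to learn about other Sentinel nodes that monitor the same database.
2. One periodically sends commands such as INFO to the master to obtain information about the master itself. A connection in subscribe mode cannot execute other commands, so Sentinel uses a separate connection to send them.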

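Once the connections to the master are established, Sentinel performs three periodic tasks:

1. Every 10 seconds, send the INFO command to the master and the slaves.
INFO gives Sentinel information about the current database (run ID, replication info, and so on), which is how new nodes are discovered automatically. This is why only the master needs to be specified when configuring Sentinel for a master-slave system: Sentinel learns of every slave replicating that master from the INFO output.
After startup, Sentinel sends INFO to the master to get the list of slaves, then opens the same two connections to each slave, serving exactly the same purposes as the two connections to the master. From then on, Sentinel sends INFO every 10 seconds to all known masters and slaves and acts on the results, for example by connecting to and monitoring newly added slaves, or by updating role changes caused by a failover.
2. Every 2 seconds, publish its own information to the "__sentinel__:hello" channel of the master and the slaves, sharing it with the other Sentinels that monitor the same database. In other words, Sentinel both subscribes to this channel and publishes to it, so that other Sentinels learn about it.
The message content is:
<sentinel address>,<sentinel port>,<sentinel run ID>,<sentinel config version>,<master name>,<master address>,<master port>,<master config version>
Sentinel subscribes to the "__sentinel__:hello" channel of every database it monitors, so when another Sentinel receives the message it checks whether the sender is a newly discovered Sentinel. If so, it adds the sender to its list of known Sentinels and opens a connection to it (unlike with databases, Sentinels open only one connection to each other, used for the PING command; no second connection for subscribing is needed, because subscribing to the databases' channels is enough to discover other Sentinels).
Sentinel also checks the master config version in the message; if it is higher than the version currently recorded for that master, Sentinel updates its data for the master.
3. Every 1 second, send the PING command to the master, the slaves and the other Sentinel nodes.
The PING interval depends on the "down-after-milliseconds" option and is at most 1 second: when down-after-milliseconds is less than 1 second, Sentinel sends a PING every down-after-milliseconds; when it is greater than 1 second, Sentinel sends a PING every second.
If a pinged node has not replied after the time given by "down-after-milliseconds", Sentinel considers it subjectively down, meaning that the node is offline from the point of view of this Sentinel process. If the node is a master, Sentinel goes on to decide whether a failover is needed: it sends "SENTINEL is-master-down-by-addr" to the other Sentinel nodes to ask whether they also consider the master subjectively down. If the number that agree reaches the configured quorum, Sentinel considers the master objectively down and an election is held for a leader Sentinel to start the failover. For example, sentinel monitor master15 10.15.43.15 6979 3 means that the current Sentinel considers the master objectively down only when at least 3 Sentinel nodes (itself included) consider it subjectively down.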
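
The PING interval rule and the quorum check from task 3 are simple enough to state as code. This is a sketch of the rules as described in this article, with hypothetical function names, not an excerpt from Sentinel.

```python
def ping_period_ms(down_after_ms: int) -> int:
    """PING interval: down-after-milliseconds, capped at one second."""
    return min(down_after_ms, 1000)


def is_objectively_down(sdown_votes: int, quorum: int) -> bool:
    """ODOWN once the Sentinels reporting SDOWN (including this one) reach quorum."""
    return sdown_votes >= quorum
```

With the settings used in this experiment (down-after-milliseconds 30000, quorum 1), the lone Sentinel pings every second and its own SDOWN judgment is enough to reach ODOWN, which matches the "#quorum 1/1" entry in the log.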
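When a Sentinel node finds the master objectively down, a failover is needed. It must be carried out by a leader Sentinel, which guarantees that only one Sentinel node performs the failover at a time. The leader is elected with a procedure based on the Raft algorithm:
1. The Sentinel node that found the master objectively down (Sentinel A) sends a command to every other Sentinel node, asking it to vote for A as leader.
2. If the target Sentinel node has not yet voted for anyone else, it agrees to make Sentinel A the leader.
3. If Sentinel A finds that more than half of the Sentinel nodes, and at least as many as the quorum value, have voted for it, Sentinel A becomes the leader.
4. If several Sentinel nodes run for leader at the same time, it is possible that no node is elected. Each candidate then waits a random time and runs again in the next round, until the election succeeds.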
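
The voting rules above can be modelled in a few lines. The sketch is a toy model under stated assumptions: SentinelVoter and wins_election are hypothetical names, votes are granted first come, first served within one election round (called an epoch here), and the winning condition is read as "a majority of all Sentinels and at least quorum votes".

```python
class SentinelVoter:
    """One Sentinel's voting state: at most one vote per epoch."""

    def __init__(self) -> None:
        self.votes = {}  # epoch -> candidate this Sentinel voted for

    def request_vote(self, epoch: int, candidate: str) -> str:
        # The first candidate to ask in an epoch gets the vote; later
        # requests in the same epoch are answered with that same name.
        return self.votes.setdefault(epoch, candidate)


def wins_election(votes: int, total_sentinels: int, quorum: int) -> bool:
    """A candidate wins with a majority of all Sentinels and at least quorum votes."""
    majority = total_sentinels // 2 + 1
    return votes >= majority and votes >= quorum
```

With a single Sentinel and quorum 1, as in this experiment, the Sentinel votes for itself and wins immediately, which is what the +vote-for-leader and +elected-leader log lines show.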
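Once elected, the leader Sentinel starts the failover of the master:
a. The leader picks one of the failed master's slaves to become the new master:
1. Among all online slaves, choose the one with the highest priority, that is, the lowest non-zero "slave-priority" value;
2. If several slaves share the highest priority, prefer the one with the larger replication offset (the more complete copy);
3. If everything above is equal, choose the slave with the smaller run ID.
b. The leader sends "slaveof no one" to that slave to promote it to master, then sends the slaveof command to the other slaves to make them slaves of the new master.
c. The leader updates its internal records, marking the stopped old master as a slave of the new master, so that it automatically continues serving as a slave when it comes back.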
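
Step a amounts to an ordering over the candidate slaves. The following sketch expresses that ordering with a sort key; Slave and select_slave are hypothetical names, and the check that a slave is online is assumed to have happened before the list is built.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    addr: str
    priority: int     # slave-priority; 0 means "never promote"
    repl_offset: int  # replication offset; larger means more data
    run_id: str


def select_slave(slaves: List[Slave]) -> Optional[Slave]:
    """Pick the slave to promote: priority, then offset, then run ID."""
    candidates = [s for s in slaves if s.priority != 0]
    if not candidates:
        return None
    # Lower priority number first, larger offset first, smaller run ID first.
    return min(candidates, key=lambda s: (s.priority, -s.repl_offset, s.run_id))
```

For three slaves with priorities 10, 100 and 25, select_slave returns the one with priority 10, as in the slave-priority comment earlier in this article.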
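If a master-slave system has few Sentinels, Sentinel's judgment about the system becomes less reliable. With few nodes, it is advisable to deploy one Sentinel per node (master or slave) and set quorum to N/2 + 1 (where N is the number of Sentinel nodes). With many nodes, remember that each Sentinel connects to every node in the system, so one Sentinel per node produces many connections; in particular, when client-side sharding is used and multiple Sentinel nodes monitor multiple masters, a large number of redundant connections is created because Redis does not support connection multiplexing.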
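Slave selection mainly evaluates the following aspects of each slave:
Time disconnected from the master
Slave priority
Replication offset (used to assess how much of the master's data the slave currently has)
Run ID
A slave that has been disconnected from the master for more than ten times the configured timeout (down-after-milliseconds), plus the time the master has been unavailable from the point of view of the Sentinel doing the failover, is considered unsuitable as the new master and is skipped.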
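
The disconnection rule can be written as a single comparison. This is a sketch of the rule as stated in the paragraph above, with hypothetical parameter names; all times are in milliseconds.

```python
def is_eligible(link_down_ms: int, down_after_ms: int, master_sdown_ms: int) -> bool:
    """True if the slave's link to the master has not been down for too long.

    link_down_ms    -- how long the slave has been disconnected from the master
    down_after_ms   -- the down-after-milliseconds setting
    master_sdown_ms -- how long this Sentinel has seen the master as down
    """
    limit = down_after_ms * 10 + master_sdown_ms
    return link_down_ms <= limit
```

With down-after-milliseconds at 30000 and a master seen as down for 30 seconds, the limit would be 330 seconds of disconnection.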
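Why must a Sentinel be authorized by a majority of Sentinels before it can actually perform the failover?

When a Sentinel is authorized, it obtains a new configuration version number for the failed master, and this version is used for the new configuration once the failover completes. Because a majority of Sentinels know that this version number has been taken by the Sentinel performing the failover, no other Sentinel can use it. This means that every failover carries a unique version number. We will see why this matters.

Sentinels also follow a rule: if Sentinel A voted for Sentinel B to fail over a master, A waits for some time before trying to fail over the same master itself. This wait is based on the failover-timeout setting. It follows from this rule that the Sentinels in a cluster never fail over the same master concurrently; if the first Sentinel to attempt the failover does not succeed, another one retries after a certain time, and so on.

Redis Sentinel guarantees liveness: if a majority of Sentinels can talk to each other, eventually one of them will be authorized to perform the failover.
Redis Sentinel also guarantees safety: every Sentinel that tries to fail over the same master gets a unique version number.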

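The Sentinels in a cluster also communicate with each other, through a gossip protocol.

Once a Sentinel has successfully failed over a master, it broadcasts the master's new configuration to the other Sentinels, which update their configuration for that master.

For a failover to be considered successful, the Sentinel must be able to send the SLAVEOF NO ONE command to the slave selected as master and then see the switch to master in that instance's INFO output.

Once a slave has been selected as master and sent SLAVEOF NO ONE, the failover is considered successful even if the other slaves have not yet reconfigured themselves for the new master, and all Sentinels then publish the new configuration.

The way new configurations propagate through the cluster is the reason a Sentinel must be granted a version number before it performs a failover.

Even when no failover is in progress, Sentinels keep applying the current configuration to the masters they monitor. If a node that the latest configuration identifies as a slave claims to be a master, it is reconfigured as a slave of the current master. If slaves are connected to the wrong master, they are corrected and reconnected to the right one.

Each Sentinel continuously propagates the master's configuration version using Pub/Sub; the Pub/Sub channel used for configuration propagation is __sentinel__:hello.

Because every configuration has a version number, the one with the highest version number wins.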
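
The "highest version wins" rule can be sketched on top of the hello message format given earlier (eight comma-separated fields ending with the master's configuration version). parse_hello and apply_hello are hypothetical helpers, and the sample payload is made up from addresses used in this experiment.

```python
def parse_hello(payload: str) -> dict:
    """Split a hello payload into its eight comma-separated fields."""
    fields = payload.split(",")
    if len(fields) != 8:
        raise ValueError("expected 8 comma-separated fields")
    ip, port, run_id, epoch, name, m_ip, m_port, m_epoch = fields
    return {
        "sentinel_addr": (ip, int(port)),
        "sentinel_run_id": run_id,
        "sentinel_epoch": int(epoch),
        "master_name": name,
        "master_addr": (m_ip, int(m_port)),
        "master_config_epoch": int(m_epoch),
    }


def apply_hello(known_addr, known_epoch, hello):
    """Adopt the advertised master address only if its version is newer."""
    if hello["master_config_epoch"] > known_epoch:
        return hello["master_addr"], hello["master_config_epoch"]
    return known_addr, known_epoch
```

A Sentinel that still records 10.15.43.15:6979 at version 0 would switch to 10.15.43.16:6979 on receiving a hello that advertises version 1, and would ignore a hello carrying an older version.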

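Deploying Redis master-slave across data centers

Consider three hosts, each running one Redis instance and one Sentinel. redis2 and redis3 are in data center A, and redis1, the original master, is in data center B. When the network between data center A and data center B is cut, sentinel3 and sentinel2 start a failover and elect redis2 as master. sentinel1 still has the old configuration, because it is isolated from sentinel3 and sentinel2. When the network recovers, sentinel1 updates its configuration and turns redis1 into a slave of redis2. During the partition, however, clients could still write to redis1, and the data written to redis1 during that time is lost once the network recovers.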

The Redis options min-slaves-to-write and min-slaves-max-lag prevent a master from executing write commands under unsafe conditions, so redis1 can be configured to reject client writes while the network is down. With min-slaves-to-write 2 and min-slaves-max-lag 10, the master refuses write commands when fewer than 2 slaves are connected with a lag of at most 10 seconds; the lag here is the lag value reported by INFO replication.
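
The effect of the two options can be sketched as a check over the lag values that INFO replication reports for each slave. accepts_writes is a hypothetical function that models the rule described above; it is not how Redis implements it internally.

```python
from typing import List


def accepts_writes(slave_lags: List[int],
                   min_slaves_to_write: int = 2,
                   min_slaves_max_lag: int = 10) -> bool:
    """True if enough slaves are connected with an acceptable lag (seconds)."""
    # Setting either option to 0 disables the check.
    if min_slaves_to_write == 0 or min_slaves_max_lag == 0:
        return True
    good = sum(1 for lag in slave_lags if lag <= min_slaves_max_lag)
    return good >= min_slaves_to_write
```

In the partition scenario above, redis1 loses both of its slaves, so its list of lags is empty and, with these settings, it would refuse writes until the link returns.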