Aerospike CLUSTER INTEGRITY FAULT: Problem Analysis

1. Symptoms (version 3.5.9)

Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb900007f14eb4b and self bb9ffe723270008
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb900007f14eb4b
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9ffe723270008 retransmitting sync messages to nodes that have not responded yet ... 
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb900007f14eb4b
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [9.0]@bb9ffe723270008: bb9ffe723270008 bb900007f14eb4b 
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb900007f14eb4b and self bb9ffe723270008
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9ffe723270008 retransmitting sync messages to nodes that have not responded yet ... 
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [9.0]@bb9ffe723270008: bb9ffe723270008 bb900007f14eb4b
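
Because the same fault is logged every few seconds, it helps to reduce the log to the distinct node IDs being reported. The following is a minimal sketch, not part of Aerospike: `fault_nodes` is a hypothetical helper name, and it assumes the message format shown in the log lines above.

```bash
# Hypothetical helper: read server log lines on stdin and print each node ID
# named in a "CLUSTER INTEGRITY FAULT ... dun:nodes=" (or undun:nodes=) message once.
fault_nodes() {
    sed -n 's/.*CLUSTER INTEGRITY FAULT.*dun:nodes=\([0-9a-f,]*\).*/\1/p' |
        tr ',' '\n' |      # one node ID per line if several are listed
        grep -v '^$' |     # drop empty fields left by a trailing comma
        sort -u            # collapse the repeated reports
}
```

Usage, assuming the log lives at /var/log/aerospike/aerospike.log (adjust to the path configured on your node): `fault_nodes < /var/log/aerospike/aerospike.log`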

2. Cluster Integrity Check

// for each node in the succession list
	// compare the node's succession list with this server's succession list

	bool cluster_integrity_fault = false;
	bool are_nodes_not_dunned = false;
	for (int i = 0; i < g_config.paxos_max_cluster_size; i++) {
		cf_debug(AS_PAXOS, "Cluster Integrity Check: %d, %"PRIx64"", i, succ_list_index[i]);
		if (succ_list_index[i] == (cf_node) 0) {
			break; // we are done
		}

3. CLUSTER INTEGRITY FAULT

switch (g_config.paxos_recovery_policy) {

			case AS_PAXOS_RECOVERY_POLICY_MANUAL:
			{
				if (are_nodes_not_dunned) {
					snprintf(sbuf, 97, "CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=");
				} else {
					snprintf(sbuf, 99, "CLUSTER INTEGRITY FAULT. [Phase 2 of 2] To fix, issue this command across all nodes:  undun:nodes=");
				}

				bool nodes_missing = false;
				for (int i = 0; i < g_config.paxos_max_cluster_size; i++) {
					if ((cf_node)0 == missing_nodes[i]) {
						break;
					}

					snprintf(sbuf + strlen(sbuf), 18, "%"PRIx64",", missing_nodes[i]);
					nodes_missing = true;
				}

4. Root Cause Analysis

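Whenever two nodes cannot synchronize state with each other over port 3002 in both directions, the problem above occurs.

There are several possible causes: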

① Firewall

# How to verify
telnet ip port
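
Where telnet is not installed, the same check can be scripted. This is a rough sketch, assuming `bash` (for its `/dev/tcp` pseudo-device) and the coreutils `timeout` command are available; `check_port` is a hypothetical helper name. It must be run from each node toward the other, since the fault here is caused by connectivity that works in only one direction.

```bash
# Hypothetical helper: try to open a TCP connection to host:port.
# Prints the result and returns 0 if the connection succeeded, 1 otherwise.
check_port() {
    local host="$1" port="$2" wait="${3:-3}"   # wait: timeout in seconds
    if timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "reachable: $host:$port"
    else
        echo "unreachable: $host:$port"
        return 1
    fi
}
```

Usage: `check_port 192.168.56.101 3002` on one node, then `check_port 192.168.56.100 3002` on the other.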

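② The process has exhausted its file descriptors, so it cannot create new sockets

# Aerospike default fd limit (aerospike.conf)
proto-fd-max 15000
# How to verify
ls -l /proc/<pid>/fd | grep socket | wc -l
lsof -p <asd-pid> | grep "can't identify protocol" | wc -l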
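
The manual count above can be wrapped in a small script that also compares the result with the configured limit. This is only a sketch: both function names are hypothetical, and the 90% warning threshold is an arbitrary assumption rather than anything Aerospike defines.

```bash
# Hypothetical helper: count the socket file descriptors held by a process
# by reading the symlinks under /proc/<pid>/fd (Linux only).
count_socket_fds() {
    local pid="$1" n=0 fd target
    for fd in /proc/"$pid"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            socket:*) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Hypothetical helper: succeed when usage has reached 90% of the limit
# (the 90% threshold is an assumption chosen for illustration).
near_fd_limit() {
    local used="$1" max="$2"
    [ "$used" -ge $(( max * 9 / 10 )) ]
}
```

Usage, with the asd process ID and the proto-fd-max value quoted above: `near_fd_limit "$(count_socket_fds <asd-pid>)" 15000 && echo "close to the fd limit"`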
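
Node  Node ID          IP
100   BB9FFE723270008  192.168.56.100
101   BB900007F80090B  192.168.56.101

Node 101 can connect to port 3002 on node 100, but node 100 cannot connect to port 3002 on node 101. This one-way connectivity is what produces the cluster node state confirmation problem shown above.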

Node 101 connecting to port 3002 on node 100

[root@c101 ~]# telnet 192.168.56.100 3002
Trying 192.168.56.100...
Connected to 192.168.56.100.
Escape character is '^]'.
Mhc

Node 100 connecting to port 3002 on node 101

[root@c100 ~]# telnet 192.168.56.101 3002
Trying 192.168.56.101...
telnet: connect to address 192.168.56.101: No route to host
[root@c100 ~]#

Network state on node 100

[root@c100 ~]# netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN     
tcp        0      0 192.168.56.100:3002     0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:3003            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN     
tcp        0      0 192.168.56.100:3002     192.168.56.101:58930    ESTABLISHED
tcp        0     52 192.168.56.100:22       192.168.56.1:52622      ESTABLISHED
tcp        0      0 192.168.56.100:22       192.168.56.1:52188      ESTABLISHED
tcp        0      0 192.168.56.100:22       192.168.56.1:52031      ESTABLISHED
tcp6       0      0 :::3306                 :::*                    LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
[root@c100 ~]#

Network state on node 101

[root@c101 ~]# netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN     
tcp        0      0 192.168.56.101:3002     0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:3003            0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN     
tcp        0      0 192.168.56.101:22       192.168.56.1:55739      ESTABLISHED
tcp        0      0 192.168.56.101:58930    192.168.56.100:3002     ESTABLISHED
tcp        0     52 192.168.56.101:22       192.168.56.1:55723      ESTABLISHED
tcp        0      0 192.168.56.101:22       192.168.56.1:55135      ESTABLISHED
tcp6       0      0 ::1:25                  :::*                    LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
[root@c101 ~]#
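
The two listings differ in one respect: node 101 holds an outbound ESTABLISHED connection to 192.168.56.100:3002, while node 100 has no outbound connection to 192.168.56.101:3002, only the inbound one from 101. A short sketch of checking this automatically is given here; `has_outbound` is a hypothetical helper name, and it assumes the `netstat -nat` column layout shown above.

```bash
# Hypothetical helper: read `netstat -nat` output on stdin and succeed only if
# there is an ESTABLISHED connection whose foreign address is peer:port,
# i.e. a connection this host opened to the peer's listening port.
has_outbound() {
    local peer="$1" port="$2"
    awk -v target="$peer:$port" '
        $1 ~ /^tcp/ && $5 == target && $6 == "ESTABLISHED" { found = 1 }
        END { exit !found }
    '
}
```

Usage on node 100: `netstat -nat | has_outbound 192.168.56.101 3002 || echo "no outbound connection to 101:3002"`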

5. How to Reproduce

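The condition to create is: node A can connect to node B, but node B cannot connect to node A.
Enabling the firewall on node A is enough to reproduce it.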
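
The post does not say which firewall was used. As one possible way to get the same effect, the sketch below builds an iptables rule that rejects inbound TCP traffic on a single port; this is an assumption, not the original setup. `reject_inbound_port` is a hypothetical helper that only prints the command unless called with `apply`, which requires root. A reject with icmp-host-prohibited makes the remote side see "No route to host", matching the telnet output above.

```bash
# Hypothetical helper: print (default) or apply an iptables rule that rejects
# inbound TCP connections to the given port on this node.
reject_inbound_port() {
    local port="$1" mode="${2:-dry-run}" rule
    case "$port" in
        ''|*[!0-9]*) echo "port must be numeric" >&2; return 2 ;;
    esac
    rule="iptables -I INPUT -p tcp --dport $port -j REJECT --reject-with icmp-host-prohibited"
    if [ "$mode" = "apply" ]; then
        $rule                    # unquoted on purpose so the string splits into arguments
    else
        printf '%s\n' "$rule"    # dry run: show what would be executed
    fi
}
```

Usage on node A: `reject_inbound_port 3002` to inspect the rule, then `reject_inbound_port 3002 apply` as root. Running the same iptables command with `-D` in place of `-I` removes the rule again.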