1. Symptoms (version 3.5.9)
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb900007f14eb4b and self bb9ffe723270008
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes: dun:nodes=bb900007f14eb4b
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9ffe723270008 retransmitting sync messages to nodes that have not responded yet ...
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb900007f14eb4b
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [9.0]@bb9ffe723270008: bb9ffe723270008 bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb900007f14eb4b and self bb9ffe723270008
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes: dun:nodes=bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9ffe723270008 retransmitting sync messages to nodes that have not responded yet ...
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [9.0]@bb9ffe723270008: bb9ffe723270008 bb900007f14eb4b
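When the same fault repeats every few seconds, it can help to collapse the log into the distinct recovery hints it contains. The following is a minimal sketch that assumes the message layout shown in the excerpt above; the function name `extract_fault_hints` and the regular expression are illustrative and are not part of Aerospike.

```python
import re

# Assumed layout, taken from the log excerpt above:
#   CLUSTER INTEGRITY FAULT. [Phase N of 2] ... dun:nodes=<hex ids, comma separated>
FAULT_RE = re.compile(
    r"CLUSTER INTEGRITY FAULT\. \[Phase (\d) of 2\].*?\b(dun|undun):nodes=([0-9a-fA-F,]+)"
)


def extract_fault_hints(log_text):
    """Return the sorted, de-duplicated (phase, command, node_ids) hints in a log."""
    hints = set()
    for match in FAULT_RE.finditer(log_text):
        phase = int(match.group(1))
        command = match.group(2)
        # Drop empty items so a trailing comma after the last node id is harmless.
        nodes = tuple(n for n in match.group(3).split(",") if n)
        hints.add((phase, command, nodes))
    return sorted(hints)
```

Applied to the excerpt above, the ten log lines reduce to a single hint: phase 1, command `dun`, node `bb900007f14eb4b`.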
2. Cluster Integrity Check
// for each node in the succession list
// compare the node's succession list with this server's succession list
bool cluster_integrity_fault = false;
bool are_nodes_not_dunned = false;
for (int i = 0; i < g_config.paxos_max_cluster_size; i++) {
    cf_debug(AS_PAXOS, "Cluster Integrity Check: %d, %"PRIx64"", i, succ_list_index[i]);
    if (succ_list_index[i] == (cf_node) 0) {
        break; // we are done
    }
3. CLUSTER INTEGRITY FAULT
switch (g_config.paxos_recovery_policy) {
case AS_PAXOS_RECOVERY_POLICY_MANUAL:
{
    if (are_nodes_not_dunned) {
        snprintf(sbuf, 97, "CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes: dun:nodes=");
    } else {
        snprintf(sbuf, 99, "CLUSTER INTEGRITY FAULT. [Phase 2 of 2] To fix, issue this command across all nodes: undun:nodes=");
    }
    bool nodes_missing = false;
    for (int i = 0; i < g_config.paxos_max_cluster_size; i++) {
        if ((cf_node)0 == missing_nodes[i]) {
            break;
        }
        snprintf(sbuf + strlen(sbuf), 18, "%"PRIx64",", missing_nodes[i]);
        nodes_missing = true;
    }
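4. Root cause analysis
The problem above appears whenever two nodes cannot both reach each other on port 3002 to synchronize state.
There are many possible causes:
① Firewall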
# How to verify
telnet ip port
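② The process has run out of file descriptors, so it cannot create new sockets
# Default fd limit for Aerospike, set in aerospike.conf
proto-fd-max 15000
# How to verify
ll /proc/pid/fd | grep socket | wc -l
lsof -p asd-pid | grep "can't identify protocol" | wc -l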
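The fd check above can also be scripted, which is convenient when it has to run repeatedly on several nodes. This is a minimal sketch for Linux, assuming the usual `/proc/<pid>/fd` layout in which every entry is a symlink and socket entries point at a target of the form `socket:[inode]`; the helper name `count_socket_fds` is hypothetical.

```python
import os


def count_socket_fds(fd_dir):
    """Count (all_fds, socket_fds) for a directory laid out like /proc/<pid>/fd."""
    total = 0
    sockets = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The fd may have been closed between listdir() and readlink().
            continue
        total += 1
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets
```

A possible use is `count_socket_fds("/proc/%d/fd" % pid)` with the pid of the asd process, comparing the result with the configured `proto-fd-max`. Reading another user's `/proc/<pid>/fd` normally requires root.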
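Node  Node ID           IP
100   BB9FFE723270008   192.168.56.100
101   BB900007F80090B   192.168.56.101

101 can connect to 100, but 100 cannot connect to 101, and this produces the problem above.
Confirming the state of each node in the Aerospike cluster: node 101 can connect to port 3002 on node 100, but node 100 cannot connect to port 3002 on node 101.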
101 connecting to port 3002 on 100
[root@c101 ~]# telnet 192.168.56.100 3002
Trying 192.168.56.100...
Connected to 192.168.56.100.
Escape character is '^]'.
Mhc
100 connecting to port 3002 on 101
[root@c100 ~]# telnet 192.168.56.101 3002
Trying 192.168.56.101...
telnet: connect to address 192.168.56.101: No route to host
[root@c100 ~]#
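The two telnet checks above can be expressed as a small script: one helper attempts a TCP connection, and another finds links that work in only one direction. This is a sketch using only the Python standard library; `can_connect` and `find_one_way_links` are illustrative names, and the results for each direction have to be collected by running the connection test on the respective source node.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a TCP connection; return (True, "") or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refusals, timeouts and errors such as "No route to host".
        return False, str(exc)


def find_one_way_links(reachable):
    """reachable maps (src, dst) -> bool, as measured from src.

    Return the (src, dst) pairs where src reaches dst but the reverse
    direction is known to fail or was never measured.
    """
    one_way = []
    for (src, dst), ok in sorted(reachable.items()):
        if ok and not reachable.get((dst, src), False):
            one_way.append((src, dst))
    return one_way
```

For the situation shown above, the input would record that 192.168.56.101 reaches 192.168.56.100 on port 3002 while the reverse fails, and `find_one_way_links` would report that single pair.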
Network state on node 100
[root@c100 ~]# netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.100:3002     0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3003            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.100:3002     192.168.56.101:58930    ESTABLISHED
tcp        0     52 192.168.56.100:22       192.168.56.1:52622      ESTABLISHED
tcp        0      0 192.168.56.100:22       192.168.56.1:52188      ESTABLISHED
tcp        0      0 192.168.56.100:22       192.168.56.1:52031      ESTABLISHED
tcp6       0      0 :::3306                 :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
[root@c100 ~]#
Network state on node 101
[root@c101 ~]# netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.101:3002     0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3003            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.101:22       192.168.56.1:55739      ESTABLISHED
tcp        0      0 192.168.56.101:58930    192.168.56.100:3002     ESTABLISHED
tcp        0     52 192.168.56.101:22       192.168.56.1:55723      ESTABLISHED
tcp        0      0 192.168.56.101:22       192.168.56.1:55135      ESTABLISHED
tcp6       0      0 ::1:25                  :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
[root@c101 ~]#
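In the two listings above there is exactly one established connection on port 3002, opened by 101 towards 100, and none in the opposite direction. A rough sketch for extracting that from `netstat -nat` output follows; it assumes the six-column layout shown above, and the function name `peers_on_port` is hypothetical.

```python
def peers_on_port(netstat_text, port=3002):
    """Split ESTABLISHED TCP connections on `port` into inbound and outbound peers.

    inbound:  remote IPs connected to our local `port`
    outbound: remote IPs whose `port` we are connected to
    """
    wanted = str(port)
    inbound, outbound = set(), set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        _local_ip, _, local_port = fields[3].rpartition(":")
        remote_ip, _, remote_port = fields[4].rpartition(":")
        if local_port == wanted:
            inbound.add(remote_ip)
        elif remote_port == wanted:
            outbound.add(remote_ip)
    return inbound, outbound
```

On node 100 this gives one inbound peer (192.168.56.101) and no outbound peers; on node 101 it gives the mirror image, which matches the one-way connectivity seen in the telnet tests.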
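5. How to reproduce the problem
The condition is that node A can connect to node B, but node B cannot connect to node A. Enabling the firewall on node A is enough to reproduce it.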