[ZooKeeper Study] Part 2: An Introduction to the ZooKeeper Protocol

Date: 2019-11-12


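The previous part covered the basics of ZooKeeper; this part introduces the protocol ZooKeeper uses. Only by understanding that protocol can we properly understand how the ZooKeeper source code is implemented. ZooKeeper uses the Zab (ZooKeeper Atomic Broadcast) protocol, which builds on ideas from the Paxos algorithm. This part therefore covers the two in the following order: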

The Paxos algorithm
The Zab protocol

The Paxos algorithm

Let us start with Paxos. Whenever ZooKeeper comes up, the Paxos algorithm and Leslie Lamport are usually mentioned as well.

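Paxos is the soul of ZooKeeper. It is a message-passing consensus algorithm that Leslie Lamport proposed in 1990. The problem Paxos solves is how a distributed system reaches agreement on a single value (a decision). In ZooKeeper, its application scenario is leader election.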

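Lamport first presented the algorithm in 1990 in the paper The Part-Time Parliament, but that paper is rather obscure and hard to follow (there are also some anecdotes around it; see Lamport's own notes at the article link). So in 2001 he wrote Paxos Made Simple, which he explains as follows: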

At the PODC 2001 conference, I got tired of everyone saying how difficult it was to understand the Paxos algorithm, published in [122]. Although people got so hung up in the pseudo-Greek names that they found the paper hard to understand, the algorithm itself is very simple. So, I cornered a couple of people at the conference and explained the algorithm to them orally, with no paper. When I got home, I wrote down the explanation as a short note, which I later revised based on comments from Fred Schneider and Butler Lampson. The current version is 13 pages long, and contains no formula more complicated than n1 > n2.

The abstract of Paxos Made Simple is a single sentence:

The Paxos algorithm, when presented in plain English, is very simple.

That paper describes the execution of the Paxos algorithm as follows:

Phase 1.

(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.

(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Phase 2.

(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.

(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

That is almost all there is to Paxos, so the passage is quoted as-is rather than paraphrased sentence by sentence.
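The two phases quoted above can be sketched in a few lines of Python. This is a minimal in-process illustration, not ZooKeeper code: the names `Acceptor` and `propose` are hypothetical, messages are plain method calls, and the proposer contacts every acceptor rather than only a majority.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = -1   # highest prepare number this acceptor has answered
    accepted_n: int = -1   # number of the proposal it has accepted (-1 = none)
    accepted_v: Any = None  # value of the proposal it has accepted

    def prepare(self, n: int) -> Optional[Tuple[int, Any]]:
        """Phase 1(b): promise, and report the highest-numbered accepted proposal."""
        if n > self.promised_n:
            self.promised_n = n
            return (self.accepted_n, self.accepted_v)
        return None  # the request is ignored

    def accept(self, n: int, v: Any) -> bool:
        """Phase 2(b): accept unless a higher-numbered prepare was already answered."""
        if n >= self.promised_n:
            self.accepted_n = n
            self.accepted_v = v
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: Any) -> Optional[Any]:
    """Run both phases for proposal number n; return the chosen value, or None."""
    majority = len(acceptors) // 2 + 1

    # Phase 1(a): send prepare(n) and collect the promises.
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None

    # Phase 2(a): reuse the value of the highest-numbered reported proposal,
    # or the proposer's own value if no acceptor reported one.
    prev_n, prev_v = max(promises, key=lambda p: p[0])
    v = prev_v if prev_n >= 0 else value

    accepted = sum(1 for a in acceptors if a.accept(n, v))
    return v if accepted >= majority else None
```

With three acceptors, a first call `propose(acceptors, 1, "x")` chooses "x"; a later proposer using number 2 and a different value is forced to adopt "x" as well, which is the property that makes the chosen value stable.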

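The Zab protocol

Zab is a high-performance broadcast protocol aimed at primary-backup systems, designed specifically for ZooKeeper. For the details, see the paper "Zab: High-performance broadcast for primary-backup systems".

Compared with Paxos, Zab has the following advantages: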

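State consistency guarantee. To guarantee state consistency, ZooKeeper defines two safety properties:
- Total order: if message a is sent before message b, every server sees a and b in that same order.
- Causal order: if message a happens before message b (a causes b) and both are sent, then a is always executed before b.
More efficient recovery from failure. The state consistency guarantee keeps concurrent transactions from multiple ZooKeeper clients correct. So when a leader dies, the new leader only needs to obtain the largest current transaction id from the majority that elected it and use that as its recovery point. The leader then has to synchronize with at most one process, the one holding the largest transaction id (if there are several, the state consistency guarantee means any of them will do). In Paxos, by contrast, the same transaction (strictly speaking, the same sequence number) can receive several different votes, and different processes may have accepted different values for it, so the transaction id alone cannot be used to pick a recovery point. The new leader has to find the largest current transaction id and then rerun Phase 1 for every transaction id between the last one it has committed and that largest id in order to learn the value that was finally committed. That recovery process is complicated and inefficient.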
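The difference in recovery cost can be made concrete with a small sketch. Both function names are hypothetical, and the inputs are assumed to have been collected already: a map from each quorum member to its largest transaction id for Zab, and two sequence numbers for Paxos. Treating the Paxos range as ending at the largest id inclusive is also an assumption.

```python
from typing import Dict, List


def zab_recovery_source(quorum_last_zxid: Dict[str, int]) -> str:
    """Zab: sync with any one quorum member that holds the largest transaction id."""
    return max(quorum_last_zxid, key=quorum_last_zxid.get)


def paxos_slots_to_relearn(last_committed: int, highest_seen: int) -> List[int]:
    """Paxos: every sequence number after the last committed one needs Phase 1 again."""
    return list(range(last_committed + 1, highest_seen + 1))
```

For example, `zab_recovery_source({"s1": 5, "s2": 9, "s3": 7})` names a single server to copy from, whereas `paxos_slots_to_relearn(3, 6)` lists three separate sequence numbers that each require another round of Phase 1.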

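The consistency Paxos provides does not meet ZooKeeper's requirements. Paxos only guarantees that the replicas eventually agree on the value for each sequence number; if the requests depend on one another, Paxos by itself may not preserve those dependencies. Here is an example. Suppose the initial leader of a Paxos system is P1. It issues two transactions, <t1, v1> (meaning the transaction with sequence number t1 writes the value v1) and <t2, v2>, and crashes partway through. The next leader, P2, issues the transaction <t1, v1'>. Then a third leader, P3, takes over, collects the state, and arrives at the final execution sequence <t1, v1'> followed by <t2, v2>, that is, P2's t1 first and P1's t2 second. Mapped onto ZooKeeper operations: P1's transaction t1 creates "/a" and its transaction t2 creates "/a/test", while P2's transaction t1 creates "/b". After collecting the state, P3 concludes that "/b" is created first and "/a/test" second. But in ZooKeeper a child node can be created only if its parent already exists, so "/a/test" can be created only after "/a" has been created successfully. Creating "/a/test" after "/b" therefore fails, which is not the result we want.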
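The failure in this example is easy to reproduce with a toy model of the znode tree. `ZNodeTree` and `apply_all` are hypothetical names, and returning `False` stands in for the error a real ZooKeeper server would report; only the parent-must-exist rule is modeled.

```python
from typing import List


class ZNodeTree:
    """Toy znode namespace: a path can be created only if its parent exists."""

    def __init__(self) -> None:
        self.paths = {"/"}  # the root always exists

    def create(self, path: str) -> bool:
        parent = path.rsplit("/", 1)[0] or "/"
        if path in self.paths or parent not in self.paths:
            return False  # duplicate node or missing parent
        self.paths.add(path)
        return True


def apply_all(ops: List[str]) -> List[bool]:
    """Apply create operations in order on a fresh tree and report each outcome."""
    tree = ZNodeTree()
    return [tree.create(path) for path in ops]
```

P1's intended order, `apply_all(["/a", "/a/test"])`, succeeds on both steps. The order P3 ends up with, `apply_all(["/b", "/a/test"])`, succeeds on the first step and fails on the second because "/a" was never created.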

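To guarantee this, Zab ensures that transactions issued by the same leader are applied in order, and that a newly elected leader can issue new transactions only after all transactions of the previous leader have been applied. Put vividly, the core idea of Zab is that at any moment there is only one leader. Every update transaction is initiated by the leader, which propagates it to all replicas (called followers). The update uses a two-phase commit protocol: as soon as a majority of nodes have prepared successfully, the leader tells them to commit. Each follower must apply transactions in the order in which the leader asked it to prepare them. Because transactions handled by Zab are never rolled back, Zab's two-phase commit includes a small optimization: for a batch of transactions, the leader only needs to announce the commit of the one with the largest zxid, and each follower then commits all earlier ones as well.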
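The ordered prepare and the cumulative commit can be sketched as follows. This is a simplified in-process model under stated assumptions, not ZooKeeper's implementation: `Follower`, `Leader`, and `broadcast` are hypothetical names, zxids are plain increasing integers, there is no networking or failure handling, and the leader counts itself toward the majority.

```python
from typing import List, Tuple


class Follower:
    def __init__(self) -> None:
        self.log: List[Tuple[int, str]] = []  # prepared proposals, in leader order
        self.applied: List[str] = []          # committed transactions, in order
        self.committed_upto = 0

    def propose(self, zxid: int, txn: str) -> bool:
        """Prepare step: log the proposal only if it keeps zxid order; True = ack."""
        if self.log and zxid <= self.log[-1][0]:
            return False
        self.log.append((zxid, txn))
        return True

    def commit(self, zxid: int) -> None:
        """Cumulative commit: apply every logged proposal up to and including zxid."""
        for z, txn in self.log:
            if self.committed_upto < z <= zxid:
                self.applied.append(txn)
        self.committed_upto = max(self.committed_upto, zxid)


class Leader:
    def __init__(self, followers: List[Follower]) -> None:
        self.followers = followers
        self.next_zxid = 1

    def broadcast(self, txns: List[str]) -> int:
        """Propose txns in order, then send one commit for the largest acked zxid."""
        quorum = (len(self.followers) + 1) // 2 + 1  # majority of the ensemble
        last_acked = 0
        for txn in txns:
            zxid = self.next_zxid
            self.next_zxid += 1
            acks = 1 + sum(1 for f in self.followers if f.propose(zxid, txn))
            if acks < quorum:
                break  # stop at the first proposal without a majority
            last_acked = zxid
        for f in self.followers:
            f.commit(last_acked)
        return last_acked
```

A single `commit(last_acked)` message is enough for a whole batch because every follower applies all logged proposals with a smaller zxid first, in the order the leader proposed them.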

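What remains is how to make the leader reliable. Because a leader can crash, a leader election mechanism is introduced. Leader election is based on the Paxos protocol: to become leader, a node must have the support of a majority. In addition, the following points are worth knowing: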

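The leader and the followers use heartbeats to detect failures.
When a follower detects that the leader's heartbeat has failed, it starts a new leader election. A follower that tries to become the new leader must first win the support of a majority and then synchronize transactions from the node with the most recent state; only after that can it formally act as leader and issue new transactions.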

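The biggest question in leader election is whether the new leader should carry on the old leader's state. There are two cases, depending on when the old leader crashed: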

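The old leader crashes before COMMIT (the proposal has already been written to its local log).
The old leader crashes after COMMIT, but only some of the followers have received the commit request.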

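In the first case, only the old leader knows about this data. When the old leader restarts, it must synchronize with the new leader and delete this data locally in order to keep the state consistent.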

In the second case, the new leader should be able to obtain the latest data committed by the old leader through a majority.

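After restarting, the old leader may still believe it is the leader and keep sending its unfinished requests, so that two leaders exist at once and the algorithm breaks. ZooKeeper solves this by putting leader information into the id of every message, which ZooKeeper calls the zxid. A zxid is a 64-bit number: the high 32 bits hold the leader information, also called the epoch, which is incremented on every leader change; the low 32 bits are the message counter, which restarts from 0 on a leader change. With the zxid, a follower can easily tell whether a request comes from an old leader and reject it.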
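The zxid layout described above maps directly onto bit operations. The sketch below assumes only what the text states (a 64-bit id, epoch in the high 32 bits, counter in the low 32 bits); the helper names and the `EpochCheckingFollower` class are hypothetical and are not ZooKeeper's API.

```python
COUNTER_MASK = 0xFFFFFFFF  # low 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack the epoch into the high 32 bits and the counter into the low 32 bits."""
    return ((epoch & COUNTER_MASK) << 32) | (counter & COUNTER_MASK)


def epoch_of(zxid: int) -> int:
    return zxid >> 32


def counter_of(zxid: int) -> int:
    return zxid & COUNTER_MASK


def first_zxid_of_next_epoch(zxid: int) -> int:
    """On a leader change the epoch is incremented and the counter restarts at 0."""
    return make_zxid(epoch_of(zxid) + 1, 0)


class EpochCheckingFollower:
    """Rejects requests whose zxid carries an epoch older than the current one."""

    def __init__(self, current_epoch: int) -> None:
        self.current_epoch = current_epoch

    def accepts(self, zxid: int) -> bool:
        return epoch_of(zxid) >= self.current_epoch
```

Because the epoch occupies the high bits, comparing two zxids as plain integers already orders them by epoch first and by counter second, so any zxid from a newer leader is larger than every zxid from an older one.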

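In summary, Zab is in fact derived from Paxos. Paxos does not guarantee the logical order between requests and only considers a total order of the data; Zab fills this gap. Thanks to the leader, Zab also reduces Paxos's two phases to a single phase (Phase 2). Finally, to keep the leader reliable, it implements a leader election mechanism based on Paxos.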