zookeeper实现分析之二：ZAB

时间 2019-11-17

标签 zookeeper 实现分析之二 zab 栏目 Zookeeper 繁體版

原文原文链接

一年多前学习zookeeper时作的笔记，主要是翻译自“ZooKeeper's atomic broadcast protocol:Theory and practice”，并添加了本身的一些理解，整理一下做为一篇博客贴出来，后续有时间会分析一下在zookeeper源码中，zab是如何实现的，以及zab与paxos的区别。 -------------------------------------------------------------------------- <h3>1 Consistency Guarantees</h3> Zookeeper不能保证强一致性，客户端能看到older数据。Zookeeper提供顺序一致性。 Zookeeper的一致性保证： 一、顺序一致性：客户端的更新通知是严格按照顺序进行发送。 二、原子性：更新操做要么成功要么失败，没有中间状态。 三、Single system image：无论客户端链接哪个服务器，客户端看到的都是the same view of service。 四、Reliability：一旦一个更新成功，那么那就会被持久化，直到客户端用新的更新覆盖这个更新。 五、Timeliness：Zookeeper确保客户端在必定时间内（几十秒）完成或看到系统的数据更新。 那么zab是如何确保这些一致性相关的特色。 Zab的两个重要的要求以下： 一、支持同时处理多个outstanding的客户端写操做。一个outstanding事务的含义是事务已经被提交但没有被commit。 二、有效的从crash状态恢复过来。 Zookeeper能处理并发地处理多个客户端的outstanding 写请求，而且以FIFO顺序commit这些写操做。FIFO的特性对于zookeeper可以有效的从crash状态恢复过来也是相当重要的。 原始的paxos协议不能同时处理多个outstanding transaction，paxos不要求通讯时的FIFO通道特性，paxos能够容忍消息丢失和从新排序。 在paxos中，从primary crash中恢复过来并保证事务的序列化的能力不是足够有效，而zab改进了这方面的能力，采用了一个事务ID来实现事务的totally order。 Zookeeper的性能要求以下： 一、低延时（low latency）。 二、 Good throughput。高吞吐量。 三、 Smooth failure handling。容错。 在这种状况下，为了能有效地更新一个new primary的应用程序状态，在zab中new primary会被指望拥有最高事务ID的进程，整个集群能够经过从new primary中拷贝事务，从而全部数据副本均可以达到一个一致性。 而在paxos，没有采用相似zab的序列号，因此一个新的primary须要执行paxos算法的第一阶段，以便于获取到全部primary没有学习到值。 <h5>2 ZAB协议和流程介绍</h5> Zab协议有四个阶段，以下： 一、阶段0：Leader election 二、阶段1：Discovery（或者epoch establish） 三、阶段2：Synchronization（或者sync with followers） 四、阶段3：Broadcast 在Zab协议的实现时，合并为三个阶段： 一、 Fast Leader Election 二、 Recovery Phase 三、 Broadcast Phase 在实现中将discovery和synchronization这两个phase合并成了broadcast phase。 ZAB的流程图以下所示： <a href="http://static.oschina.net/uploads/img/201312/22194147_iKYR.jpg"><img title="1" style="border-right: 0px; border-top: 0px; display: inline; border-left: 0px; border-bottom: 0px" height="322" alt="1" src="http://static.oschina.net/uploads/img/201312/22194147_K2bP.jpg" width="500" border="0" /></a> CEPOCH = Follower sends its last promise to the prospective leader NEWEPOCH = Leader proposes a new epoch e' ACK-E = Follower acknowledges the new epoch proposal NEWLEADER = Prospective leader proposes itself as the new leader of epoch e' ACK-LD = Follower acknowledges the new leader proposal COMMIT-LD = Commit new leader proposal PROPOSE = Leader proposes a new transaction ACK = Follower acknowledges leader proosal COMMIT = Leader commits proposal <h3>3 Leader election</h3> <h4>3.1 leader election后置条件</h4> Leader election可能有多种方式，但在这里咱们只分析一种，fast leader election。 Leader election后置条件： 一、条件：Leader election这个过程必须保证选举出来的leader能看到全部历史的commited transactions。 二、缘由：这个后置条件是为了确保在后续recovery phase步骤中zookeeper replicas的一致性。它是防止follower中包含leader中没有的committed transaction，并且在recovery phase中只有leader向follower和observer同步，follower不会向leader同步，若是出现这种状况，那么zookeeper的replicas就出现了不一致的状况。 因此为了达到这个后置条件，leader election须要选择出一个拥有highest lastZxid的leader。 那么fast leader election是如何选择出一个拥有highest lastZxid的leader？ <h4>3.2 Fast leader election介绍</h4> 在进行fast leader election过程当中，为了选举出一个拥有highest lastZxid的leader（能看到最新的历史committed transaction），处于election状态的peer servers会对其余peer server进行表决。Peer server会交换他们的vote（选举）的通知。同时当peer server发现一个拥有recent history的peer server（peer server拥有higher history Zxid），peer server会更新其自身的vote。当选举出一个leader后，而后进入recovery phase，fast leader election就结束了，假如vote选举出来leader就是peer server自身，那么peer server变成leading状态（fast leader election过程当中，peer server自己的状态是following），其余的peer server则进入following状态。若是后续的recovery phase和broadcast phase发生任何失败的状况，那么peer server都会回到election状态，从新启动fast leader election。 <h4>3.3 Epoch number</h4> Epoch是用于区分每个round，每一次创建一个新的leader-follower关系，都会有一个惟一的epoch值去标识。就好像皇帝登基必须得有一个年号，与以前或以后的皇帝进行区分。 Epoch在两个过程当中用到：一、leader election时。二、recovery过程（新创建一个leader-follower关系）。 一、过程1：每个fast leader election开始时epoch的值都为0，epoch的值会在fast leader election过程当中进行更新。 我的理解每一个zookeeper节点刚启动时没有leader-follower关系视图，那么它就会认为本身是leader，而后发起leader electoin，那么这个leader election的epoch值为0；在leader election过程当中，将epoch更新到currepoch值（其余peer server中的最高的epoch）。使用epoch number来区分不一样的fast leader election过程。就好像你想当皇帝，定了一个年号发起登基过程，若是当前有其余皇帝存在，且他的年号比你的年号更新，那么你就得更新年号，从新发起登基，谁支持的人多谁就是皇帝；若是没有其余皇帝存在，但有其余人也在登基，那么你们就一块儿比比，看谁的年号更新，看谁的资格更老（一样的epoch，vote值越大越优先），那么选举谁当皇帝。 二、过程2：在一个faster leader election结束后，新产生的leader会获取epoch，其值为lastest history zxid的高32位，而后对epoch自增，而后用新的epoch值做为新zxid的高32，zxid的低32位为0。一旦当上皇帝后，就发布一个新的年号。 这里有矛盾的地方： 两个过程的epoch是不是同一个？过程1的epoch是不会持久化的。过程2中由于zxid是持久化的，那么至关于epoch也是持久化的。因此不理解。 <h4>3.4 选取出highest zxid的leader</h4> 为了能选举出highest zxid的leader，那么就须要对vote进行比较。 对于peer server集合 PSET = {p1, p2, p3, …., pn}，其中{1, 2, 3, …. , n }是peer server的ID. 那么Pi的vote能够用pair(Zi, i)表示，Zi是Pi的highest zxid，也是lastest zxid。 那么两个vote比较大小的准则是：    (Zi, i) >= (Zj, j) ： Zi > Zj 或者( Zi = Zj && i >= j ) 每个peer server都有一个惟一的ID，且都知道其replicas中保存的latest zxid，那么全部的peer就会以必定顺序进行排序。 <h4>3.5 Fast leader election持久化</h4> 在fast leader election过程当中，不会对任何数据进行持久化，不会把过程当中产生的值写入到disk中。包括epoch number和ID但在fast leader election会使用已经持久化的latest zxid。 <h4>3.6 Fast leader election过程和伪码</h4> 进行Fast leader election的先决条件： 一、每一个peer server都知道其余peer的ip地址，并知道peer server的总数。 二、每一个peer server一开始都是发起一个vote，选取本身为leader。向其余全部的peer server发送vote的notification，并等待回复。 三、根据peer server的状态处理vote notification或则notifincation的回复. 若是peer server处于election状态，那么peer server会收到其余peer server的vote，若是收到的vote值更大，那么peer server会更新其vote。 若是peer server不处于election状态，那么peer server会更新其所看到的leader-follower关系。 无论哪一种状况下，当peer server检测到大部分peers持有相同的vote时，那么它会返回 Fast leader election逻辑伪代码 主要有两个逻辑分支： 一、正常过程，vote的notification的回复的peer server的状态为election 二、另外过程，vote的notification的回复的peer server的状态为leading/following 执行leader election的状况较为复杂，多是一个服务器节点新加入到zookeeper集群中。也多是zookeeper集群刚启动，你们都处于leader election状态。以上两个逻辑分支能处理这些状况。 ***初始化vote和peer server状态*** 1 Peer P: 2 timeout <---T0 // use some reasonable timeout value 3 ReceivedVotes <--- 0; OutOfElection <--- 0; // key-value mappings where keys are server ids 4 P:state <--- election;  P:vote <---(P:lastZxid; P:id);  P:round <--- P:round + 1 1-4是初始化过程，设置超时时间，receivedVotes是收到的vote noficaton回复。 进入election状态，根据lastZxid和ServerID生成一个vote，vote的epoch自增。 ReceivedVotes做为一个结果集合，在收到全部vote后，进行表决。OutOfElection用于保存状态为leading/folling的rspvote，用于表决先存在的leader/follower是否有效。 5 Send notification (P:vote, P:id, P:state, P:round) to all peers 向全部的peer server发送notification，一个notification包括vote，id，peer state，和vote的epoch number。 ***开始接收notification回复的循环处理*** 6 while P:state = election do 7     n <---(null if P:queue = 0; for timeout milliseconds, otherwise pop from P:queue) 8     if n = null then 9          Send notification (P:vote, P:id, P:state, P:round) to all peers 10        timeout <---(2* timeout), unless a predefined upper bound has been reached 8-10是当notification回复为空时，有两种状况，一种是信令发送出去回复超时，第二种是没有创建于peer server的链接. 若是是第一种状况，那么从新发送notification；若是是第二种状况，那么创建与peer server的tcp链接. 11    else if n:state = election then //当nofication回复不为空，且peer server的状态也是election时 12         if n:round > P:round then 13               P:round <--- n:round 14               ReceivedVotes <---0 15               if n:vote > (P:lastZxid; P:id) then P:voteßn:vote 16               else P:vote <---(P:lastZxid; P:id) 17               Send notification (P:vote, P:id, P:state, P:round) to all peers 这个逻辑分支是notification回复中resvote的epoch要大于vote 的epoch（说明回复中的peer vote的zxid > vote的zxid），那么vote失效了，须要更新vote，比较回复中的两个vote值的大小，选择值大的vote，而后从新发送notification。 18         else if n:round = P:round and n:vote > P:vote then 19              P:vote <--- n:vote 20              Send notification (P:vote, P:id, P:state, P:round) to all peers       当回复中的rspvote的epoch等于vote的epoch，但rspvote > vote，那么更新vote信息       而后从新将vote向全部的peer server发送。 21          else if n:round < P:round then goto line 6      Resvote的epoch小于vote的epoch，那么这个回复是无效的，        直接忽略，继续下一个循环。 22          Put(ReceivedVotes(n:id); n:vote; n:round)     将rspvote放入到ReceivedVotes中。 23         if  ReceivedVotes = SizeEnsemble then 24                DeduceLeader(P.vote.id);  return P.vote      若是已经收到了全部peer server的vote，若是vote中的leaderID == currentPeer自己，      那么currPeer为leader，结束并返回这次vote结果。 25         else if P.vote has a quorum in ReceivedVotes                        and there are no new notifications within T0 milliseconds then 26                DeduceLeader(P.vote.id);  return P.vote        若是收到超过半数peer server的vote，那么vote中的leaderID == currentPeer自己，           那么currPeer为leader，结束并返回这次vote结果. 27          end      逻辑分支1总结：          若是rspvote中epoch > vote epoch，更新epoch和vote后从新发起vote          若是rspvote中epoch < vote epoch，无效rspvote          其余，都保存在结果集合中，若是有rspvote>vote，那么将vote更新到rspvote；等待全部rspvote都收到，那么vote的值应该为结果集合中最大值，若是结果集合超过半数，那么这次vote生效，leader为vote中的serverID。若是serverID为自己的serverID，那么currpeer的状态为leader不然为follower 28    else // state of n is LEADING or FOLLOWING 当rspvote的状态为following或leading，说明vote以外已经存在了一个leader，那么此段逻辑主要是分红两部分：一部分是vote的表决；另外一部分是vote以外的leader/follower表决. 29         if n:round = P:round then 30             Put(ReceivedVotes(n.id); n:vote; n:round) 31             if n:state = LEADING then 32                 DeduceLeader(n:vote:id); return n:vote 33             else if n:vote:id = P:id and n:vote has a quorum in ReceivedVotes then 34                 DeduceLeader(n:vote:id); return n:vote 35             else if n:vote has a quorum in ReceivedVotes and the voted peer n:vote:id is in                      state LEADING and n:vote:id 2 OutOfElection then 36                  DeduceLeader(n:vote:id); return n:vote 37             end 38         end 以上部分是vote的表决，以上的逻辑跟代码中不符合，代码中的逻辑是： 若是rspvote的epoch==vote的epoch，放入到receivedVots中，若是rspvote的状态是leader 且集合中的rspvote超过半数，那么vote的表决的leader就是rspvote的leader。 39         Put(OutOfElection(n:id); n:vote; n:round) 40         if n:vote:id = P:id and n:vote has a quorum in OutOfElection then 41             P:round <--- n:round 42             DeduceLeader(n:vote:id); return n:vote 43         else if n:vote has a quorum in OutOfElection and the voted peer n:vote:id is in state                      LEADING and n:vote:id 2 OutOfElection then 44             P:round <--- n:round 45             DeduceLeader(n:vote:id); return n:vote 46          end 以上部分是对vote以外的leader/follower进行表决，OutOfElection是用来存放状态为leader/follow的rspstate，若是OutOfElection的rspvote超过半数，那么说明election以外的leader./follow是有效地， 47  end    逻辑分支2总结：这部分是考虑到可能有部分peer server维持leader/follower的状态，部分peer server处于election状态，若是维持leader/follower状态的peer server数据过半，那么leader/follower就是有效地。或者vote的epoch等于leader的epoch，那么若是有半数以上的rspvote，那么当前的leader/follower也是有效的。 <h3>4 Discovery and synchronization</h3> 在broadcast阶段，zookeeper集群必须有一个leader peer，zookeeper集群是primary/backup模式，那么leader就是primary。Discovery和synchronization这两个阶段的做用就是将所有的zookeeper节点带入到一个最终一致的状态，特别是当从crash中恢复时。这两个阶段组成了zab的recovery部分，对于容许多个独立事务的状况下，保证事务的顺序起着关键做用。 无论在discovery、synchronization仍是broadcast，一旦发生错误，那么均可以回到leader election过程。 用户若是须要使用zookeeper服务，那么必须链接一个zookeeper节点。用户向链接的服务器提交写操做，而后zab协议层会执行一个broadcast；假如用户向follower提交写操做，那么follower会把写操做发送给leader；若是leader收到写操做，leader会执行，而后向全部follower扩散这个写操做对应的数据更新。读操做能够由与用户相链接的zookeeper节点直接完成。用户能够经过发送sync命令保证数据副本的更新。 在zab协议中，zxid(transaction identifiers)对于实现顺序一致性十分关键。在zookeeper中事务能够用(v, z)表示，v是新状态（znode），z则是zxid，一个identifier。那么一个zxid也是一个pair(e, c)，e是一个primary Pe（能够理解为leader）的epoch number，c是一个整数值，做为计数器使用。Primary每产生一个新的事务，那么计数器c就会+1。 当一个新的epoch开始时，一个新的leader会被激活，此时c会被设置为0，e会在前一个epoch的值上+1。 在代码实现中e是zxid的高32位，c是zxid的低32位。 如下四个变量构成了一个peer的持久化状态： 一、History：已经被接受的事务提案(transaction proposal)。 二、acceptedEpoch：最近收到的NEWEPOCH信令中的epoch number。 三、currentEpoch：最近收到的NEWLEADER信令中的epoch number。 四、lastZxid：history中的最近的zxid。 <h3>5 Discovery过程</h3> 在这个阶段，followers会跟他们的将来预期中的leader进行通讯，准leader会收集accepted follower(已经创建链接的)的latest transactions，这个阶段的目的是发现quorum peer server中的highest histroy transaction，而后创建一个新的epoch，这样就能够防止previous leader不会commit 新的proposals（由于previous leader的epoch已通过期了）。 在discovery阶段的开始，一个follower peer会创建于准leader的leader-follower connection。 Follower同时只是链接一个leader。假如一个peer P不是leading状态，其余peer会考虑p是一个准leader，任何其余leader-follower链接都会被p拒绝；一样leader-follower链接的拒绝或其余的failure能将follower从新带入到leader election状态。 1 Follower F: 2 Send the message FOLLOWERINFO(F:acceptedEpoch) to L 3   upon receiving NEWEPOCH(e0) from L do 4      if e0 > F:acceptedEpoch then 5          F:acceptedEpoch <--- e0 // stored to non-volatile memory 6          Send ACKEPOCH(F:currentEpoch; F:history; F:lastZxid) to L 7           goto Phase 2 8      else if e0 < F:acceptedEpoch then 9           F:state <--- election and goto Phase 0 (leader election) 10     end 11 end 这个过程是follower端，follower向准leader发送FOLLOWERINFO信令，告诉leader本身的信息，最重要的就是把accepted epoch发送给leader。而后接收leader的NEWLEADER信令，NEWLEADER信令中带有new epoch（这个epoch表示这这一轮过程，每一次创建leader-follower关系，都会有一个新的epoch来惟一标识，与previous leader-follower进行区分）。Follower检查这个new epoch是否有效，若是有效，follower更新自身的epoch并回复一个ACKEPOCH，上报当前follower的状态，进入下一个阶段。若是无效，那么follower会从新跳到leader electoin阶段。 12 Leader L: 13 upon receiving FOLLOWERINFO(e) messages from a quorum Q of connected followers do 14      Make epoch number e0 such that e0 > e for all e received through FOLLOWERINFO(e) 15      Propose NEWEPOCH(e0) to all followers in Q 16 end 17 upon receiving ACKEPOCH from all followers in Q do 18      Find the follower f in Q such that for all f0 2 Q n ffg: 19          either f0:currentEpoch < f:currentEpoch 20          or (f0:currentEpoch = f:currentEpoch) ^ (f0:lastZxid _z f:lastZxid) 21      L:history <--- f:history  // stored to non-volatile memory 22      goto Phase 2 23 end 这个是leader端的recovery过程，leader会生产一个new epoch，首先接收全部follower的epoch，肯定new epoch要大于全部的follower epoch。而后向全部follower发送NEWEPOCH信令，将new epoch下发到全部的follower中。 等待follower的ACKEPOCH回复，若是全部的follower的currEpoch和zxid都小于等于leader的currEpoch和zxid，那么进入下一个过程。 <h3>6 Synchronization过程</h3> 这个过程是将follower的数据副本与准leader的历史数据进行同步，使得zookeeper集群的数据处于一致的状态。同步的方向是准leader向follower同步。同步的过程以下：leader与follower进行通讯，发送NEWLEADER信令，带有历史事务的highest zxid；follower收到这些信令后，决定是否更新历史事务，而后响应leader。当leader看到quorum follower的响应后，就会向它们发送commit信令。在这以后leader就创建完成了。 1 Leader L: 2 Send the message NEWLEADER(e0;L:history) to all followers in Q 3 upon receiving ACKNEWLEADER messages from some quorum of followers do 4      Send a COMMIT message to all followers 5      goto Phase 3 6 end 这是leader端的过程，发送NEWLEADER，而后接受响应，最后发送commit，至此leader创建完毕。 7 Follower F: 8 upon receiving NEWLEADER(e0;H) from L do 9      if F:acceptedEpoch = e0 then 10         atomically 11             F:currentEpoch <--- e0 // stored to non-volatile memory 12             for each (v; z) in H, in order of zxids, do 13                  Accept the proposal (e0; (v; z)) 14             end 15             F:history <---H // stored to non-volatile memory 16         end 17         Send an ACKNEWLEADER(e0;H) to L 18     else 19          F:state <--- election and goto Phase 0 20     end 21 end 22 upon receiving COMMIT from L do 23      for each outstanding transaction (v; z) in F:history, in order of zxids, do 24          Deliver (v; z) 25      end 26      goto Phase 3 27 end 这是follower端的流程，先是收到NEWLEADER信令，而后原子地更新epoch和历史事务，发送ACKNEWLEADER信令响应leader；而后等待commit信令，收到commit信令后进行处理，进入下一个阶段。 <h3>7 代码实现的Recovery phase</h3> 在实现discovery和synchronization时，没有严格分红两个阶段进行实现，在实现时进行了一些优化，合并成一个阶段实现，那么这个阶段就是recovery phase；recovery阶段就是将全部的zookeeper集群的数据副本进入到最终一致性地状态中，且创建出一个具备最高highest zxid的leader。 在实现中，第0阶段的fast leader election与第一阶段discovery紧密结合在一块儿，faster leader election在实现时作了一个优化，它会选择出一个most up-to-date的history（我的理解就是选择出一个具备最新的commit事务的peer server），那么这样的一个leader被选举出来后，在第一阶段就不须要去与followers通讯去发现latest history。 那么既然在fast leader election中包括了discovery阶段的责任，那么这个discovery阶段就能够被忽略，因此在实现时就将discovery和synchornization阶段合并成一个recovery阶段。这个阶段是在fast leader election以后，且认为leader拥有lastest history。 伪码： 1 Leader L: 2 L:lastZxid <--- (L:lastZxid:epoch + 1; 0) 3 upon receiving FOLLOWERINFO(f:lastZxid) message from a follower f do 4      Send NEWLEADER(L:lastZxid) to f 5      if f:lastZxid  <=  L:history:lastCommittedZxid then 6          if f:lastZxid  <=  L:history:oldThreshold then 7              Send a SNAP message with a snapshot of the whole database of L 8          else 9              Send a DIFF({committed transaction (v; z) in L:history : f:lastZxid < z}) 10        end 11     else 12         Send a TRUNC(L:history:lastCommittedZxid) message to f 13     end 14 end 15 upon receiving ACKNEWLEADER messages from some quorum of followers do 16     goto Phase 3 // Algorithm 3 17 end 以上是leader端的流程，先生存一个新的zxid和epoch，接收follower的FOLLOWERINFO信令（包含follower的lastzxid），而后向follower发送NEWLEADER(包含leader的zxid)。而后根据FOLLOWERINFO中带有的lastzxid对follower进行更新。分红三种状况……. History.lastCommittedZxid是最新committed的历史事务。History.oldThreshold是过久的历史提案，比leader上一次snapshot的时间还久。见2.6.2关于TRUNC的说明。 第一种状况是TRUNC，follower丢弃从leader.latestZxid到follower.lasterZxid之间的提案。 第二种状况是DIFF，follower接收新的提案从follower.lasterZxid到leader.lasterZxid之间的新提案。 第三种状况是SNAP，follower中的提案太旧，leader将snap更新到follower上。 18 Follower F: 19 Connect to its prospective leader L 20 Send the message FOLLOWERINFO(F:lastZxid) to L 21 upon L denies connection do 22     F:state <--- election and goto Phase 0 23 end 24 upon receiving NEWLEADER(newLeaderZxid) from L do 25     if newLeaderZxid:epoch < F:lastZxid:epoch then 26         F:state <--- election and goto Phase 0 27     end 28     upon receiving a SNAP, DIFF, or TRUNC message do 29         if got TRUNC(lastCommittedZxid) then 30             Abort all proposals from lastCommittedZxid to F:lastZxid 31         else if got DIFF(H) then 32             Accept all proposals in H, in order of zxids, then commit all 33         else if got SNAP then 34             Copy the snapshot received to the database, and commit the changes 35         end 36         Send ACKNEWLEADER 37         goto Phase 3 // Algorithm 3 38    end 39 end 以上是follower的流程，首先是向leader链接，而后发送FOLLOWERINFO信令，若是leader拒绝链接，那么follower从新回到leader election阶段。接收NEWLEADER信令，若是信令中带有的epoch无效（小于follower的epoch），那么follower从新回到leader election状态。 而后接收SNAP/DIFF/TRUNC信令，同步数据副本和zxid，最后回复ACKNEWLEADER信令。进入到下一个阶段。 这个同步的目的是让全部数据副本都进入一个最终一致性状态。为了达到这个目的，任何副本中的committed transactions必须以一样一种顺序，甚至已经被提交的transaction但没有被任何一个peer节点committ的事务必须被抛弃。SNAP和DIFF用于保证各个副本中的committed事务的顺序一致性；而TRUNC用于处理已经被提交但没有被committed的事务。 <h3>8 Broadcast phase</h3> Zookeeper peer之间的双向通道使用TCP链接实现，TCP通讯的FIFO序列化特性对于实现broadcast协议相当重要。 假如没有发生崩溃，那么peers会一直停留在broadcast阶段。第三阶段中只能有一个leader。 Broadcast的过程是leader与follower之间的一个两阶段的提交过程（two-phase commit） 一、 leader与follower的通信通道(communication channel)是一个FIFO，全部都是是按顺序处理。 二、 leader收到一个request后，会生成一个propose。而后执行两阶段提交. <a href="http://static.oschina.net/uploads/img/201312/22185922_040N.png"><img title="wps_clip_image-14769" style="border-top-width: 0px; display: inline; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px" height="126" alt="wps_clip_image-14769" src="http://static.oschina.net/uploads/img/201312/22185922_KHTe.png" width="244" border="0" /></a> Broadcast的伪码和流程 1 Leader L: 2 upon receiving a write request v do 3     Propose (e0; (v; z)) to all followers in Q, where z = (e0; c), such that z succeeds all zxid        values previously broadcast in e0 (c is the previous zxid's counter plus an increment of one) 4 end 5 upon receiving ACK((e0; (v; z))) from a quorum of followers do 6     Send COMMIT(e0; (v; z)) to all followers 7 end 以上是leader处理的两阶段提交的流程：首先leader受到写请求v，而后生成一个提案(e,(v,z))，向全部follower发送此提案的内容，而后等待follower的ack；若是ack超过半数，那么提案成立。向全部follower下发commit提案的命令。 8 // Reaction to an incoming new follower: 9 upon receiving FOLLOWERINFO(e) from some follower f do 10     Send NEWEPOCH(e0) to f 11     Send NEWLEADER(e0;L:history) to f 12 end 13 upon receiving ACKNEWLEADER from follower f do 14     Send a COMMIT message to f 15     Q <--- Q 并集 {f} 16 end 以上是一个新follower加入leader的流程：首先leader收到FOLLOWERINFO信令，而后向new follower发送NEWEPOCH信令，再发送NEWLEADER信令给new follower；等待new follower的ACKNEWLEADER，最后发送commit，至此new follower就加入到了集群中。 17 Follower F: 18 if F is leading then Invokes ready(e0) 19 upon receiving proposal (e0; (v; z)) from L do 20     Append proposal (e0; (v; z)) to F:history 21     Send ACK((e0; (v; z))) to L 22 end 23 upon receiving COMMIT(e0; (v; z)) from L do 24     while there is some outstanding transaction (v0; z0) in F:history such that z0 < z do 25         Do nothing (wait) 26     end 27     Commit (deliver) transaction (v; z) 28 end 这是follower的broadcast流程：接收到leader的提案，而后将提案写入到history中，而后发送响应。等待leader的commit信令，收到后执行commit 提案。 <h3>9 Zab所存在的问题</h3> <h4>9.1 acceptedEpoch和currentEpoch的做用</h4> 在recovery开始阶段，准leader甚至在与大部分follower成功创建链接以前就增长其epoch（包括在lastZxid内）值。由于在recovery阶段，follower在发现其epoch值要比准leader大时，会返回到leader election阶段。那么当准leader失去leader地位，并成为previous leader（其epoch比准leader要小1）的一个follower，那么准leader会发现previous leader的epoch值比其要小，那么它会返回到leader election阶段。这个现象会致使此peer一直在recovery阶段和leader election阶段之间循环。 因此使用lastZxid来存储epoch number，没有对一个tried epoch（我的理解是一个准leader在尝试成为leader时使用的epoch）和一个joined epoch（一个成功的leader所使用的epoch）进行区分。使用acceptedEpoch和currentEpoch的目的就是在于防止此类问题的发生。 <h4>9.2 Abandon follower proposal</h4> 假设一个集合{p1, p2, p3}，全部的peers都处于broadcast阶段，且都已经同步到了最新的committed事务，事务的ID是（e= 1, c= 3），p1为leader；一个新的提案，事务ID为（1, 4）已经被leader p1发出，但在p2和p3收到事务以前，p1就已经发生了崩溃（好比已经放到socket缓存区中），那么{p2, p3}会从新回到leader election，并选举出一个新的leader。当p1恢复正常了，此时p2已经成为了leader；那么根据fast leader election，在recovery阶段p2会将epoch设置为2（p2.latestZxid = （2, 0）），那么在broadcast阶段，已经新的提案已经被quorum接收和commit，它的zxid为（2， 1）。在这个时候leader p2的history.lastCommittedZxid = (2, 1)，而且p2的history.OlderThreshold = （1, 1）；那么p1从新启动后，p1会执行fast leader election，而后发现其余peer已经创建leader-follower关系，且p2是leader，那么p1会向发送FOLLOWERINFO（p1.latestZxid = （1, 4））。 在这种状况下， p1.lastestZxid(1,4) < p2.history.lastCommittedZxid(2, 1) && p2.history.oldThreshold（1, 1）< p1.lastestZxid (1, 4)，那么这种状况下leader p2须要向p1发送TRUNC信令，让follower放弃uncommitted proposal(1, 4)。 做者zy，QQ105789990node