This blog post surveys the common distributed consensus algorithms, focusing on comparing their similarities and differences, and on thinking through the design trade-offs each one makes.
Paxos was the first distributed consensus algorithm to be proposed.
See http://lamport.azurewebsites....
In practice, a node usually carries the proposer/acceptor/learner roles simultaneously.
Proposer: "A proposer sends a proposed value to a set of acceptors." It can be understood as a master node that is allowed to perform writes. Note that the Paxos protocol does not limit the number of proposers. In practice, to avoid making the central node that generates totally ordered proposal numbers a single point of failure, a leader-election protocol is used; see the leader discussion below.
See section 3 of the paper, Implementing a State Machine:
a central node must generate totally ordered proposal numbers, and only the proposer elected as leader may issue proposals.
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm.
Acceptor: "An acceptor may accept the proposed value." These are the nodes that form a majority, i.e. a quorum.
Learner: the passive role that accepts the voting outcome. It can be understood as a read-only replica.
Client: issues requests and receives responses.
The following diagram comes from https://en.wikipedia.org/wiki...
Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |
Because every highest proposal number has passed a majority accept, the following situation can never occur: a lower-numbered proposal overwriting a higher-numbered one, e.g. (proposal number 1, V1) overwriting (proposal number 2, V2).
Paxos allows multiple proposers to exist at the same time; the prepare phase guarantees that a higher proposal number is never overwritten, as the sketch below shows.
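A minimal Python sketch of the acceptor's side of these rules (the naming is mine, not from the paper): a promise raises the acceptor's floor, and any prepare or accept below that floor is rejected.

class Acceptor:
    """State kept by a single-decree Paxos acceptor (illustrative sketch)."""

    def __init__(self):
        self.promised = 0       # highest proposal number promised so far
        self.accepted_n = 0     # proposal number of the accepted value
        self.accepted_v = None  # the accepted value, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore anything numbered below n."""
        if n > self.promised:
            self.promised = n
            # hand back any accepted (n, v) so the proposer must
            # re-propose the highest value already accepted
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised, None)

    def accept(self, n, v):
        """Phase 2b: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n, v)
        return ("rejected", self.promised, None)

Once a majority has promised proposal number 2, an accept carrying (1, V1) is rejected by that same majority, so it can never displace (2, V2).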
The Paxos algorithm does not require proposal numbers to be generated in strict order, nor does it require learners to read strictly in proposal-number order.
Only under strict ordering is Paxos linearizable.
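Totally ordered, collision-free proposal numbers can still be minted without a central sequencer; a common trick (my illustration, not something the paper mandates) is to interleave a per-proposer counter with the node id:

def next_proposal_number(counter, node_id, num_nodes):
    """Unique, totally ordered proposal numbers without a central node:
    proposer node_id owns the residue class node_id mod num_nodes."""
    return counter * num_nodes + node_id

# with 3 nodes, proposer 1 mints 1, 4, 7, ... and proposer 2 mints 2, 5, 8, ...
print([next_proposal_number(c, 1, 3) for c in range(3)])  # [1, 4, 7]
print([next_proposal_number(c, 2, 3) for c in range(3)])  # [2, 5, 8]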
ZAB is the distributed consensus algorithm implemented by ZooKeeper.
See: http://www.cs.cornell.edu/cou...
It is important in our setting that we enable multiple outstanding ZooKeeper operations and that a prefix of operations submitted concurrently by a ZooKeeper client are committed according to FIFO order.
Paxos does not directly support linearizability. If multiple primaries execute multiple transactions concurrently, the order in which those transactions commit cannot be guaranteed, and the final results can end up inconsistent.
This inconsistency shows up as, e.g., 27A -> 28B turning into 27C -> 28B. The only way to avoid it is to restrict the system to one proposal at a time, which obviously cripples throughput; optimizing by batching transactions raises latency instead, and the batch size is hard to choose.
When multiple primaries execute transactions concurrently, different Paxos processes may hold different values under the same sequence number. A new master must run Paxos phase 1 for every value not yet learned, i.e. obtain the correct value through a majority.
ZAB achieves linear, i.e. totally ordered, broadcast (the paper's PO, primary order) through a single primary: the leader does the broadcasting.
First elect a leader, then abdeliver the leader's broadcasts.
New epoch: every successful election yields epoch_new > epoch_current.
Propose and commit correspond to Paxos's Promise/Accept.
Both propose and commit carry (e, (v, z)), as sketched after this list:
e: the epoch, corresponding to the master's term number
v: the transaction value to commit, i.e. the actual data
z: the transaction identifier (zxid), i.e. the commit version number
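A minimal sketch of that tuple, assuming the usual encoding of the zxid as an (epoch, counter) pair (field names are mine): the total order over transactions, and the epoch_new > epoch_current rule, both reduce to a plain tuple comparison.

from collections import namedtuple

# zxid: the epoch in the high-order part, a per-epoch counter in the low part
Zxid = namedtuple("Zxid", ["epoch", "counter"])

def zxid_less(a, b):
    """Total order over transactions: compare the epoch first, then the counter."""
    return (a.epoch, a.counter) < (b.epoch, b.counter)

# every successful election bumps the epoch (epoch_new > epoch_current),
# so all transactions of the new leader order after the old leader's
assert zxid_less(Zxid(epoch=1, counter=99), Zxid(epoch=2, counter=0))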
It is precisely this total-order guarantee on commits that makes the value changes on every node linearizable.
I have not yet found a precise description of the prospective leader in the paper; there is a loose one in Section V, ZAB IN DETAIL:
When a process starts, it enters the ELECTION state. While in this state the process tries to elect a new leader or become a leader. If the process finds an elected leader, it moves to the FOLLOWING state and begins to follow the leader.
After a majority of members ack, it is safe to commit and reply to the client. Per the discovery phase, the most recent history among all nodes will be chosen, so an ack from a majority of nodes is enough to report the write as successful.
A failure reply does not necessarily mean the write did not happen; the commit may have succeeded with only the response lost. A reported failure is therefore not definitely a failure, whereas a reported success is guaranteed to be a success.
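A sketch of the leader-side rule under those semantics (simplified; names are mine): commit once a majority acks, and treat anything less as unknown rather than failed.

def write_outcome(acks, cluster_size):
    """Leader-side rule: a proposal is durable once a majority acks,
    because discovery later adopts the newest history among a majority."""
    if acks * 2 > cluster_size:
        return "success"   # safe: every future quorum overlaps this one
    return "unknown"       # not "failed": the commit may still have happened

print(write_outcome(acks=3, cluster_size=5))  # success
print(write_outcome(acks=2, cluster_size=5))  # unknown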
Raft is currently the easiest distributed consensus algorithm to implement and to understand. Like ZAB, Raft must first elect a leader.
Paxos here means single-decree Paxos, which has little value in engineering practice by itself, while many details of multi-Paxos were never spelled out.
Starting from the paper, Raft works out far more of the engineering details: log compaction and cluster-membership changes, for example, are covered by neither ZAB nor Paxos.
Like ZAB, Raft has a strong leader: a leader must be elected before the system can serve writes.
Like ZAB, followers vote and write the log entries the leader broadcasts (via the AppendEntries RPC).
Like ZAB, each elected leader corresponds to a totally ordered term.
The log plays the role of ZAB's propose and commit.
AppendEntries is exquisitely designed: the same RPC doubles as log replication and, when its entries are empty, as the heartbeat.
https://raft.github.io/ has a detailed, interactive election walkthrough; you can manipulate the nodes at will and observe the boundary behavior in each scenario. In the style of the Paxos text diagram above, here is a leader election in the normal case:
   Node1                Node2                Node3
     |                    |                    |
  timeout                 |                    |
     X----RequestVote---->|                    |
     X--------------------|----RequestVote---->|
     |<-------vote--------X                    |
     |<-------vote--------|--------------------X
become leader             |                    |
Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.
Think carefully through the Receiver implementation of the AppendEntries RPC.
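The following sketch paraphrases the receiver's rules from Figure 2 of the paper in Python (the dict-based state is my simplification): the one RPC carries the stale-term check, the log consistency check, conflict truncation, the append, and the commit-index advance, and with an empty entries list it degenerates into the heartbeat.

def append_entries(state, term, prev_log_index, prev_log_term,
                   entries, leader_commit):
    """Receiver side of AppendEntries. `state` holds current_term,
    log (a list of (term, command) pairs, 1-indexed in Raft terms),
    and commit_index."""
    # 1. Reject a stale leader.
    if term < state["current_term"]:
        return False
    log = state["log"]
    # 2. Reject if the log has no entry matching prevLogIndex/prevLogTerm.
    if prev_log_index > 0 and (len(log) < prev_log_index
                               or log[prev_log_index - 1][0] != prev_log_term):
        return False
    # 3./4. Truncate on conflict, then append the new entries.
    for i, entry in enumerate(entries):
        idx = prev_log_index + i          # 0-based slot for this entry
        if idx < len(log) and log[idx][0] != entry[0]:
            del log[idx:]                 # conflicting tail is discarded
        if idx >= len(log):
            log.append(entry)
    # 5. Advance commit_index; an empty entries list is the heartbeat.
    if leader_commit > state["commit_index"]:
        state["commit_index"] = min(leader_commit,
                                    prev_log_index + len(entries))
    return True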
When a majority of nodes return success, the entries in AppendEntries can be considered written, and the call can be returned to the client. This is guaranteed by an extra restriction imposed during leader election: only a node with a more up-to-date log (comparing last log index and term) can win the election. Hence, as long as an entry has been written on a majority of nodes (even if not yet committed), the election is guaranteed to pick a node that holds that entry, as the sketch below shows.
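The restriction itself is a two-field comparison; a sketch (names mine) of the check a voter runs before granting its vote:

def candidate_log_ok(my_last_term, my_last_index,
                     cand_last_term, cand_last_index):
    """Raft's election restriction: grant a vote only if the candidate's
    log is at least as up to date (compare last term, then last index)."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# an entry written on a majority means a majority of voters run this
# check against it, so a candidate missing the entry cannot win
print(candidate_log_ok(2, 5, 2, 4))  # False: same term, shorter log
print(candidate_log_ok(2, 5, 3, 1))  # True: higher last term wins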
Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the C_old,new configuration entry in its log and commits it to C_old,new (a majority of C_old and a majority of C_new). Then it creates the C_new entry and commits it to a majority of C_new. There is no point in time in which C_old and C_new can both make decisions independently.
The key is the handover of which majority is able to make decisions, as the sketch below shows.
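A sketch of that commit rule (my naming): during joint consensus an entry counts as committed only with a separate majority of both C_old and C_new, so neither configuration can decide alone.

def majority(acks, members):
    """True if the acking nodes form a majority of `members`."""
    return len(acks & members) * 2 > len(members)

def joint_committed(acks, c_old, c_new):
    """C_old,new commit rule: a separate majority of each configuration."""
    return majority(acks, c_old) and majority(acks, c_new)

c_old, c_new = {1, 2, 3}, {3, 4, 5}
print(joint_committed({1, 2}, c_old, c_new))        # False: no C_new majority
print(joint_committed({1, 2, 4, 5}, c_old, c_new))  # True: majority of both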
The gossip protocol differs substantially from Paxos/ZAB/Raft: there is no leader, and there is no linearizability guarantee. Through endlessly retried broadcasts, it ensures that a change propagates to every node after some period of time.
The key to implementing the algorithm is how to merge the gossip broadcasts of all nodes into a single state. For example, node a can declare that a holds k1=v1 while node b declares that b holds k1=v2; the merged state is then: a holds k1=v1 and b holds k1=v2. As long as the merge rule is consistent (e.g. keep both, prefer a, or prefer b), every node that receives the broadcasts ends up in the same state.
Because there is no linearizability guarantee, broadcasts must carry the full state rather than deltas. A sketch follows the figure below.
Active thread (peer P):           Passive thread (peer Q):
(1) selectPeer(&Q);               (1)
(2) selectToSend(&bufs);          (2)
(3) sendTo(Q, bufs);       -----> (3) receiveFromAny(&P, &bufr);
(4)                               (4) selectToSend(&bufs);
(5) receiveFrom(Q, &bufr); <----- (5) sendTo(P, bufs);
(6) selectToKeep(cache, bufr);    (6) selectToKeep(cache, bufr);
(7) processData(cache);           (7) processData(cache)

Figure 1: The general organization of a gossiping protocol.
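To make the merge concrete, here is a minimal Python sketch (the versioned node -> state layout is my assumption): merge implements selectToKeep with a deterministic higher-version-wins rule, and gossip_round plays out one push-pull exchange from the figure with in-memory "sends".

def merge(local, remote):
    """Merge two full-state gossip views: state maps node -> (version, data);
    for each node the higher version wins, so every replica applying the
    same deterministic rule converges to the same state."""
    out = dict(local)
    for node, (ver, data) in remote.items():
        if node not in out or ver > out[node][0]:
            out[node] = (ver, data)
    return out

def gossip_round(p_cache, q_cache):
    """One push-pull exchange from the figure: P sends its state to Q
    (steps 2-3), Q answers with its own (steps 4-5), and both sides
    keep the merge (step 6, selectToKeep)."""
    bufs_p, bufs_q = dict(p_cache), dict(q_cache)   # selectToSend on both sides
    return merge(p_cache, bufs_q), merge(q_cache, bufs_p)

# node a declares k1=v1 on a; node b declares k1=v2 on b
a = {"a": (1, {"k1": "v1"})}
b = {"b": (1, {"k1": "v2"})}
p, q = gossip_round(a, b)
print(p == q)  # True: one round leaves both peers with the merged state
print(p)       # {'a': (1, {'k1': 'v1'}), 'b': (1, {'k1': 'v2'})}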
Whether it is Paxos, ZAB, or Raft, the core principle for achieving consensus is similar:
whether the original node keeps committing after a crash or another node carries the commit forward, agreement is reached on a majority of nodes.
https://en.wikipedia.org/wiki...
http://lamport.azurewebsites....
https://stackoverflow.com/que...
http://www.cs.cornell.edu/cou...
https://raft.github.io/
https://web.stanford.edu/~ous...