[ZooKeeper] 1 Basic Concepts

Date: 2019-12-09

ZooKeeper: A Distributed Coordination Service for Distributed Applications

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.

Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.

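☛ ZooKeeper is an open-source implementation of Google Chubby and an important component of Hadoop and HBase. It provides one basic service, a distributed lock service, which was later extended to other uses: configuration maintenance, group services, distributed message queues, and distributed notification/coordination.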

Design Goals

ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The name space consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.

The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.

ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.

The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
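The majority rule above is simple arithmetic, and a small sketch may make it easier to reason about ensemble sizes. The function names here are hypothetical and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def is_available(ensemble_size: int, servers_up: int) -> bool:
    """The service is available as long as a majority of servers is up."""
    return servers_up >= quorum_size(ensemble_size)
```

Under this rule a 5-server ensemble tolerates 2 failures, while a 4-server ensemble tolerates only 1, which is one reason odd ensemble sizes are commonly chosen.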

Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.

ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.

ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

The services ZooKeeper provides are built from three parts: a data structure (the znode), primitives (operations on that data structure), and the watcher mechanism.

Data model and the hierarchical namespace

The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.
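To make the path structure concrete, here is a minimal sketch that splits a path into its elements and derives the parent path. The helper names are made up for illustration, and the rule that a non-root path must not end with a slash is my understanding of ZooKeeper's path validation, so treat it as an assumption.

```python
def path_elements(path: str) -> list:
    """Split an absolute path such as "/app/config" into ["app", "config"]."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    if path == "/":
        return []  # the root has no elements
    if path.endswith("/"):
        # Assumption: only the root may end with a slash.
        raise ValueError("path must not end with '/'")
    elements = path[1:].split("/")
    if any(element == "" for element in elements):
        raise ValueError("path contains an empty element")
    return elements


def parent_path(path: str) -> str:
    """Return the path of the parent node, e.g. "/app" for "/app/config"."""
    elements = path_elements(path)
    if not elements:
        raise ValueError("the root has no parent")
    return "/" + "/".join(elements[:-1])
```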

Each path is a Unicode string that uniquely identifies its node.

Nodes and ephemeral nodes

Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
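The version number returned with each read is what makes conditional updates possible: a writer can ask that its write succeed only if nobody else has changed the data since it was read. The following is an in-memory sketch of that idea, not client code; the class and exception names are hypothetical, and the convention that -1 skips the check is written here as an explicit assumption of the sketch.

```python
from dataclasses import dataclass


class BadVersionError(Exception):
    """Raised when the caller's expected version is stale (hypothetical name)."""


@dataclass
class Znode:
    data: bytes = b""
    version: int = 0  # data version, increased on every successful write

    def get_data(self):
        # A read returns the whole payload together with its version.
        return self.data, self.version

    def set_data(self, data: bytes, expected_version: int = -1) -> int:
        # Assumption of this sketch: -1 means "do not check the version".
        if expected_version != -1 and expected_version != self.version:
            raise BadVersionError(
                f"expected version {expected_version}, node is at {self.version}"
            )
        self.data = data  # a write replaces all the data
        self.version += 1
        return self.version
```

Two clients that both read version 0 and then both try to write with `expected_version=0` cannot both succeed: the second write fails and that client has to re-read and retry.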

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
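The session-bound lifetime of ephemeral nodes can be illustrated with a toy registry. This is only a model of the behavior described above, with hypothetical names; it uses an owner of 0 to mark persistent nodes.

```python
class NodeRegistry:
    """Toy model: maps each path to the session id that owns it (0 = persistent)."""

    def __init__(self):
        self.owners = {}

    def create(self, path: str, session_id: int, ephemeral: bool = False) -> None:
        if ephemeral and session_id == 0:
            raise ValueError("an ephemeral node needs a non-zero session id")
        self.owners[path] = session_id if ephemeral else 0

    def close_session(self, session_id: int) -> list:
        """Delete every ephemeral node owned by the session; return their paths."""
        if session_id == 0:
            return []  # 0 marks persistent nodes, never a real session
        removed = sorted(p for p, owner in self.owners.items() if owner == session_id)
        for path in removed:
            del self.owners[path]
        return removed
```

A common use of this behavior is presence tracking: each worker creates an ephemeral node when it starts, and the node disappears on its own if the worker's session ends.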

Although every ephemeral node is bound to one client session, it is still visible to all clients. Ephemeral nodes are also not allowed to have children.

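The type of a node is fixed when it is created and cannot be changed afterwards.

The lifetime of a persistent node does not depend on a session; it is removed only when a client explicitly deletes it.

For a sequential node, ZooKeeper appends a monotonically increasing counter to the end of the requested path when the node is created. The counter is unique to the node's parent and is formatted as "%010d", that is, ten digits with zero padding.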
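As an illustration of sequential naming, the sketch below appends a zero-padded, ten-digit counter (the "%010d" format given in the ZooKeeper programmer's guide) to a requested path. The `ParentCounter` class is a hypothetical stand-in for the per-parent counter that the server keeps.

```python
def sequential_name(requested_path: str, counter: int) -> str:
    """Append the counter as ten zero-padded digits."""
    return f"{requested_path}{counter:010d}"


class ParentCounter:
    """Hypothetical per-parent counter handing out increasing sequence numbers."""

    def __init__(self):
        self._next = 0

    def create_sequential(self, requested_path: str) -> str:
        name = sequential_name(requested_path, self._next)
        self._next += 1
        return name
```

Because the digits are zero-padded, sorting the child names as strings yields creation order, which recipes such as locks and queues can rely on.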

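A watcher is a watch that a client sets on a node. When the state of the node changes, the action bound to the watch is triggered. A watch fires only once and is then removed.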

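☛ Each znode consists of three parts:

stat: state information describing the znode's version, permissions, and so on;
data: the data associated with the znode;
children: the child nodes under the znode.

The node types are:

PERSISTENT: persistent node
PERSISTENT_SEQUENTIAL: persistent node with a sequence number
EPHEMERAL: ephemeral node
EPHEMERAL_SEQUENTIAL: ephemeral node with a sequence number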

Conditional updates and watches

ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered the client receives a packet saying that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].

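Watches can be set by all of ZooKeeper's read operations (exists, getChildren, getData). In principle, a client receives the watch event before it sees the changed state of the watched object.

Watches are maintained locally on the ZooKeeper server that the client is connected to, which makes them cheap to set, manage, and dispatch. There are two kinds of watches:

data watches: watch the data of the current node; set by getData and exists;
child watches: watch the children of the current node; set by getChildren.

The functions getData, exists, and getChildren therefore play a dual role: they register the trigger and they perform their own operation. The two roles are handled by overriding process(Event event) and processResult() respectively.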
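The one-shot nature of a watch is easy to get wrong, so here is a small in-memory model of a data watch. It is not client code: the class and its method names are hypothetical, and only the event name `NodeDataChanged` comes from ZooKeeper.

```python
from collections import defaultdict


class WatchedStore:
    """Toy store in which a read can register a one-shot data watch."""

    def __init__(self):
        self.data = {}
        self._data_watches = defaultdict(list)

    def get_data(self, path, watcher=None):
        # Dual role: return the data and, optionally, register a watch.
        if watcher is not None:
            self._data_watches[path].append(watcher)
        return self.data[path]

    def set_data(self, path, value):
        self.data[path] = value
        # pop() removes the registrations before firing them,
        # so each watch is delivered at most once.
        for watcher in self._data_watches.pop(path, []):
            watcher("NodeDataChanged", path)
```

A client that wants to keep following a node has to set a new watch each time, typically by reading again inside the callback.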

Events triggered for each watch, by operation and target:

Watch set by	create znode	create child	delete znode	delete child	setData znode
exists	NodeCreated	-	NodeDeleted	-	NodeDataChanged
getData	-	-	NodeDeleted	-	NodeDataChanged
getChildren	-	NodeChildrenChanged	NodeDeleted	NodeChildrenChanged	-

Guarantees

ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.

Notes on these guarantees:

Atomicity - The outcome is also consistent across the ensemble: a transaction is either applied on every machine in the cluster or on none of them; it never ends up applied on some machines and not on others.
Reliability - The guarantee holds once the server has applied the update and responded to the client.
Timeliness - ZooKeeper is therefore not strongly consistent; it guarantees sequential consistency and eventual consistency, which amounts to near-real-time rather than strictly real-time reads.

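Time and versions

(1) zxid

Every operation that changes the state of a ZooKeeper node stamps the node with a timestamp in zxid form. Zxids are globally ordered and unique: if zxid1 is less than zxid2, the transaction for zxid1 happened before the transaction for zxid2.

czxid: zxid of the transaction that created the node
mzxid: zxid of the transaction that last modified the node
pzxid: zxid of the most recent creation or deletion of this node or one of its children

A zxid is a 64-bit number. The high 32 bits hold the epoch, which identifies changes of leadership: each time a leader is elected, a new epoch begins. The low 32 bits are an incrementing counter.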
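The 64-bit layout described above can be unpacked with two bit operations. The function names are hypothetical; the layout (epoch in the high 32 bits, counter in the low 32 bits) is the one stated in the text.

```python
MASK_32 = 0xFFFFFFFF


def split_zxid(zxid: int):
    """Return (epoch, counter) for a 64-bit zxid."""
    epoch = (zxid >> 32) & MASK_32
    counter = zxid & MASK_32
    return epoch, counter


def make_zxid(epoch: int, counter: int) -> int:
    """Combine an epoch and a counter into a single 64-bit zxid."""
    return ((epoch & MASK_32) << 32) | (counter & MASK_32)
```

Because the epoch occupies the high bits, any zxid issued under a newer leader compares greater than every zxid from an earlier epoch, whatever the counter values are.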

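(2) version

Every operation on a node increases the corresponding version number of that node.

version: version number of the node's data
cversion: version number of the node's children
aversion: version number of the node's ACL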

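Node attributes

Attribute	Description
czxid	zxid of the transaction that created the node
mzxid	zxid of the transaction that last modified the node
ctime	time the node was created
mtime	time the node was last modified
version	version number of the node's data
cversion	version number of the node's children
aversion	version number of the node's ACL
ephemeralOwner	session id of the owner if the node is ephemeral, otherwise 0
dataLength	length of the node's data
numChildren	number of children the node has
pzxid	zxid of the most recent creation or deletion of this node or one of its children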

Simple API

One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:

create: creates a node at a location in the tree
delete: deletes a node
exists: tests if a node exists at a location
get data: reads the data from a node
set data: writes data to a node
get children: retrieves a list of children of a node
sync: waits for data to be propagated

Notes on these operations:

Method	Notes
create	the parent node must already exist
delete	the znode must have no children
exists	also returns the node's metadata
get data	read operations: getACL, getChildren, getData
set data	write operations: setACL, setData
get children	returns the list of the node's children
sync	waits for the data to be synchronized to the other nodes

Implementation

ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.

The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.

As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.
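One way to picture the last step is that the leader turns a conditional request ("set the data if the version is still 3") into a transaction that states the resulting state outright ("data is X, version is 4"). The sketch below models that idea with plain dataclasses; all type and function names are hypothetical, and it is a simplification rather than ZooKeeper's actual request pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetDataRequest:
    path: str
    data: bytes
    expected_version: int  # -1 means "any version" in this sketch


@dataclass(frozen=True)
class SetDataTxn:
    zxid: int
    path: str
    data: bytes
    new_version: int  # the resulting state, not "add one"


@dataclass(frozen=True)
class ErrorTxn:
    zxid: int
    reason: str


def to_transaction(request, store, zxid):
    """Leader side: evaluate the request against current state (path -> (data, version))."""
    if request.path not in store:
        return ErrorTxn(zxid, "no node")
    _, current_version = store[request.path]
    if request.expected_version not in (-1, current_version):
        return ErrorTxn(zxid, "bad version")
    return SetDataTxn(zxid, request.path, request.data, current_version + 1)


def apply_txn(txn, store):
    """Replica side: applying the same transaction twice leaves the same state."""
    if isinstance(txn, SetDataTxn):
        store[txn.path] = (txn.data, txn.new_version)
```

Since the transaction carries the final version number instead of an instruction to increment, a replica that replays it during recovery ends up in the same state as one that applied it once.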

Performance

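ZooKeeper release 3.2
2 GHz Xeon with two SATA 15K RPM disks
One disk was used for the ZooKeeper log; snapshots were written to the OS drive
"N Servers" is the number of servers in the ZooKeeper ensemble
About 30 servers were used to simulate clients
The ensemble was configured so that the leader does not accept client connections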

Reliability

Failure and recovery of a follower
Failure and recovery of a different follower
Failure of the leader
Failure and recovery of two followers
Failure of another leader

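ZooKeeper security mechanism

ACL (Access Control List): ZooKeeper provides a full ACL permission-control mechanism made up of three parts: the scheme, the permission object, and the permission.

Scheme: the permission mode, the part developers work with most often.

IP: controls permissions at the granularity of IP addresses and supports granting permissions by network segment, for example 192.168.1.*
Digest: the most commonly used scheme. It takes a permission identifier of the form "username:password", hashes it with SHA-1, and then Base64-encodes the result.
World: the most open scheme; it can be treated as a special case of Digest.
Super: superuser mode, which can perform any operation.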
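The Digest encoding described above (SHA-1 followed by Base64) can be reproduced with the standard library. To my understanding the stored identifier has the form `username:base64(sha1("username:password"))`, which is what this sketch produces; treat that exact layout, and the function name, as assumptions to verify against your ZooKeeper version.

```python
import base64
import hashlib


def generate_digest(id_password: str) -> str:
    """Turn "username:password" into "username:<base64 of SHA-1 hash>"."""
    if ":" not in id_password:
        raise ValueError('expected "username:password"')
    username = id_password.split(":", 1)[0]
    # Hash the full "username:password" string, then Base64-encode the raw bytes.
    sha1_bytes = hashlib.sha1(id_password.encode("utf-8")).digest()
    return username + ":" + base64.b64encode(sha1_bytes).decode("ascii")
```

SHA-1 is a one-way hash, not encryption, so the password cannot be recovered from the stored identifier, only checked against it.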

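Permission object: the user or entity the permission is granted to, such as an IP address or a machine.
Permission: the operations that are allowed once the permission check passes: CREATE, DELETE, READ, WRITE, and ADMIN.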
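The five permissions combine as bit flags, so a permission check reduces to a bitwise test. The sketch below models this with `enum.IntFlag`; the bit values follow what I recall of ZooKeeper's `ZooDefs.Perms` constants (READ=1, WRITE=2, CREATE=4, DELETE=8, ADMIN=16), and `is_allowed` is a hypothetical helper.

```python
from enum import IntFlag


class Perm(IntFlag):
    READ = 1 << 0
    WRITE = 1 << 1
    CREATE = 1 << 2
    DELETE = 1 << 3
    ADMIN = 1 << 4


ALL = Perm.READ | Perm.WRITE | Perm.CREATE | Perm.DELETE | Perm.ADMIN


def is_allowed(granted: Perm, requested: Perm) -> bool:
    """True when every requested permission bit is present in the grant."""
    return (granted & requested) == requested
```

For example, an ACL entry granting `Perm.READ | Perm.WRITE` lets a client read and update a node's data but not create or delete children under it.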

by. Memento