Cluster Snapshots in Practice with the Distributed Graph Database Nebula Graph

Date: 2020-02-10


1 Overview

1.1 Background and Requirements

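In production, the graph database Nebula Graph holds a huge amount of data and handles business requests at high frequency. Human errors, hardware failures, and faulty business operations are unavoidable in practice, and some severe errors can leave the cluster unable to run or its data invalid. When the cluster cannot start or its data is invalid, rebuilding the cluster and re-importing the data is tedious and time-consuming. To address this, Nebula Graph provides a feature for creating cluster snapshots.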

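The snapshot feature must be able to create, ahead of time, a snapshot of the cluster at a given point in time, so that if a disaster strikes, a historical snapshot can be used to restore the cluster to a usable state.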

1.2 Terminology

This article uses the following terms:

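StorageEngine: the smallest physical storage unit of Nebula Graph. RocksDB and HBase are currently supported; this article covers RocksDB only.
Partition: the smallest logical storage unit of Nebula Graph. One StorageEngine can contain multiple Partitions. A Partition takes the role of leader or follower, and Raftex guarantees data consistency between leader and followers.
GraphSpace: each GraphSpace is an independent business graph unit with its own set of tags and edges. One Nebula Graph cluster can contain multiple GraphSpaces.
checkpoint: a point-in-time snapshot of a single StorageEngine. A checkpoint can serve as a full backup. The checkpoint files are hard links to the sst files.
snapshot: in this article, a snapshot is a point-in-time snapshot of the Nebula Graph cluster, that is, the set of checkpoints of all StorageEngines in the cluster. A snapshot can be used to restore the cluster to its state at the moment the snapshot was created.
wal: write-ahead log. Raftex uses it to keep leader and followers consistent.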

2 System Architecture

2.1 Overall System Architecture

2.2 Storage System Structure

2.3 Physical File Layout of the Storage System

[bright2star@hp-server storage]$ tree
.
└── nebula
    └── 1
        ├── checkpoints
        │   ├── SNAPSHOT_2019_12_04_10_54_42
        │   │   ├── data
        │   │   │   ├── 000006.sst
        │   │   │   ├── 000008.sst
        │   │   │   ├── CURRENT
        │   │   │   ├── MANIFEST-000007
        │   │   │   └── OPTIONS-000005
        │   │   └── wal
        │   │       ├── 1
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 2
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 3
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 4
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 5
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 6
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 7
        │   │       │   └── 0000000000000000233.wal
        │   │       ├── 8
        │   │       │   └── 0000000000000000233.wal
        │   │       └── 9
        │   │           └── 0000000000000000233.wal
        │   └── SNAPSHOT_2019_12_04_10_54_44
        │       ├── data
        │       │   ├── 000006.sst
        │       │   ├── 000008.sst
        │       │   ├── 000009.sst
        │       │   ├── CURRENT
        │       │   ├── MANIFEST-000007
        │       │   └── OPTIONS-000005
        │       └── wal
        │           ├── 1
        │           │   └── 0000000000000000236.wal
        │           ├── 2
        │           │   └── 0000000000000000236.wal
        │           ├── 3
        │           │   └── 0000000000000000236.wal
        │           ├── 4
        │           │   └── 0000000000000000236.wal
        │           ├── 5
        │           │   └── 0000000000000000236.wal
        │           ├── 6
        │           │   └── 0000000000000000236.wal
        │           ├── 7
        │           │   └── 0000000000000000236.wal
        │           ├── 8
        │           │   └── 0000000000000000236.wal
        │           └── 9
        │               └── 0000000000000000236.wal
        ├── data

3 Processing Logic

3.1 Logic Analysis

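Create snapshot is triggered from the client API or the console. The graph server parses the AST of the create snapshot statement and then sends the creation request to the meta server through the meta client. On receiving the request, the meta server first obtains all active hosts and builds the request that adminClient needs. adminClient then sends the creation request to every storage host. On receiving the create request, the storage host iterates over all StorageEngines of the specified space and creates a checkpoint for each, then hard-links the wal of every partition in that StorageEngine. Because a write-blocking request has already been sent to all leader partitions beforehand, the database is read-only while the checkpoints and wal hard links are being created.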
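The ordering described above (block writes, checkpoint, link the wal) can be illustrated with a small self-contained simulation. Everything in it is hypothetical: `FakeStorageHost` and `createClusterSnapshot` are stand-ins invented for this sketch, not Nebula Graph classes, and lifting the write block once the checkpoints are done is an assumption, since the text only states that writes are blocked beforehand.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one storage host; not a Nebula Graph class.
struct FakeStorageHost {
    bool writesBlocked = false;
    bool failCheckpoint = false;       // set to true to simulate a failure
    std::vector<std::string> log;      // records the order of operations

    void blockWrites(bool on) {
        writesBlocked = on;
        log.push_back(on ? "block" : "unblock");
    }

    // Checkpoint first, then link the wal, mirroring the order in the text.
    bool createCheckpoint(const std::string& name) {
        if (!writesBlocked || failCheckpoint) {
            return false;              // refuse to snapshot while writes are allowed
        }
        log.push_back("checkpoint:" + name);
        log.push_back("link-wal:" + name);
        return true;
    }
};

// Hypothetical coordinator playing the role the text assigns to the meta server.
bool createClusterSnapshot(std::vector<FakeStorageHost>& hosts, const std::string& name) {
    for (auto& host : hosts) {
        host.blockWrites(true);        // the cluster is read-only from here on
    }
    bool ok = true;
    for (auto& host : hosts) {
        if (!host.createCheckpoint(name)) {
            ok = false;                // a failed snapshot must be dropped later
            break;
        }
    }
    for (auto& host : hosts) {
        host.blockWrites(false);       // assumption: writes resume afterwards
    }
    return ok;
}
```

Blocking every host before the first checkpoint starts is what makes the per-engine checkpoints add up to one consistent cluster-wide point in time.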

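Because the snapshot name is generated automatically from the system timestamp, there is no need to worry about name collisions. If an unnecessary snapshot has been created, it can be deleted with the drop snapshot command.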
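A name in the SNAPSHOT_YYYY_MM_DD_HH_MM_SS pattern seen in this article can be produced with the standard `strftime` function. The sketch below is an illustration, not the meta server's code: `makeSnapshotName` is a hypothetical helper, and rendering the timestamp in UTC is an assumption, since the article does not say which time zone is used.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper, not the Nebula Graph implementation.
// Formats a timestamp as SNAPSHOT_YYYY_MM_DD_HH_MM_SS.
// Assumption: the timestamp is rendered in UTC.
std::string makeSnapshotName(std::time_t t) {
    const std::tm* utc = std::gmtime(&t);
    if (utc == nullptr) {
        return std::string();          // timestamp could not be converted
    }
    // std::gmtime returns shared static storage, so copy it right away.
    const std::tm parts = *utc;
    char buf[64];
    const std::size_t n =
        std::strftime(buf, sizeof(buf), "SNAPSHOT_%Y_%m_%d_%H_%M_%S", &parts);
    return std::string(buf, n);
}
```

With second-level resolution, two snapshots created within the same second would receive the same name, so uniqueness relies on snapshots not being created more than once per second.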

3.2 Create Snapshot

3.3 Create Checkpoint

4 Key Code

4.1 Create Snapshot

folly::Future<Status> AdminClient::createSnapshot(GraphSpaceID spaceId, const std::string& name) {
    // Get the hosts of all storage engines.
    auto allHosts = ActiveHostsMan::getActiveHosts(kv_);
    storage::cpp2::CreateCPRequest req;
    
    // Set the spaceId. Currently every space is checkpointed; listing the spaces is done by the caller.
    req.set_space_id(spaceId);
    
    // Set the snapshot name, generated by the meta server from a timestamp,
    // for example SNAPSHOT_2019_12_04_10_54_44.
    req.set_name(name);
    folly::Promise<Status> pro;
    auto f = pro.getFuture();
    
    // Send the request to all storage engines through the getResponse interface.
    getResponse(allHosts, 0, std::move(req), [] (auto client, auto request) {
        return client->future_createCheckpoint(request);
    }, 0, std::move(pro), 1 /*The snapshot operation only needs to be retried twice*/);
    return f;
}

4.2 Create Checkpoint

ResultCode NebulaStore::createCheckpoint(GraphSpaceID spaceId, const std::string& name) {
    auto spaceRet = space(spaceId);
    if (!ok(spaceRet)) {
        return error(spaceRet);
    }
    auto space = nebula::value(spaceRet);
    
    // Iterate over all StorageEngines that belong to this space.
    for (auto& engine : space->engines_) {
        
        // First create a checkpoint of the StorageEngine.
        auto code = engine->createCheckpoint(name);
        if (code != ResultCode::SUCCEEDED) {
            return code;
        }
        
        // Then hard-link the last wal of every partition in this StorageEngine.
        auto parts = engine->allParts();
        for (auto& part : parts) {
            auto ret = this->part(spaceId, part);
            if (!ok(ret)) {
                LOG(ERROR) << "Part not found. space : " << spaceId << " Part : " << part;
                return error(ret);
            }
            auto walPath = folly::stringPrintf("%s/checkpoints/%s/wal/%d",
                                                      engine->getDataRoot(), name.c_str(), part);
            auto p = nebula::value(ret);
            if (!p->linkCurrentWAL(walPath.data())) {
                return ResultCode::ERR_CHECKPOINT_ERROR;
            }
        }
    }
    return ResultCode::SUCCEEDED;
}

5 User Guide

5.1 CREATE SNAPSHOT

CREATE SNAPSHOT creates a snapshot of the whole cluster at the current point in time. The snapshot name is built from the meta server's timestamp.

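Creation may fail. The current version does not automatically garbage-collect after a failed creation. A cluster checker is planned for metaServer; it will check the cluster state on an asynchronous thread and automatically reclaim the garbage files left behind by failed snapshot creations.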

In the current version, if snapshot creation fails, the invalid snapshot must be removed with the DROP SNAPSHOT command.
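Until automatic garbage collection exists, this cleanup can be scripted by turning the rows reported by SHOW SNAPSHOTS into DROP SNAPSHOT statements. The sketch below shows only that mapping. `SnapshotRow` and `dropStatementsForInvalid` are hypothetical names introduced here, and obtaining the rows from the cluster as well as executing the statements is left to whichever client is in use.

```cpp
#include <string>
#include <vector>

// Hypothetical row type mirroring the Name and Status columns of SHOW SNAPSHOTS.
struct SnapshotRow {
    std::string name;
    std::string status;   // "VALID" or "INVALID"
};

// Hypothetical helper: build one DROP SNAPSHOT statement per INVALID snapshot.
// It only produces the statement text; it does not talk to the cluster.
std::vector<std::string> dropStatementsForInvalid(const std::vector<SnapshotRow>& rows) {
    std::vector<std::string> statements;
    for (const auto& row : rows) {
        if (row.status == "INVALID") {
            statements.push_back("DROP SNAPSHOT " + row.name + ";");
        }
    }
    return statements;
}
```

Valid snapshots are deliberately left untouched, so the helper is safe to run after every failed CREATE SNAPSHOT.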

The current version does not support snapshotting a specific space. Running CREATE SNAPSHOT creates a snapshot of every space in the cluster.

CREATE SNAPSHOT syntax:

CREATE SNAPSHOT

The following example creates three snapshots:

(user@127.0.0.1) [default_space]> create snapshot;
Execution succeeded (Time spent: 28211/28838 us)

(user@127.0.0.1) [default_space]> create snapshot;
Execution succeeded (Time spent: 22892/23923 us)

(user@127.0.0.1) [default_space]> create snapshot;
Execution succeeded (Time spent: 18575/19168 us)

Now we list the existing snapshots with the SHOW SNAPSHOTS command described in 5.3:

(user@127.0.0.1) [default_space]> show snapshots;
===========================================================
| Name                         | Status | Hosts           |
===========================================================
| SNAPSHOT_2019_12_04_10_54_36 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_42 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_44 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
Got 3 rows (Time spent: 907/1495 us)

As SNAPSHOT_2019_12_04_10_54_36 above shows, the snapshot name is derived from the timestamp.

5.2 DROP SNAPSHOT

DROP SNAPSHOT deletes the snapshot with the given name. Snapshot names can be obtained with the SHOW SNAPSHOTS command. DROP SNAPSHOT can delete both valid snapshots and snapshots whose creation failed.

Syntax:

DROP SNAPSHOT name

Here the author drops SNAPSHOT_2019_12_04_10_54_36, one of the snapshots created successfully in 5.1, and then lists the remaining snapshots with SHOW SNAPSHOTS.

(user@127.0.0.1) [default_space]> drop snapshot SNAPSHOT_2019_12_04_10_54_36;
Execution succeeded (Time spent: 6188/7348 us)

(user@127.0.0.1) [default_space]> show snapshots;
===========================================================
| Name                         | Status | Hosts           |
===========================================================
| SNAPSHOT_2019_12_04_10_54_42 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_44 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
Got 2 rows (Time spent: 1097/1721 us)

5.3 SHOW SNAPSHOTS

SHOW SNAPSHOTS lists all snapshots in the cluster. For each snapshot it shows the name, the status (VALID or INVALID), and the addresses of all storage servers at the time the snapshot was created.

Syntax:

SHOW SNAPSHOTS

A small example:

(user@127.0.0.1) [default_space]> show snapshots;
===========================================================
| Name                         | Status | Hosts           |
===========================================================
| SNAPSHOT_2019_12_04_10_54_36 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_42 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
| SNAPSHOT_2019_12_04_10_54_44 | VALID  | 127.0.0.1:77833 |
-----------------------------------------------------------
Got 3 rows (Time spent: 907/1495 us)

6 Notes

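After the system structure changes, for example after add host, drop host, create space, drop space, or balance, it is best to create a snapshot immediately.
The current version does not let users specify the snapshot path. Snapshots are created under the data_path/nebula directory by default.
The current version does not provide a snapshot restore feature. Users need to write a shell script suited to their production environment. The logic is fairly simple: copy each engineServer's snapshot into a designated folder, set that folder as data_path, and start the cluster.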

7 Appendix

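Finally, here is the Nebula Graph GitHub repository: https://github.com/vesoft-inc/nebula. If you run into any problem while using Nebula Graph, feel free to contact us on GitHub, or join the WeChat discussion group by contacting the WeChat ID NebulaGraphbot.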