Create the unique Id. It is created only once per communicator clique, and the resulting Id is distributed to every application (rank) in the clique.
Create a communicator for each application (rank) in the clique. Each rank binds to its own communicator for all subsequent communication.
Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or use ncclGroupStart/ncclGroupEnd.
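A minimal initialization sketch for the multi-process case (assumptions: one process per GPU, rank/nranks/localDevice already known, and some out-of-band mechanism of your own to ship the id from rank 0 to the other ranks):

ncclUniqueId id;
ncclComm_t comm;
if (rank == 0) ncclGetUniqueId(&id);   /* created once per clique (see above) */
/* distribute `id` to all other ranks here, e.g. over a socket or MPI */
cudaSetDevice(localDevice);            /* the CUDA device must be set first */
ncclCommInitRank(&comm, nranks, id, rank);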
Collective operations must be called separately for each communicator in the clique. The function returns as soon as the operation has been enqueued on the CUDA stream. Each collective call must be issued from its own process/thread, or from a single thread using group semantics (ncclGroupStart/ncclGroupEnd). In-place mode (sendbuff == recvbuff) is supported.
Collective communication operations must be called separately for each communicator in a communicator clique. They return when operations have been enqueued on the CUDA stream. Since they may perform inter-CPU synchronization, each call has to be done from a different thread or process, or has to use Group Semantics.
Reduce
Reduces data arrays of length count in sendbuff into recvbuff using op operation. recvbuff may be NULL on all calls except for root device. root is the rank (not the CUDA device) where data will reside after the operation is complete. In-place operation will happen if sendbuff == recvbuff.
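A call sketch, assuming comm, stream and the device buffers sendbuff/recvbuff (count floats) are already set up, summing the result onto rank 0:

ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, 0 /* root rank */, comm, stream);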
Broadcast
Copies count values from root to all other devices. root is the rank (not the CUDA device) where data resides before the operation is started. This operation is implicitly in place.
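A call sketch of the in-place form described above (buff lives on every device and holds count floats; the values of rank 0 end up on every device; comm/stream assumed set up):

ncclBcast(buff, count, ncclFloat, 0 /* root rank */, comm, stream);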
All-Reduce
Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff. In-place operation will happen if sendbuff == recvbuff.
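A call sketch, same assumptions as above (pass sendbuff == recvbuff for the in-place variant):

ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);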
Reduce-Scatter
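The notes give no description here; per the NCCL API, ncclReduceScatter reduces sendbuff across ranks and leaves block i of the result in recvbuff of rank i, so sendbuff holds nranks*recvcount elements and recvbuff holds recvcount. A call sketch under the same assumptions as above:

ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat, ncclSum, comm, stream);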
All-Gather
Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount.
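A call sketch, again assuming comm/stream and buffers are ready (recvbuff sized for at least nranks*sendcount elements):

ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);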
When a single thread manages multiple GPUs, group calls are needed to batch the per-rank/per-device calls (otherwise the inter-CPU synchronization they perform would conflict). ncclGroupStart and ncclGroupEnd mark the calls in between as parts of the same operation: ncclGroupStart starts queuing the calls, and ncclGroupEnd waits for them to complete (for collective operations, ncclGroupEnd only guarantees that everything has been enqueued on the CUDA streams, not that the operations have finished). Group semantics can be used with both collective operations and ncclCommInitRank.
When managing multiple GPUs from a single thread, and since NCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for different ranks/devices into a single call. Grouping NCCL calls as being part of the same collective operation is done using ncclGroupStart and ncclGroupEnd. ncclGroupStart will enqueue all collective calls until the ncclGroupEnd call, which will wait for all calls to be complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations are enqueued on the streams, not that the operation is effectively done. Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd.
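A minimal single-thread sketch, assuming ndev communicators comms[i] plus per-device buffers, streams and device ids already exist (all names here are illustrative):

ncclGroupStart();
for (int i = 0; i < ndev; i++)
  /* one call per communicator; nothing blocks until ncclGroupEnd */
  ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
ncclGroupEnd();                        /* enqueued on the streams, not yet finished */
for (int i = 0; i < ndev; i++) {
  cudaSetDevice(devs[i]);
  cudaStreamSynchronize(streams[i]);   /* wait for the GPU work itself */
}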
ncclResult_t ncclGroupStart()
Start a group call; the calls that follow will not block on inter-CPU synchronization.
Start a group call. All subsequent calls to NCCL may not block due to inter-CPU synchronization.
ncclResult_t ncclGroupEnd()
End the group call; blocks until all operations issued since ncclGroupStart have completed, then returns.
End a group call. Wait for all calls since ncclGroupStart to complete before returning.
ncclResult_t ncclCommDestroy(ncclComm_t comm)
Frees the resources of comm.
Frees resources associated with communicator object
const char* ncclGetErrorString(ncclResult_t result)
Gets the error message for a result code.
Returns a human-readable error message.
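A common checking pattern built on top of it (a sketch, not part of the NCCL API; needs <stdio.h> and <stdlib.h>):

#define NCCLCHECK(cmd) do {                                 \
  ncclResult_t r = (cmd);                                   \
  if (r != ncclSuccess) {                                   \
    fprintf(stderr, "NCCL error %s:%d '%s'\n",              \
            __FILE__, __LINE__, ncclGetErrorString(r));     \
    exit(EXIT_FAILURE);                                     \
  }                                                         \
} while(0)

Used as NCCLCHECK(ncclAllReduce(...)), so every NCCL call is checked.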
ncclResult_t ncclCommCount(const ncclComm_t comm, int* count)
Gets the total number of ranks in the communicator clique.
Gets the number of ranks in the communicator clique.
ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* device)
Gets the CUDA device associated with this communicator.
Returns the cuda device number associated with the communicator.
ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank)
Gets the rank associated with this communicator.
Returns the user-ordered "rank" associated with the communicator.
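A small query-and-cleanup sketch, assuming an initialized comm (needs <stdio.h>):

int n, dev, rank;
ncclCommCount(comm, &n);        /* total ranks in the clique */
ncclCommCuDevice(comm, &dev);   /* CUDA device bound to this communicator */
ncclCommUserRank(comm, &rank);  /* this communicator's rank */
printf("rank %d/%d on cuda device %d\n", rank, n, dev);
ncclCommDestroy(comm);          /* release the communicator when done */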
Design idea (multi-thread version):
Each thread initializes its own communicator (the ncclId is created in the main thread and is visible to all threads). When a thread's merge operation fails, check whether the failure was caused by some thread having exited.
If the merge failed because a thread exited, every remaining thread re-initializes its own communicator and retries the merge from the previous step (the device count has shrunk at this point, so this is effectively creating a new communicator clique).
Design idea (multi-process version):
Each process initializes its own communicator (during initialization, process 0 broadcasts the ncclId to all processes over TCP). When a process's merge operation fails, check whether the failure was caused by some process having exited. If so, every remaining process re-initializes its own communicator and retries the merge from the previous step (the device count has shrunk at this point, so this is effectively creating a new communicator clique).
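A rough C-like sketch of the re-initialize-and-retry loop described in both designs; aliveRanks(), exchangeNcclId() and merge() are hypothetical helpers of this design, not NCCL functions:

for (;;) {
  int nranks, rank;
  aliveRanks(&nranks, &rank);               /* hypothetical: current membership after any exits */
  ncclUniqueId id = exchangeNcclId(rank);   /* hypothetical: rank 0 creates the id and broadcasts it (e.g. over TCP) */
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);
  if (merge(comm) == 0) break;              /* hypothetical merge step from the design above */
  ncclCommDestroy(comm);                    /* a peer exited: rebuild with the reduced device set */
}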