Create the unique Id. It is created only once per communicator clique, and the resulting Id is distributed to every application (rank) in the clique.
Create a communicator for each application (rank) in the clique. Each rank binds to its own communicator for all subsequent communication.
Creates a new communicator (multi thread/process version). rank must be between 0 and nranks-1 and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling ncclCommInitRank. ncclCommInitRank implicitly synchronizes with other ranks, so it must be called by different threads/processes or use ncclGroupStart/ncclGroupEnd.
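A minimal initialization sketch for the multi-process case (assumptions: one process per GPU, rank/nranks/localDevice already known, and some out-of-band mechanism of your own to ship the id from rank 0 to the other ranks):

ncclUniqueId id;
ncclComm_t comm;
if (rank == 0) ncclGetUniqueId(&id);   /* created once per clique (see above) */
/* distribute `id` to all other ranks here, e.g. over a socket or MPI */
cudaSetDevice(localDevice);            /* the CUDA device must be set first */
ncclCommInitRank(&comm, nranks, id, rank);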
Collective operations must be called separately for each communicator in the clique. The function returns as soon as the operation has been enqueued on the CUDA stream. Each collective call must be issued from its own process/thread, or from a single thread using group semantics (ncclGroupStart/ncclGroupEnd). In-place mode (sendbuff == recvbuff) is supported.
Collective communication operations must be called separately for each communicator in a communicator clique. They return when operations have been enqueued on the CUDA stream. Since they may perform inter-CPU synchronization, each call has to be done from a different thread or process, or has to use Group Semantics.
Reduce
Reduces data arrays of length count in sendbuff into recvbuff using op operation. recvbuff may be NULL on all calls except for root device. root is the rank (not the CUDA device) where data will reside after the operation is complete. In-place operation will happen if sendbuff == recvbuff.
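A call sketch, assuming comm, stream and the device buffers sendbuff/recvbuff (count floats) are already set up, summing the result onto rank 0:

ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, 0 /* root rank */, comm, stream);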
Broadcast
Copies count values from root to all other devices. root is the rank (not the CUDA device) where data resides before the operation is started. This operation is implicitly in place.
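A call sketch of the in-place form described above (buff lives on every device and holds count floats; the values of rank 0 end up on every device; comm/stream assumed set up):

ncclBcast(buff, count, ncclFloat, 0 /* root rank */, comm, stream);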
All-Reduce
Reduces data arrays of length count in sendbuff using op operation, and leaves identical copies of result on each recvbuff. In-place operation will happen if sendbuff == recvbuff.
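A call sketch, same assumptions as above (pass sendbuff == recvbuff for the in-place variant):

ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);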
Reduce-Scatter
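The notes give no description here; per the NCCL API, ncclReduceScatter reduces sendbuff across ranks and leaves block i of the result in recvbuff of rank i, so sendbuff holds nranks*recvcount elements and recvbuff holds recvcount. A call sketch under the same assumptions as above:

ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat, ncclSum, comm, stream);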
All-Gather
Each device gathers sendcount values from other GPUs into recvbuff, receiving data from rank i at offset i*sendcount. Assumes recvcount is equal to nranks*sendcount, which means that recvbuff should have a size of at least nranks*sendcount elements. In-place operations will happen if sendbuff == recvbuff + rank * sendcount.
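A call sketch, again assuming comm/stream and buffers are ready (recvbuff sized for at least nranks*sendcount elements):

ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);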
When a single thread manages multiple GPUs, group calls are needed to batch the per-rank/per-device calls (otherwise the inter-CPU synchronization they perform would conflict). ncclGroupStart and ncclGroupEnd mark the calls in between as parts of the same operation: ncclGroupStart starts queuing the calls, and ncclGroupEnd waits for them to complete (for collective operations, ncclGroupEnd only guarantees that everything has been enqueued on the CUDA streams, not that the operations have finished). Group semantics can be used with both collective operations and ncclCommInitRank.
When managing multiple GPUs from a single thread, and since NCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for different ranks/devices into a single call. Grouping NCCL calls as being part of the same collective operation is done using ncclGroupStart and ncclGroupEnd. ncclGroupStart will enqueue all collective calls until the ncclGroupEnd call, which will wait for all calls to be complete. Note that for collective communication, ncclGroupEnd only guarantees that the operations are enqueued on the streams, not that the operation is effectively done. Both collective communication and ncclCommInitRank can be used in conjunction with ncclGroupStart/ncclGroupEnd.
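A minimal single-thread sketch, assuming ndev communicators comms[i] plus per-device buffers, streams and device ids already exist (all names here are illustrative):

ncclGroupStart();
for (int i = 0; i < ndev; i++)
  /* one call per communicator; nothing blocks until ncclGroupEnd */
  ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
ncclGroupEnd();                        /* enqueued on the streams, not yet finished */
for (int i = 0; i < ndev; i++) {
  cudaSetDevice(devs[i]);
  cudaStreamSynchronize(streams[i]);   /* wait for the GPU work itself */
}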
ncclResult_t ncclGroupStart()
Start a group call; the calls that follow will not block on inter-CPU synchronization.
Start a group call. All subsequent calls to NCCL may not block due to inter-CPU synchronization.
ncclResult_t ncclGroupEnd()
End the group call; blocks until all operations issued since ncclGroupStart have completed, then returns.
End a group call. Wait for all calls since ncclGroupStart to complete before returning.
ncclResult_t ncclCommDestroy(ncclComm_t comm)
Frees the resources of comm.
Frees resources associated with communicator object
const char* ncclGetErrorString(ncclResult_t result)
Gets the error message for a result code.
Returns a human-readable error message.
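A common checking pattern built on top of it (a sketch, not part of the NCCL API; needs <stdio.h> and <stdlib.h>):

#define NCCLCHECK(cmd) do {                                 \
  ncclResult_t r = (cmd);                                   \
  if (r != ncclSuccess) {                                   \
    fprintf(stderr, "NCCL error %s:%d '%s'\n",              \
            __FILE__, __LINE__, ncclGetErrorString(r));     \
    exit(EXIT_FAILURE);                                     \
  }                                                         \
} while(0)

Used as NCCLCHECK(ncclAllReduce(...)), so every NCCL call is checked.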
ncclResult_t ncclCommCount(const ncclComm_t comm, int* count)
Gets the total number of ranks in the communicator clique.
Gets the number of ranks in the communicator clique.
ncclResult_t ncclCommCuDevice(const ncclComm_t comm, int* device)
Gets the CUDA device associated with this communicator.
Returns the cuda device number associated with the communicator.
ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank)
Gets the rank associated with this communicator.
Returns the user-ordered "rank" associated with the communicator.
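A small query-and-cleanup sketch, assuming an initialized comm (needs <stdio.h>):

int n, dev, rank;
ncclCommCount(comm, &n);        /* total ranks in the clique */
ncclCommCuDevice(comm, &dev);   /* CUDA device bound to this communicator */
ncclCommUserRank(comm, &rank);  /* this communicator's rank */
printf("rank %d/%d on cuda device %d\n", rank, n, dev);
ncclCommDestroy(comm);          /* release the communicator when done */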
Design idea (multi-thread version):
Each thread initializes its own communicator (the ncclId is created in the main thread and is visible to all threads). When a thread's merge operation fails, check whether the failure was caused by some thread having exited.
If the merge failed because a thread exited, every remaining thread re-initializes its own communicator and retries the merge from the previous step (the device count has shrunk at this point, so this is effectively creating a new communicator clique).
Design idea (multi-process version):
Each process initializes its own communicator (during initialization, process 0 broadcasts the ncclId to all processes over TCP). When a process's merge operation fails, check whether the failure was caused by some process having exited. If so, every remaining process re-initializes its own communicator and retries the merge from the previous step (the device count has shrunk at this point, so this is effectively creating a new communicator clique).
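A rough C-like sketch of the re-initialize-and-retry loop described in both designs; aliveRanks(), exchangeNcclId() and merge() are hypothetical helpers of this design, not NCCL functions:

for (;;) {
  int nranks, rank;
  aliveRanks(&nranks, &rank);               /* hypothetical: current membership after any exits */
  ncclUniqueId id = exchangeNcclId(rank);   /* hypothetical: rank 0 creates the id and broadcasts it (e.g. over TCP) */
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);
  if (merge(comm) == 0) break;              /* hypothetical merge step from the design above */
  ncclCommDestroy(comm);                    /* a peer exited: rebuild with the reduced device set */
}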