cudaErrorCudartUnloading: Troubleshooting and Suggested Fixes

Date: 2019-11-11

Heads up: I am also looking for a development position in heterogeneous computing / high-performance computing / CUDA / ARM optimization.

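For a while now I have been responsible for optimizing our company's neural network forward-pass (inference) framework library. A few days ago I received a bug report with an error message roughly like this: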

Program hit cudaErrorCudartUnloading (error 29) due to "driver shutting down" on CUDA API call to cudaFreeHost.

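Executables linked against the same library sometimes hit this problem and sometimes did not, so my first instinct was that the application using the library had a bug. After ruling that out, the cudaFreeHost at the end of the message led me to assume it was a memory-related problem, and only after some fiddling did I realize I was off track yet again. I also found that whichever CUDA runtime API I called just before the failing code, the same error appeared.
Later I searched online for related information. One bug report I found offers no concrete solution, but its similar call stack made me suspect it was the same problem I had run into, and it shifted my suspicion from cudaFreeHost to "driver shutting down".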

Forcibly preventing "driver shutting down"?

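A seemingly obvious first idea: can we keep the CUDA driver from being shut down while we are still using the CUDA API? The question is what "driver shutting down" actually refers to. Taken literally, cudaErrorCudartUnloading most likely means that the CUDA runtime library has been unloaded.
Since we use shared libraries, I tried adding a dlopen call before the failing code to force libcudart.so to be loaded. I saw at once that this could not be right: if the shared library had really been unloaded, CUDA API calls should fail because the relevant symbols are undefined, rather than the library functions being callable and returning an error code at run time.
In addition, strace showed that other shared libraries such as libcuda.so and libnvidia-fatbinaryloader.so get loaded too, and trying them all is impractical. There are quite a few CUDA-related shared libraries (see "Chapter 5. Listing of Installed Components" in the "NVIDIA Accelerated Linux Graphics Driver README and Installation Guide"), and different programs depend on different ones, so even if this approach worked it would be hard to generalize.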

As it happens, a developer on the NVIDIA developer forum had a similar idea, and an NVIDIA representative rejected it:

For instance, can I have my class maintain certain variables/handles that will force cuda run time library to stay loaded.
No. It is a bad design practice to put calls to the CUDA runtime API in constructors that may run before main and destructors that may run after main.

How do we make the CUDA runtime API work properly?

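As CUDA application developers, we usually issue our commands to the GPU by calling the CUDA runtime API. So let us first look at which players get involved when a program calls the CUDA runtime API. I borrowed a figure from Nicholas Wilt's "The CUDA Handbook":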

{% img http://galoisplusplus.coding.... %}

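As the figure shows, the main players are: the CUDART (CUDA Runtime) library running in the operating system's user mode (as a shared library, this is the libcudart.so mentioned above), the CUDA driver library (as a shared library, the libcuda.so mentioned above), and the CUDA driver kernel modules running in kernel mode. Our CUDA application runs in user mode and cannot operate the GPU hardware directly; the components entitled to control the GPU are the kernel modules running in kernel mode. (Off topic: as CUDA users we rarely notice these kernel modules, and perhaps they are most visible when we run into a "Driver/library version mismatch" error XD.)

On Linux we can list these kernel modules with lsmod | grep nvidia. Typically there are nvidia_uvm, which manages Unified Memory, nvidia_drm, the Linux kernel Direct Rendering Manager display driver, and nvidia_modeset. The component that talks to these kernel modules is the CUDA driver library running in user mode. The CUDA runtime API calls we make are translated by the CUDART library into a series of CUDA driver API calls and handed to the CUDA driver library, which acts as the intermediary between the CUDA kernel modules and the other user-mode CUDA libraries.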
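What lsmod | grep nvidia reports can also be read from a program, since lsmod is essentially a formatted view of /proc/modules. The following is a minimal sketch, assuming the usual format in which the module name is the first whitespace-separated field of each line. The helper names nvidia_modules and nvidia_modules_from_proc are made up for illustration and are not part of any NVIDIA tooling.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the names of kernel modules whose name starts
// with "nvidia", given a stream in /proc/modules format (one module per line,
// module name in the first field).
std::vector<std::string> nvidia_modules(std::istream& in) {
    std::vector<std::string> found;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name;
        if (fields >> name && name.compare(0, 6, "nvidia") == 0) {
            found.push_back(name);
        }
    }
    return found;
}

// Convenience wrapper for the real file. Returns an empty list if the file
// cannot be opened (for example on a non-Linux system).
std::vector<std::string> nvidia_modules_from_proc() {
    std::ifstream proc("/proc/modules");
    return nvidia_modules(proc);
}
```

Taking a stream rather than a path keeps the parsing testable with canned input; on a machine with the driver installed, nvidia_modules_from_proc() would be expected to list modules such as nvidia_uvm and nvidia_drm.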

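So for the commands expressed through the CUDA runtime API to reach the GPU, all of the players above have to cooperate. This naturally raises a question: while our program runs, when do these players start and stop working? When are they initialized? Let us look at the system calls of a CUDA application with strace:
First, shared libraries such as libcudart.so, libcuda.so and libnvidia-fatbinaryloader.so are loaded. /proc/modules, the list of kernel modules currently loaded into the kernel, is read; since nvidia_uvm, nvidia_drm and the other modules were already loaded, no extra insmod is needed. Next, the device parameter file /proc/driver/nvidia/params is read, and the relevant devices, such as /dev/nvidia0 (GPU card 0), /dev/nvidia-uvm (judging by the name it is related to Unified Memory, possibly the Page Migration Engine of Pascal-architecture NVIDIA GPUs) and /dev/nvidiactl, are opened and configured through ioctl. (In addition, some files under ~/.nv/ComputeCache in the home directory are used. This directory caches the binaries produced by JIT-compiling PTX pseudo-assembly. It is unrelated to our problem; interested readers can refer to Mark Harris's "CUDA Pro Tip: Understand Fat Binaries and JIT Caching".) For the CUDA runtime API to execute properly, the loading of these shared libraries, the loading of the kernel modules and the GPU device setup all have to be completed.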

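But the above is only a necessary condition seen from the system-call perspective. There is one more condition that anyone who has written CUDA will know: the CUDA context (if it has slipped your mind, review the part of the official CUDA guide that covers initialization and contexts). We all know that all CUDA resources (allocated memory, CUDA events and so on) and operations are valid only within a CUDA context, and that on the first call to the CUDA runtime API, if the current device has no CUDA context, a new one is created as the device's primary context. These operations are hidden from users of the CUDA runtime API, so who performs them? Let me quote the community wiki answer to a question on SOF (Stack Overflow):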

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error.

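This passage tells us two things. First, nvcc inserts code that does the preparation needed for creating and destroying the CUDA context. Second, calling the CUDA runtime API after the CUDA context has been destroyed may lead to undefined behaviour (UB) such as a runtime error.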

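Let us dig a little deeper. We have a number of .o files produced by compiling .cu files with nvcc, and an executable exe produced by linking them. Inspecting the .o files with tools such as nm, it is easy to see that each has a function whose name starts with __sti____cudaRegisterAll_ inserted into its code section. If we set a breakpoint on one of these functions in gdb <exe> and single-step, we see call stacks like this: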

(gdb) bt
#0  0x00002aaab16695c0 in __cudaRegisterFatBinary () at /usr/local/cuda/lib64/libcudart.so.8.0
#1  0x00002aaaaad3eee1 in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() ()
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:98
#2  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#3  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#4  0x0000000000000001 in  ()
#5  0x00007fffffffe2a8 in  ()
#6  0x0000000000000000 in  ()

After a few more steps, the call stack becomes:

(gdb) bt
#0  0x00002aaab16692b0 in __cudaRegisterFunction () at /usr/local/cuda/lib64/libcudart.so.8.0
#1  0x00002aaaaad3ef3e in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() (__T263=0x7c4b30)
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:97
#2  0x00002aaaaad3ef3e in __sti____cudaRegisterAll_53_tmpxft_000017c3_00000000_19_im2col_compute_61_cpp1_ii_a0760701() ()
    at /tmp/tmpxft_000017c3_00000000-4_im2col.compute_61.cudafe1.stub.c:98
#3  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#5  0x0000000000000001 in  ()
#6  0x00007fffffffe2a8 in  ()
#7  0x0000000000000000 in  ()

(gdb) bt
#0  0x00002aaaaae8ea20 in atexit () at XXX.so
#1  0x00002aaaaaaba3a3 in _dl_init_internal () at /lib64/ld-linux-x86-64.so.2
#2  0x00002aaaaaaac46a in _dl_start_user () at /lib64/ld-linux-x86-64.so.2
#3  0x0000000000000001 in  ()
#4  0x00007fffffffe2a8 in  ()
#5  0x0000000000000000 in  ()

So when is the CUDA context actually created? Setting a breakpoint on cuInit shows that, consistent with the official guide, it happens on the first CUDA runtime API call after entering main:

(gdb) bt
#0  0x00002aaab1ab7440 in cuInit () at /lib64/libcuda.so.1
#1  0x00002aaab167add5 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#2  0x00002aaab167ae31 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#3  0x00002aaabe416bb0 in pthread_once () at /lib64/libpthread.so.0
#4  0x00002aaab16ad919 in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#5  0x00002aaab167700a in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#6  0x00002aaab167aceb in  () at /usr/local/cuda/lib64/libcudart.so.8.0
#7  0x00002aaab16a000a in cudaGetDevice () at /usr/local/cuda/lib64/libcudart.so.8.0
...
#10 0x0000000000405d77 in main(int, char**) (argc=<optimized out>, argv=<optimized out>)
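The pthread_once frame in this backtrace hints at a once-only, lazy initialization: the first runtime API call pays for the setup and later calls skip it. Below is a rough model of that pattern in plain C++ with no CUDA involved. The names fake_cu_init and fake_get_device are invented stand-ins, not real CUDART internals, and the real library may well be structured differently.

```cpp
#include <cstdio>

static int g_init_calls = 0;  // how many times the expensive setup ran

// Stand-in for the driver initialization (cuInit plus context creation).
static bool fake_cu_init() {
    ++g_init_calls;
    std::puts("initializing driver and primary context");
    return true;
}

// Runs fake_cu_init() exactly once, on first use. A function-local static is
// initialized the first time control passes through its declaration, and
// C++11 guarantees that this initialization is thread-safe.
static void ensure_initialized() {
    static const bool initialized = fake_cu_init();
    (void)initialized;
}

// Stand-in for a runtime API entry point such as cudaGetDevice.
int fake_get_device(int* device) {
    ensure_initialized();
    *device = 0;
    return 0;  // 0 plays the role of cudaSuccess
}
```

However many times fake_get_device is called, the setup message is printed only by the first call, which mirrors why cuInit shows up under the first runtime API call made from main.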

The functions involved in preparing for context creation are declared in ${CUDA_PATH}/include/crt/host_runtime.h:

#define __cudaRegisterBinary(X)                                                   \
        __cudaFatCubinHandle = __cudaRegisterFatBinary((void*)&__fatDeviceText); \
        { void (*callback_fp)(void **) =  (void (*)(void **))(X); (*callback_fp)(__cudaFatCubinHandle); }\
        atexit(__cudaUnregisterBinaryUtil)
       

extern "C" {
extern void** CUDARTAPI __cudaRegisterFatBinary(
  void *fatCubin
);

extern void CUDARTAPI __cudaUnregisterFatBinary(
  void **fatCubinHandle
);

extern void CUDARTAPI __cudaRegisterFunction(
        void   **fatCubinHandle,
  const char    *hostFun,
        char    *deviceFun,
  const char    *deviceName,
        int      thread_limit,
        uint3   *tid,
        uint3   *bid,
        dim3    *bDim,
        dim3    *gDim,
        int     *wSize
);
}

static void **__cudaFatCubinHandle;

static void __cdecl __cudaUnregisterBinaryUtil(void)
{
  ____nv_dummy_param_ref((void *)&__cudaFatCubinHandle);
  __cudaUnregisterFatBinary(__cudaFatCubinHandle);
}

These functions are undocumented, though. Dr. Yong Li's "GPGPU-SIM Code Study" is somewhat more detailed, so I quote it directly:

The simplest way to look at how nvcc compiles the ECS (Execution Configuration Syntax) and manages kernel code is to use nvcc’s --cuda switch. This generates a .cu.c file that can be compiled and linked without any support from NVIDIA proprietary tools. It can be thought of as CUDA source files in open source C. Inspection of this file verified how the ECS is managed, and showed how kernel code was managed.

Device code is embedded as a fat binary object in the executable’s .rodata section. It has variable length depending on the kernel code.

For each kernel, a host function with the same name as the kernel is added to the source code.

Before main(..) is called, a function called cudaRegisterAll(..) performs the following work:

• Calls a registration function, cudaRegisterFatBinary(..), with a void pointer to the fat binary data. This is where we can access the kernel code directly.

• For each kernel in the source file, a device function registration function, cudaRegisterFunction(..), is called. With the list of parameters is a pointer to the function mentioned in step 2.

As aforementioned, each ECS is replaced with the following function calls from the execution management category of the CUDA runtime API.

• cudaConfigureCall(..) is called once to set up the launch configuration.

• The function from the second step is called. This calls another function, in which, cudaSetupArgument(..) is called once for each kernel parameter. Then, cudaLaunch(..) launches the kernel with a pointer to the function from the second step.

An unregister function, cudaUnregisterBinaryUtil(..), is called with a handle to the fatbin data on program exit.

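Of these, cudaConfigureCall, cudaSetupArgument and cudaLaunch have been deprecated since CUDA 7.5, and because they are not called before main is entered we can ignore them. What matters is that before main is called, the internal initialization code added by nvcc does the following (we can confirm this from the interface exposed by the host_runtime.h header above and from the related call stacks):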

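• It registers the fat binary entry via __cudaRegisterFatBinary. This is part of the preparation for creating the CUDA context, and calling the CUDA runtime API before __cudaRegisterFatBinary has run is also likely to be UB. There is a question on SOF where the poster launched a kernel from the constructor of a static object and got an "invalid device function" error. The answer by talonmies, a CUDA expert on SOF, examines the order in which static object constructors and __cudaRegisterFatBinary are called and the problems that causes; it is well worth reading.
• It registers each device kernel function via __cudaRegisterFunction.
• It registers the unregistration function __cudaUnregisterBinaryUtil via atexit. This function is part of the cleanup done when the CUDA context is destroyed. As noted earlier, once the CUDA context is destroyed the CUDA runtime API probably can no longer be used normally; in other words, a CUDA runtime API call made after __cudaUnregisterBinaryUtil has run may be UB. When __cudaUnregisterBinaryUtil is called follows the atexit rules: at some stage of program exit after main has finished (for how main is executed, see this article). This is the key to understanding and solving the cudaErrorCudartUnloading problem.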
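The atexit rule that matters most here is that handlers run in reverse order of registration. The sketch below illustrates this with two hypothetical stand-in functions and no real CUDA calls: fake_runtime_teardown plays the role of __cudaUnregisterBinaryUtil, and library_cleanup plays the role of our own cleanup code.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-ins, no CUDA involved.
static int g_next_step = 1;
static int g_cleanup_step = 0;   // 0 means "has not run yet"
static int g_teardown_step = 0;  // 0 means "has not run yet"

static void fake_runtime_teardown() {
    g_teardown_step = g_next_step++;
    std::printf("library_cleanup ran at step %d, runtime teardown at step %d\n",
                g_cleanup_step, g_teardown_step);
}

static void library_cleanup() {
    g_cleanup_step = g_next_step++;
}

// std::atexit handlers run in reverse order of registration, so the handler
// registered last (library_cleanup) runs first at exit.
bool register_exit_handlers() {
    const bool first = std::atexit(fake_runtime_teardown) == 0;
    const bool second = std::atexit(library_cleanup) == 0;
    return first && second;
}
```

Calling register_exit_handlers() once from main and letting the program exit normally reports library_cleanup at step 1 and the runtime teardown at step 2. Swapping the two registrations reverses the result, which is the situation to avoid.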

{% img http://galoisplusplus.coding.... %}

It is all the global object's fault

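Once you have digested my long-winded theory above, tracking down cudaErrorCudartUnloading in the code is simple. It turned out to be similar to the SOF question mentioned earlier: our code also used a global static singleton object, and the singleton's destructor called the CUDA runtime API to free memory and so on. Static objects are destructed during exit after main has finished, and as mentioned __cudaUnregisterBinaryUtil is also called in that phase; the relative order of the two is undefined. If context cleanup such as __cudaUnregisterBinaryUtil runs before the static object is destructed, cudaErrorCudartUnloading is reported. This UB also explains why, among the executables linked against our library, some hit the problem and some did not.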
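The failure mode can be modelled without a GPU. In the sketch below everything is a mock: fake_free_host stands in for a runtime call such as cudaFreeHost, BufferPool is a hypothetical singleton, and the value 29 simply mirrors the error number in the report at the top of this post. It is a simplified model of the ordering problem, not a description of what CUDART does internally.

```cpp
#include <cstdio>
#include <cstdlib>

// Mock error codes; no CUDA call is made anywhere in this sketch.
enum FakeError { kFakeSuccess = 0, kFakeCudartUnloading = 29 };

static bool g_runtime_alive = false;

static void fake_runtime_teardown() { g_runtime_alive = false; }

// Stand-in for a runtime API call: lazily "creates the context" on first use
// and fails once the runtime has been torn down.
FakeError fake_free_host() {
    static bool ever_initialized = false;
    if (!ever_initialized) {
        ever_initialized = true;
        g_runtime_alive = true;
        std::atexit(fake_runtime_teardown);  // teardown registered on first use
    }
    return g_runtime_alive ? kFakeSuccess : kFakeCudartUnloading;
}

// A singleton that makes a "runtime call" in its destructor.
class BufferPool {
public:
    static BufferPool& instance() {
        static BufferPool pool;  // its destructor is registered once constructed
        return pool;
    }
    FakeError release() { return fake_free_host(); }
    ~BufferPool() {
        const FakeError err = fake_free_host();
        std::printf("~BufferPool: fake_free_host returned %d\n",
                    static_cast<int>(err));
    }

private:
    BufferPool() {}
};
```

If a program first obtains BufferPool::instance() and only afterwards makes its first runtime call through release(), the mock teardown is registered after the singleton was constructed. At exit the teardown therefore runs first and the destructor sees error 29. Making a runtime call before the singleton is constructed flips the order, and the destructor succeeds.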

Solutions

Searching GitHub for patches related to cudaErrorCudartUnloading turns up all sorts of approaches. Here are a few.

Skip the `cudaErrorCudartUnloading` check

For example, this patch in the arrayfire project. Fair enough, very laid-back (smirk).

-    CUDA_CHECK(cudaFree(ptr));
+    cudaError_t err = cudaFree(ptr);
+    if (err != cudaErrorCudartUnloading) // see issue #167
+        CUDA_CHECK(err);

Simply remove the CUDA runtime API calls that might return `cudaErrorCudartUnloading`

For example, this issue and PR in the kaldi project. When it comes to being laid-back, this one takes the crown (smirk).

Move the CUDA runtime API calls into a separate de-initialisation or finalize interface that the user calls before `main` `return`s

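For example, MXNotifyShutdown in the MXNet project (see c_api.cc). After all that laid-back patching, here at last is an "elegant" solution that suits this programmer's taste (smirk).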

As it happens, in a comment on another SOF question, talonmies (aha, talonmies again!) expresses the same idea, and I could not agree more:

The obvious answer is don't put CUDA API calls in the destructor. In your class you have an explicit intialisation method not called through the constructor, so why not have an explicit de-initialisation method as well? That way scope becomes a non-issue
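A minimal sketch of such an explicit init/shutdown pair follows. The class name Engine and its methods are hypothetical and are not MXNet's actual interface; the comments mark where real CUDA runtime calls would go, and the sketch itself makes none.

```cpp
#include <cstdio>

// Hypothetical resource owner with explicit lifetime management.
class Engine {
public:
    // Explicit initialization, called by the user from inside main().
    bool init() {
        if (!initialized_) {
            // Real code would acquire device resources here,
            // e.g. with cudaMallocHost.
            initialized_ = true;
        }
        return initialized_;
    }

    // Explicit de-initialization, to be called before main() returns.
    // Safe to call more than once.
    void shutdown() {
        if (initialized_) {
            // Real code would release device resources here,
            // e.g. with cudaFreeHost.
            initialized_ = false;
        }
    }

    bool initialized() const { return initialized_; }

    // The destructor makes no runtime calls; it only reports a missed shutdown.
    ~Engine() {
        if (initialized_) {
            std::fputs("warning: Engine destroyed without shutdown()\n", stderr);
        }
    }

private:
    bool initialized_ = false;
};
```

Because the destructor never touches the runtime, it no longer matters whether an Engine object is global or local: the only requirement left is that shutdown() is called while main is still running.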

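The approach above is "elegant", but it leaves library maintainers with a new worry: what if users push back against the extra interface? (smirk) What if users simply ignore you and never call it before main returns? You could say they are using it wrong, but they could just as well criticize the library implementation for not being robust enough. If you share this worry, read on for the "black magic" below.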

Homegrown black magic (smirk)

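First of all, the CUDA runtime API calls still cannot stay in the destructor of a global object, so where should they go? After all, we do not know which API the library user will call last. We do know, however, which APIs the user calls while inside the scope of main, at a point where a valid CUDA context can be created and the CUDA runtime API works normally. What does that have to do with the CUDA runtime API calls in our destructor? You may remember that the internal initialization code added by nvcc registers the unregistration function __cudaUnregisterBinaryUtil via atexit. We can do the same: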

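// First call a "harmless" CUDA runtime API to make sure the CUDA context has been created before `atexit` is called.
// This ensures that the function we register with `atexit` runs before the CUDA context teardown functions (e.g. `__cudaUnregisterBinaryUtil`).
// "Harmless" here means a function without side effects such as changing memory usage; I used `cudaGetDeviceCount`.
// "The CUDA Handbook" recommends `cudaFree(0);` to make CUDART initialize the CUDA context, which works too.
int gpu_num;
cudaError_t err = cudaGetDeviceCount(&gpu_num);

std::atexit([](){
    // Call the CUDA runtime APIs that used to live in the global object's destructor
});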

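So where should the code above go? The one who tied the knot must untie it: our cudaErrorCudartUnloading problem comes from the static singleton object, but the lazy initialization of the singleton below also gives us a perfect entry point: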

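// Off topic: friends as long in the tooth as I am may remember
// how painful it used to be to implement a thread-safe singleton in C++,
// with slightly nasty idioms such as Double-Checked Locking.
// Since C++11, though, no more aches and pains, and coding is fun again (smirk).
// The implementation below is guaranteed to be thread-safe by the C++11 standard.
static Singleton& instance()
{
     static Singleton s;
     return s;
}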

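Because library users only use the singleton through this interface from within main, it is enough to initialize the CUDA context and register the cleanup function with atexit inside this interface. Of course, as a rigorous library author you may ask: we should have no illusions about library users, so what if someone calls it during the initialization of some global variable? Bingo! All I can say is that our current workflow means library users are unlikely to want to write such code and make life hard for themselves... (facepalm) If a user really does this, the method stops working and they will see an error similar to the one in the SOF question mentioned earlier. After all, black magic is not a cure-all!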
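Putting the two fragments together, the trick can be sketched end to end with a mock runtime. No CUDA is involved here: fake_get_device_count and fake_free_host are invented stand-ins for cudaGetDeviceCount and cudaFreeHost, and the value 29 again mirrors the error number from the report. The registration order is what matters, and the details are an assumption about how one might arrange it rather than our library's actual code.

```cpp
#include <cstdio>
#include <cstdlib>

// Mock runtime state; no CUDA call is made anywhere in this sketch.
static bool g_runtime_alive = false;

static void fake_runtime_teardown() { g_runtime_alive = false; }

// First call "creates the context" and registers the runtime teardown.
static int fake_get_device_count(int* count) {
    static bool ever_initialized = false;
    if (!ever_initialized) {
        ever_initialized = true;
        g_runtime_alive = true;
        std::atexit(fake_runtime_teardown);
    }
    *count = 1;
    return g_runtime_alive ? 0 : 29;
}

static int fake_free_host() { return g_runtime_alive ? 0 : 29; }

class Singleton {
public:
    static Singleton& instance() {
        static Singleton s;  // constructor forces runtime initialization
        // Registered after the runtime teardown and after `s` was constructed,
        // so at exit it runs before either of them is undone.
        static const bool registered =
            (std::atexit(&Singleton::release_at_exit) == 0);
        (void)registered;
        return s;
    }
    bool holds_resources() const { return holds_resources_; }

private:
    Singleton() : holds_resources_(false) {
        int gpu_num = 0;
        holds_resources_ = (fake_get_device_count(&gpu_num) == 0);
    }
    // Does the work that used to live in the destructor.
    static void release_at_exit() {
        Singleton& s = instance();
        if (s.holds_resources_) {
            const int err = fake_free_host();
            s.holds_resources_ = false;
            std::printf("release_at_exit: fake_free_host returned %d\n", err);
        }
    }
    bool holds_resources_;
};
```

When Singleton::instance() is first called from main, the exit sequence is release_at_exit, then the singleton's destructor, then the mock teardown, so the cleanup reports a return value of 0. Registering the handler after the static object has been constructed also keeps the object alive while the cleanup runs.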

Afterword

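After solving the cudaErrorCudartUnloading problem, I was handed another firefighting task: tracking down a crash caused by using a hardware dongle API. The dongle problem and cudaErrorCudartUnloading look completely unrelated, yet in essence they are alike: the same UB; global objects again; and again the dongle API was called during the construction and destruction of a global object, with an undefined order relative to the dongle's internal initialization and teardown functions. It seems that staying out of such pits takes some basic common sense: when using interfaces tied to peripheral devices, make sure the calls happen within the scope of main!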
