Pitfalls encountered while reconfiguring a semantic segmentation experiment environment

The new framework is semseg, by hs-z: https://github.com/hszhao/semseg

1. Installing apex fails with: fatal error: gnu-crypt.h: No such file or directory

This is essentially a problem with the pip package for cryptacular; running conda install cryptacular fixes it.

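2. pip install keeps installing into a different virtual environment

This happens because the pip being invoked does not belong to the current virtual environment. conda install targets the active environment by default, but a pip resolved from elsewhere does not.

So use whereis pip to locate the pip executable of the environment you want, then run pip install through that absolute path.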
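
Another way to sidestep the wrong-pip problem is to invoke pip through the interpreter that is actually running, since python -m pip installs into that interpreter's environment. A minimal sketch, assuming the environment's python is the one you launch; the helper name pip_command is made up for illustration:

```python
import shutil
import sys


def pip_command(*args):
    """Build a pip command line bound to the running interpreter."""
    # sys.executable is the absolute path of the current python,
    # so "-m pip" cannot resolve to another environment's pip.
    return [sys.executable, "-m", "pip", *args]


if __name__ == "__main__":
    print("python in use   :", sys.executable)
    # shutil.which shows what a bare "pip" would resolve to on PATH.
    print("bare pip on PATH:", shutil.which("pip"))
    print("safe command    :", " ".join(pip_command("install", "pyyaml")))
```

If the two printed paths point into different environments, a bare pip install would land in the wrong place.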

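3. About the pip and conda mirrors

As of today, June 10, 2019, the Tsinghua mirror for conda has been shut down over licensing issues, while the Tsinghua mirror for pip still works normally.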

4. ModuleNotFoundError: No module named 'yaml'

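The fix should be to install a package that provides the yaml module (pyaml depends on PyYAML, which is what actually supplies it):

conda install pyaml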

5. TypeError: Class advice impossible in Python3. Use the @implementer class decorator instead

First, switch the active CUDA version so that it matches the CUDA version pytorch was built with.

Then uninstall any apex that is already installed.

Then:

git clone https://github.com/NVIDIA/apex.git && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .

With several virtual environments, things can get mixed up. In that case, use

whereis pip

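to find the pip executable of your current virtual environment, and then run pip through its absolute path.

The same applies to the python used in the apex install step: it can also be invoked through its absolute path, to guarantee the install lands in the right place.

Do not install apex with the following command: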

git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext

The reason is probably that I was inside a conda virtual environment.

Reference: https://github.com/NVIDIA/apex/issues/214#issuecomment-476399539

6. No error message at all, not even under PDB; the process simply disappears at x = self.layer0(x)

Reducing the batch size and the input image size fixes it.

7. Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.

If this warning persists even though apex has been installed correctly, the cause lies deeper.

According to: https://discuss.pytorch.org/t/undefined-symbol-when-import-lltm-cpp-extension/32627/2

there are two solutions:

build cpp extensions with -D_GLIBCXX_USE_CXX11_ABI=1.
build pytorch with -D_GLIBCXX_USE_CXX11_ABI=0

But I do not know how to pass extra compiler flags to the apex build. I tried the export approach suggested there, namely:

export CFLAGS="-D_GLIBCXX_USE_CXX11_ABI=1 $CFLAGS"

and then rebuilt apex, but during compilation the flag was still 0; the export had no effect.

In the end, based on this passage:

The best way to solve this problem in any case is to compile Pytorch from source and use that same compiler for the extension. Then all problems go away.

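I decided to simply build both pytorch and apex from source on the local machine.

Building pytorch from source then turned out to be very hard. I ran into a pile of bugs with no known fix, and eventually decided to just install pytorch again from a package.

The way I had installed pytorch before differed from what I did this time.

Before: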

conda install pytorch torchvision cudatoolkit=10.0

To speed up the download, I had skipped the official pytorch channel and pulled from the default conda channel instead.

However, in Python,

torch._C._GLIBCXX_USE_CXX11_ABI

turned out to be True, which corresponds to

-D_GLIBCXX_USE_CXX11_ABI=1

and that does not meet the requirement.

This time I used the official pytorch channel:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

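Something magical happened: this time the value corresponds to

-D_GLIBCXX_USE_CXX11_ABI=0

The official pytorch channel is the reliable one after all. The default conda channel only repackages the official builds and is not updated promptly.

After that, everything worked.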
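
The ABI check above can be wrapped in a small script that prints the define an extension build would have to match. This is a sketch, not part of apex or pytorch: abi_flag is a hypothetical helper, and torch.compiled_with_cxx11_abi() is used as the public counterpart of the private torch._C._GLIBCXX_USE_CXX11_ABI attribute.

```python
def abi_flag(uses_cxx11_abi):
    """Return the compiler define that matches a pytorch build's C++ ABI."""
    return "-D_GLIBCXX_USE_CXX11_ABI={}".format(1 if uses_cxx11_abi else 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        # Extensions such as apex must be compiled with the same value.
        print(abi_flag(torch.compiled_with_cxx11_abi()))
```

Running it in each candidate environment before building apex shows quickly whether the installed pytorch binary and the extension would agree.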

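To summarize:

1. Official sources are the reliable ones. Do not cut corners, and do not keep trying to compile things yourself; that only creates more problems.

2. When you hit a problem, search the issues of the relevant GitHub repository. This matters a lot. In particular, search in English; Chinese search results are mostly second-hand information.

3. Google's search really is powerful; use Google whenever possible!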

8. ValueError: batch_size should be a positive integer value, but got batch_size=0

In the config file, the setting

batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff

needs to be set to the number of GPUs, although I do not know why.
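
One plausible explanation, stated as an assumption about tool/train.py rather than something verified here: if the training script splits the configured batch size across one process per GPU using integer division, any batch_size_val smaller than the GPU count truncates to 0. The function below is a hypothetical illustration of that arithmetic, not code from the repository.

```python
def per_process_batch_size(total_batch_size, ngpus_per_node):
    """Split a global batch size across one worker process per GPU."""
    # Integer division truncates, so a total below the GPU count gives 0.
    return total_batch_size // ngpus_per_node


if __name__ == "__main__":
    for total in (4, 8, 16):
        print(total, "->", per_process_batch_size(total, 8))
```

Under that assumption, setting batch_size_val to at least the number of GPUs gives every process a batch of one or more.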

9. cv2.error: OpenCV(4.1.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

data_root is wrong, so no image data was actually read.
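
cv2.imread returns None instead of raising when a file cannot be read, which is why the failure only surfaces later in cvtColor. A small pre-flight check can catch a bad data_root earlier. This is a sketch; missing_files is a hypothetical helper, and it assumes you already have the relative image paths from your dataset list.

```python
import os


def missing_files(data_root, relative_paths):
    """Return the full paths that do not exist under data_root."""
    missing = []
    for rel_path in relative_paths:
        full_path = os.path.join(data_root, rel_path)
        if not os.path.isfile(full_path):
            missing.append(full_path)
    return missing
```

Calling it on the first few entries of the training list before starting a run is usually enough to spot a wrong root directory.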

10. Exception: process 0 terminated with signal SIGSEGV

Out of memory.

No real fix yet.

The only workaround is to turn off distributed training; it runs for now, but very slowly.

11. pip downloads are slow

On Linux, edit ~/.pip/pip.conf (create it if it does not exist) and point index-url at the TUNA mirror, as follows:

[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple

12. Check which CUDA version pytorch was built against (after import torch)

print(torch.version.cuda)

13. libSM.so.6: cannot open shared object file: No such file or directory

# https://stackoverflow.com/questions/47113029/importerror-libsm-so-6-cannot-open-shared-object-file-no-such-file-or-directo

pip install opencv-python-headless
# also contrib, if needed
pip install opencv-contrib-python-headless

14. fatal error: gnu-crypt.h: No such file or directory

This appears while installing apex. Run

conda install cryptacular

first, and then install apex.

15. OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).

OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.

Traceback (most recent call last):
  File "tool/train.py", line 456, in <module>
    main()
  File "tool/train.py", line 106, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGABRT

I hit this while trying to set up the environment again on a new server.

Solution:

Add this line to train.sh:

export KMP_INIT_AT_FORK=FALSE

train.sh then looks like this:

#!/bin/sh
PARTITION=gpu
PYTHON=python

dataset=$1
exp_name=$2
exp_dir=exp/${dataset}/${exp_name}
model_dir=${exp_dir}/model
result_dir=${exp_dir}/result
config=config/${dataset}/${dataset}_${exp_name}.yaml
now=$(date +"%Y%m%d_%H%M%S")

mkdir -p ${model_dir} ${result_dir}
cp tool/train.sh tool/train.py ${config} ${exp_dir}

export PYTHONPATH=./
export KMP_INIT_AT_FORK=FALSE
#sbatch -p $PARTITION --gres=gpu:8 -c16 --job-name=train \
$PYTHON -u tool/train.py \
  --config=${config} \
  2>&1 | tee ${model_dir}/train-$now.log


Reference: https://github.com/ContinuumIO/anaconda-issues/issues/11294

16. RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1556653114079/work/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled cuda error

Many sources (e.g. https://github.com/pytorch/pytorch/issues/23534)

say this is an NCCL version problem, so I printed the version (on the command line, before running python):

export NCCL_DEBUG=VERSION

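Then I ran the program and saw that my version was 2.4.8.

At that point the error in the title no longer appeared...

So this error is only a symptom that masks the real one.

But after running for a while the job died on its own, and running it again produced the same error... very strange.

To print debug information, I added: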

export NCCL_DEBUG=info

and found this error:

Cuda failure 'an illegal memory access was encountered'

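There is no universal fix for this problem; everyone's case is a little different.

I suddenly noticed that the error appeared every time the process for GPU3 started initializing. After I stopped using GPU3, the error went away... could the hardware be broken?

After further checks I confirmed that the problem occurs only on GPU3. Across several runs the errors differed, so I record them here.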
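
To isolate a suspect card without touching the training code, the process can be restricted to the remaining GPUs through CUDA_VISIBLE_DEVICES, which has to be set before CUDA is initialized. A minimal sketch, assuming 8 GPUs indexed 0 to 7; visible_devices is a hypothetical helper.

```python
import os


def visible_devices(total_gpus, excluded):
    """Build a CUDA_VISIBLE_DEVICES value that skips the excluded indices."""
    excluded_set = set(excluded)
    keep = [str(i) for i in range(total_gpus) if i not in excluded_set]
    return ",".join(keep)


if __name__ == "__main__":
    # Must run before torch (or any CUDA library) initializes the driver.
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(8, [3])
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that the visible devices are renumbered from 0 inside the process, so GPU indices in the training config refer to positions in this list.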

Error record 1:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 93, in forward
    x = self.layer4(x_tmp)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward
    out = self.conv2(out)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`


Error record 2:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 92, in forward
    x_tmp = self.layer3(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward
    out = self.conv2(out)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1568696969690/work/aten/src/THC/THCGeneral.cpp:216


For error record 1,

according to https://github.com/qqwweee/keras-yolo3/issues/332#issuecomment-517989338

the patch should be installed: https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

Installing it made no difference.

For error record 2, following https://github.com/huggingface/transfer-learning-conv-ai/issues/10#issuecomment-496111466

I added export CUDA_LAUNCH_BLOCKING=1

That produced error record 3:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 90, in forward
    x = self.layer1(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 90, in forward
    residual = self.downsample(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED


After adding the following line it worked for a while:

torch.backends.cudnn.benchmark = True

It held for ten minutes and then broke again, with the same error as record 3...

==================================================================================================================

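I was not sure whether the third card was still to blame, so this time I used all the other cards and trained for longer.

Training on all the other cards together ran for a long time without any problem.

Two conclusions:

1. Using only the third card causes problems.

2. With a single card this is just ordinary training; none of the multithreading, multiprocessing, or distributed code is triggered.

I will try once more to solve it in software. If that fails, the only explanation left is a broken GPU, or a problem with the server.

Some people say it is a cuDNN version problem, so first disable cuDNN: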

torch.backends.cudnn.enabled = False

With cuDNN disabled, training on the third card alone did run.

Then I tested all cards with cuDNN disabled. After 20 iterations it failed:

[2019-09-29 20:03:08,046 INFO train.py line 404 98898] Epoch: [44/200][20/186] Data 0.001 (0.106) Batch 1.274 (1.504) Remain 12:11:15 MainLoss 0.1086 AuxLoss 0.1171 Loss 0.1555 Accuracy 0.9622. terminate called after throwing an instance of 'c10::Error' what(): CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1568696969690/work/c10/cuda/CUDACachingAllocator.cpp:569) frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ffa46ff5477 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x17044 (0x7ffa47231044 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so) frame #2: <unknown function> + 0x1cccb (0x7ffa47236ccb in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so) frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7ffa46fe2e8d in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so) frame #4: <unknown function> + 0x1c2789 (0x7ffa7892f789 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #5: <unknown function> + 0x445d2b (0x7ffa78bb2d2b in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x445d61 (0x7ffa78bb2d61 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #7: <unknown function> + 0x1a184f (0x558f4667b84f in /home/lzx/.conda/envs/seg/bin/python) frame #8: <unknown function> + 0xfd1a8 (0x558f465d71a8 in /home/lzx/.conda/envs/seg/bin/python) frame #9: <unknown function> + 0x10e3c7 (0x558f465e83c7 in /home/lzx/.conda/envs/seg/bin/python) frame #10: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python) frame #11: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python) frame #12: <unknown function> + 0xf5777 (0x558f465cf777 in 
/home/lzx/.conda/envs/seg/bin/python) frame #13: <unknown function> + 0xf57e3 (0x558f465cf7e3 in /home/lzx/.conda/envs/seg/bin/python) frame #14: <unknown function> + 0xf5766 (0x558f465cf766 in /home/lzx/.conda/envs/seg/bin/python) frame #15: <unknown function> + 0x1db5e3 (0x558f466b55e3 in /home/lzx/.conda/envs/seg/bin/python) frame #16: _PyEval_EvalFrameDefault + 0x2a5a (0x558f466a7e4a in /home/lzx/.conda/envs/seg/bin/python) frame #17: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python) frame #18: _PyEval_EvalFrameDefault + 0x6a3 (0x558f466a5a93 in /home/lzx/.conda/envs/seg/bin/python) frame #19: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python) frame #20: _PyEval_EvalFrameDefault + 0x416 (0x558f466a5806 in /home/lzx/.conda/envs/seg/bin/python) frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python) frame #22: _PyFunction_FastCallKeywords + 0x387 (0x558f4663df57 in /home/lzx/.conda/envs/seg/bin/python) frame #23: _PyEval_EvalFrameDefault + 0x14dc (0x558f466a68cc in /home/lzx/.conda/envs/seg/bin/python) frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python) frame #25: PyEval_EvalCodeEx + 0x44 (0x558f465ef424 in /home/lzx/.conda/envs/seg/bin/python) frame #26: PyEval_EvalCode + 0x1c (0x558f465ef44c in /home/lzx/.conda/envs/seg/bin/python) frame #27: <unknown function> + 0x22ab74 (0x558f46704b74 in /home/lzx/.conda/envs/seg/bin/python) frame #28: PyRun_StringFlags + 0x7d (0x558f4670fddd in /home/lzx/.conda/envs/seg/bin/python) frame #29: PyRun_SimpleStringFlags + 0x3f (0x558f4670fe3f in /home/lzx/.conda/envs/seg/bin/python) frame #30: <unknown function> + 0x235f3d (0x558f4670ff3d in /home/lzx/.conda/envs/seg/bin/python) frame #31: _Py_UnixMain + 0x3c (0x558f467102bc in /home/lzx/.conda/envs/seg/bin/python) frame #32: __libc_start_main + 0xf0 (0x7ffa91705830 in /lib/x86_64-linux-gnu/libc.so.6) 
frame #33: <unknown function> + 0x1db062 (0x558f466b5062 in /home/lzx/.conda/envs/seg/bin/python) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) /home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache)) Traceback (most recent call last): File "tool/train.py", line 488, in <module> main() File "tool/train.py", line 114, in main mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args)) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn while not spawn_context.join(): File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join raise Exception(msg) Exception: -- Process 3 terminated with the following error: Traceback (most recent call last): File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap fn(i, *args) File 
"/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 298, in main_worker loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch) File "/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 351, in train scaled_loss.backward() File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/tensor.py", line 120, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph) File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward allow_unreachable=True) # allow_unreachable flag RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED


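But why does it still end with a cuDNN error? Wasn't cuDNN disabled?

According to: https://blog.csdn.net/qq_39938666/article/details/86611474

the Python version may be the problem, so I tried the same version he used, 3.6.6.

I recreated the environment with Python=3.6.6, and it still failed...

Every other card works and only GPU03 fails, which suggests the code is correct and the card itself is the problem. Or there is some incompatibility.

I tried swapping the order of the cards, but the error kept pointing at GPU02. (It used to be GPU03; later, for some unknown reason, it was always GPU02. Could the power connector be the problem?)

The current error (single card, GPU02) is cuda runtime error (77) : an illegal memory access was encountered

Following some answers, e.g. https://ethereum.stackexchange.com/questions/65652/error-cuda-mining-an-illegal-memory-access-was-encountered

I am trying to upgrade CUDA to 10.1; it is currently 10.0.

That did not help either... For now I will just use 7 cards.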

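Update, October 3

Today I more or less solved this problem, by accident.

I mainly followed: https://github.com/pytorch/pytorch/issues/22050#issuecomment-521030783

I recognize this person's avatar: an NVIDIA employee who often answers questions on the pytorch forum.

He said that besides conda, one should also try installing with pip.

So I tried pip, installing pytorch into a different environment. After that, neither the new environment nor the old one showed the problem anymore.

Essentially, the cuDNN in use was probably replaced by the one the pip package brought in. Everyone says this is a cuDNN problem; perhaps the version shipped with the pip package just happens to work.

The command he gave is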

pip3 install torch torchvision

However, as everyone knows, pip may resolve to newer packages over time. So here is what I actually downloaded:

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting torch
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/30/57/d5cceb0799c06733eefce80c395459f28970ebb9e896846ce96ab579a3f1/torch-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (748.8MB)
     |████████████████████████████████| 748.9MB 68kB/s
Collecting torchvision
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/06/e6/a564eba563f7ff53aa7318ff6aaa5bd8385cbda39ed55ba471e95af27d19/torchvision-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (8.8MB)
     |████████████████████████████████| 8.8MB 1.0MB/s
Requirement already satisfied: numpy in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torch) (1.17.2)
Requirement already satisfied: pillow>=4.1.1 in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (6.1.0)
Requirement already satisfied: six in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (1.12.0)
Installing collected packages: torch, torchvision
Successfully installed torch-1.2.0 torchvision-0.4.0

For reference only.
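
Because an unpinned pip install can resolve to different versions later, it may help to record what is actually installed in an environment. A minimal sketch using the standard library's importlib.metadata, which needs Python 3.8 or newer (so it would not run in the 3.6 environment above); installed_versions is a hypothetical helper.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "torchvision"]).items():
        print(name, version)
```

To reproduce the combination from the log above, the versions can be pinned explicitly, for example pip3 install torch==1.2.0 torchvision==0.4.0.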