[PyTorch] Docker shared memory problem

Problem

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

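This error appears when training code runs inside a Docker container on a server and the batch size is set too large: the container runs out of shared memory, because Docker limits the size of shm (64 MB by default).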
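To confirm that shm is the bottleneck, it can help to check how large the shared memory mount is inside the container. Below is a minimal sketch using only the standard library. It assumes shared memory is mounted at /dev/shm (the usual Linux location), and the helper name shm_usage_mib is made up for illustration.

```python
import os
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return (total, free) size of the shm mount in MiB, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.free // mib


if __name__ == "__main__":
    # Prints a (total, free) tuple in MiB, or None if /dev/shm does not exist.
    print(shm_usage_mib())
```

A small total here, together with several DataLoader workers and large batches, is consistent with the bus error above.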

According to the PyTorch README:

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.

Solutions

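1. The README shows that PyTorch's IPC relies on shared memory, so the shared memory segment must be large enough. Its size can be raised with docker run --shm-size (for example, --shm-size=8g).
2. Set --ipc=host so that the container uses the host's IPC namespace, including its shared memory.
3. Set the DataLoader's num_workers to 0. This avoids passing batches between processes through shared memory, but training will be slower.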
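Option 3 can also be applied conditionally rather than always. The sketch below falls back to num_workers=0 only when free shared memory is below a threshold. It assumes shared memory is mounted at /dev/shm. The helper name pick_num_workers and the 1 GiB threshold are illustrative assumptions, not PyTorch recommendations.

```python
import os
import shutil


def pick_num_workers(requested, shm_path="/dev/shm", min_free_bytes=1 << 30):
    """Return `requested` workers, or 0 if free shared memory looks too small."""
    if not os.path.isdir(shm_path):
        # No shm mount to inspect (e.g. a non-Linux host): keep the requested value.
        return requested
    free = shutil.disk_usage(shm_path).free
    return requested if free >= min_free_bytes else 0


if __name__ == "__main__":
    workers = pick_num_workers(4)
    print(f"num_workers={workers}")
    # The value would then be passed to the loader, e.g.:
    # loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=workers)
```

This keeps multi-process loading wherever the container was started with enough shm, and only pays the slowdown where it was not.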

yolov3 issue #283

PyTorch on K8S: tracking down the shared memory problem

12 PyTorch pitfalls
