Apex in Practice

Date: 2019-11-30

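apex is an open-source module from NVIDIA for mixed precision training in PyTorch. It makes FP16 training straightforward. The project README describes it as follows: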

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.

The API documentation is at nvidia.github.io/apex

Installation pitfalls

I ran into several problems while building and installing apex, and resolved them by searching the project's issues.

Segmentation fault at runtime

Try gcc 5, which can be installed with conda install -c psi4 gcc-5. See github.com/NVIDIA/apex…

"GLIBCXX_3.4.20' not found"

Locate path_to_anaconda3/lib/libstdc++.so.6 and symlink it into the path that apex loads libstdc++ from, or add that lib directory to the library search path yourself.
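
Before creating the symlink, it can help to confirm that the Anaconda copy of libstdc++ really provides the missing version. The following is a minimal sketch, not part of apex: the helper name has_symbol_version is hypothetical, and it assumes that version tags appear in the library file as NUL-terminated byte strings, which is roughly what running strings on the file and grepping for GLIBCXX would show.

```python
from pathlib import Path


def has_symbol_version(lib_path, version=b"GLIBCXX_3.4.20"):
    """Return True if the library file contains the given version tag.

    Hypothetical helper: it scans the raw bytes of the file rather than
    parsing the ELF structure. The trailing NUL keeps GLIBCXX_3.4.2 from
    matching GLIBCXX_3.4.20.
    """
    data = Path(lib_path).read_bytes()
    return (version + b"\0") in data
```

For example, has_symbol_version("path_to_anaconda3/lib/libstdc++.so.6") should return True for a library worth linking to. If it returns False, that copy is too old as well.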

Errors related to FusedLayerNorm

These are probably caused by the CUDA extensions not being installed. The advice from the issue below is:

"Try a full pip uninstall apex, then cd apex_repo_dir; rm -rf build; python setup.py install --cuda_ext --cpp_ext and see if the segfault persists."

Reference: https://github.com/huggingface/pytorch-pretrained-BERT/issues/284
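
After reinstalling, a quick way to see whether the compiled extensions were actually built is to ask Python whether it can find them. This is a rough sketch using only the standard library. The helper missing_modules is hypothetical, and the extension module name fused_layer_norm_cuda in the usage note is an assumption about what the --cuda_ext build produces, so check it against your apex version.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate.

    Nothing is imported here, so a broken extension will not crash the
    check. It only reports whether the module is installed at all.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules(["apex", "fused_layer_norm_cuda"]) and getting back an empty list suggests the extension is installed. A non-empty list points to a build without --cuda_ext.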

Usage pitfalls

AttributeError: 'NoneType' object has no attribute 'contiguous'

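The model contains unused layers (weights) (example: github.com/FDecaYed/py…). After the backward pass, the gradients of these weights are None, which triggers "AttributeError: 'NoneType' object has no attribute 'contiguous'". See https://github.com/NVIDIA/apex/issues/131

Solutions: 1. patch the apex source so that it checks whether a gradient is None; 2. change the model to remove the unused weights. The second approach is better. Alternatively, wait for an apex update.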
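
To find which weights are unused, one option is to list the parameters whose gradient is still None after a backward pass. The sketch below is my own illustration, not code from apex or the linked issue. The helper params_without_grad is hypothetical and only assumes each parameter exposes requires_grad and grad, as PyTorch parameters do.

```python
def params_without_grad(named_params):
    """Return names of trainable parameters whose gradient is None.

    named_params is an iterable of (name, parameter) pairs, for example
    model.named_parameters() in PyTorch. Call this after loss.backward().
    """
    return [
        name
        for name, param in named_params
        if param.requires_grad and param.grad is None
    ]
```

In a PyTorch training script this would be called as params_without_grad(model.named_parameters()) right after loss.backward(). Any names it returns are candidates for removal from the model.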

p.type().is_cuda() ASSERT FAILED at csrc/fused_adam_cuda.cpp:12

This one was my own mistake: model.cuda() must be called before the FusedAdam optimizer is constructed, otherwise this error is raised.
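
The ordering can be enforced with a small guard that refuses to build the optimizer while any parameter is still on the CPU. This is a sketch under assumptions: build_optimizer_on_gpu is a hypothetical helper, not an apex function, and it relies only on model.parameters() and the is_cuda attribute that PyTorch tensors provide.

```python
def build_optimizer_on_gpu(model, optimizer_cls, **kwargs):
    """Construct optimizer_cls only if every parameter is already on the GPU.

    Raises RuntimeError with the count of CPU parameters otherwise, which
    is easier to read than the assertion from the CUDA extension.
    """
    params = list(model.parameters())
    on_cpu = [index for index, p in enumerate(params) if not p.is_cuda]
    if on_cpu:
        raise RuntimeError(
            f"{len(on_cpu)} parameter(s) are not on the GPU; "
            "call model.cuda() before creating the optimizer"
        )
    return optimizer_cls(params, **kwargs)
```

With apex installed, the intended call would look like build_optimizer_on_gpu(model.cuda(), FusedAdam, lr=1e-3), where FusedAdam comes from apex.optimizers and the learning rate is only a placeholder.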

cuda runtime error (77) : an illegal memory access

I resolved this with the method from this issue: github.com/NVIDIA/apex… I have not found the root cause yet.
