PyTorch Series | How to Speed Up Your Model Training

Date 2019-11-06


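Original title | Speed Up your Algorithms Part 1 — PyTorch

Author | Puneet Grover

Translator | kbsc13 (author of the "算法猿的成长" WeChat public account)

Original article | towardsdatascience.com/speed-up-yo…

Statement | This translation is for exchange and learning purposes. Reposting is welcome, but please keep the attribution and do not use it for commercial or illegal purposes.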

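Preface

This article covers how to use cuda and pycuda to check and initialize GPU devices and make your algorithms run faster.

PyTorch is the Python version of torch, a deep learning framework developed and open-sourced by Facebook's AI research group. It is currently very popular, especially among researchers, and within just a few years it has been catching up with TensorFlow, mainly thanks to its simplicity and dynamic computation graphs.

pycuda is a third-party Python library for working with Nvidia's CUDA parallel computing API.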

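Contents:

How to check whether cuda is available?
How to get more information about cuda devices?
How to store Tensors and run models on the GPU
How to select and use GPUs when you have several
Data parallelism
Comparison of data parallelism
torch.multiprocessing

The code for this article is in a Jupyter notebook, available on GitHub at: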

nbviewer.jupyter.org/github/Pune…

1. How to check whether cuda is available?

Checking whether cuda is available takes very little code:

import torch
torch.cuda.is_available()
# True
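A common follow-up to this check is to choose a device once and reuse it everywhere. The following is a minimal sketch that is not part of the original article; the names device and x are illustrative, and the code falls back to the CPU when no GPU is present.

```python
import torch

# Fall back to the CPU when no CUDA device is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors created with device=... live on the chosen device.
x = torch.ones(2, 3, device=device)
print(x.device.type)  # "cuda" or "cpu", depending on the machine
```

Writing the rest of the code against device keeps the same script usable on machines with and without a GPU.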

2. How to get more information about cuda devices?

torch.cuda is enough for basic device information, but for more detail you need pycuda.

The code is as follows:

import torch
import pycuda.driver as cuda
cuda.init()
## Get Id of default device
torch.cuda.current_device()
# 0
cuda.Device(0).name() # '0' is the id of your GPU
# Tesla K80

Or, alternatively:

torch.cuda.get_device_name(0) # Get the name of the device with ID '0'
# 'Tesla K80'

Here is a simple class for retrieving cuda information:

# A simple class to know about your cuda devices
import pycuda.driver as cuda
import pycuda.autoinit # Necessary for using its functions
cuda.init() # Necessary for using its functions

class aboutCudaDevices():
    def __init__(self):
        pass

    def num_devices(self):
        """Return the number of cuda devices."""
        return cuda.Device.count()

    def devices(self):
        """Print the names of all available devices."""
        num = cuda.Device.count()
        print("%d device(s) found:"%num)
        for i in range(num):
            print(cuda.Device(i).name(), "(Id: %d)"%i)

    def mem_info(self):
        """Print the free and total memory of the device in the current context."""
        available, total = cuda.mem_get_info()
        print("Available: %.2f GB\nTotal: %.2f GB"%(available/1e9, total/1e9))

    def attributes(self, device_id=0):
        """Return the attributes of the device with the given id."""
        return cuda.Device(device_id).get_attributes()

    def __repr__(self):
        """Return the number of devices with their ids and total memory."""
        num = cuda.Device.count()
        string = ""
        string += ("%d device(s) found:\n"%num)
        for i in range(num):
            string += ( " %d) %s (Id: %d)\n"%((i+1),cuda.Device(i).name(),i))
            string += (" Memory: %.2f GB\n"%(cuda.Device(i).total_memory()/1e9))
        return string

# You can print output just by typing its name (__repr__):
aboutCudaDevices()
# 1 device(s) found:
# 1) Tesla K80 (Id: 0)
# Memory: 12.00 GB

To see how much memory is currently in use, query it as follows:

import torch
# Returns the current GPU memory occupied by
# tensors in bytes for a given device
torch.cuda.memory_allocated()
# Returns the current GPU memory managed by the
# caching allocator in bytes for a given device
# (named memory_cached() in PyTorch versions before 1.4)
torch.cuda.memory_reserved()

To clear the cuda cache:

# Releases all unoccupied cached memory currently held by
# the caching allocator so that it can be used by other
# GPU applications and becomes visible in nvidia-smi
torch.cuda.empty_cache()

Note, however, that this function does not free GPU memory occupied by tensors, so it cannot increase the amount of GPU memory available to PyTorch.
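The difference between memory held by tensors and memory held by the caching allocator can be observed directly. The sketch below is an illustration rather than code from the original article: gpu_memory_mb is a hypothetical helper, torch.cuda.memory_reserved() is assumed to be available (PyTorch 1.4 or later), and the GPU part only runs when a CUDA device is present.

```python
import torch

def gpu_memory_mb():
    """Return (allocated, reserved) in MB for the current device, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    mb = 1024 ** 2
    return (torch.cuda.memory_allocated() / mb,
            torch.cuda.memory_reserved() / mb)

if torch.cuda.is_available():
    big = torch.empty(1024, 1024, 64, device="cuda")  # 256 MB of float32
    print("after allocation:       ", gpu_memory_mb())
    torch.cuda.empty_cache()  # big is still alive, so its memory stays allocated
    print("after empty_cache:      ", gpu_memory_mb())
    del big                   # drop the last reference to the tensor
    torch.cuda.empty_cache()  # the cached block can now be returned to the driver
    print("after del + empty_cache:", gpu_memory_mb())
else:
    print("No CUDA device found; nothing to measure.")
```

The allocated figure should only drop once the tensor itself is deleted, which is the behaviour described above.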

3. How to store Tensors and run models on the GPU

To store a variable on the CPU, write:

a = torch.DoubleTensor([1., 2.])

The variable a stays on the CPU, and all operations on it run on the CPU. To move it to the GPU, use .cuda; there are two ways to do so:

# Method 1
a = torch.FloatTensor([1., 2.]).cuda()
# Method 2
a = torch.cuda.FloatTensor([1., 2.])

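This places the tensor on the default GPU, which is the first one unless changed. There are two ways to check which device is used:

# Method 1
torch.cuda.current_device()
# 0

# Method 2
a.get_device()
# 0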

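You can also run a model on the GPU. As an example, define a simple model with nn.Sequential:

import torch.nn as nn

sq = nn.Sequential(
         nn.Linear(20, 20),
         nn.ReLU(),
         nn.Linear(20, 4),
         nn.Softmax(dim=1)  # dim is given explicitly; omitting it is deprecated
)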

Then move it to the GPU:

model = sq.cuda()

How can you tell whether the model is on the GPU? Check whether its parameters are on the GPU:

# From the discussion here: https://discuss.pytorch.org/t/how-to-check-if-model-is-on-cuda/180

next(model.parameters()).is_cuda
# True
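A model on the GPU also needs its inputs on the same device. The following is a minimal sketch under that assumption, not code from the original article; the batch size of 8 and the names device, batch and probs are illustrative, and the code falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Same architecture as above, moved to the chosen device.
model = nn.Sequential(
    nn.Linear(20, 20),
    nn.ReLU(),
    nn.Linear(20, 4),
    nn.Softmax(dim=1),
).to(device)

# Inputs must live on the same device as the model's parameters.
batch = torch.randn(8, 20, device=device)
with torch.no_grad():
    probs = model(batch)

print(probs.shape)   # torch.Size([8, 4])
print(probs.device)  # same device as the model
```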

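4. How to select and use GPUs when you have several

Suppose you have 3 GPUs. You can initialize tensors on, or assign them to, any specific GPU. There are 3 ways to place a tensor on a specified GPU:

Pass the device argument when creating the tensor
.to(cuda_id)
.cuda(cuda_id)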

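cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')
cuda2 = torch.device('cuda:2')

# A bare .cuda() call places the tensor on the current device (cuda:0 by default)
# The 3 ways to place a tensor on a specific GPU.
# torch.tensor is used because the legacy torch.Tensor constructor
# rejects a CUDA device argument.
x = torch.tensor([1., 2.], device=cuda1)
# Or
x = torch.tensor([1., 2.]).to(cuda1)
# Or
x = torch.tensor([1., 2.]).cuda(cuda1)

# Change the default device by passing the id of the GPU to use
torch.cuda.set_device(2)
# The CUDA_VISIBLE_DEVICES environment variable controls how many and which
# GPUs are visible; set it before CUDA is initialized in the process
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"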
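Hard-coded ids such as cuda:1 fail on machines with fewer GPUs. The sketch below shows one way to guard against that; pick_device is a hypothetical helper introduced for illustration, not a PyTorch function, and it falls back to the CPU when the requested GPU does not exist.

```python
import torch

def pick_device(index=0):
    """Return cuda:<index> if that GPU exists, otherwise fall back to the CPU."""
    if torch.cuda.is_available() and index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")

# List every visible GPU (prints nothing on a CPU-only machine).
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

# Ask for the second GPU, but keep working if it is missing.
x = torch.tensor([1.0, 2.0], device=pick_device(1))
print(x.device)
```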

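With multiple GPUs you can split an application's work across them, at the cost of communication between devices. If the GPUs do not need to exchange information frequently, this cost is negligible.

There is one more issue. In PyTorch, all GPU operations are asynchronous by default. PyTorch performs the necessary synchronization when copying data between CPU and GPU or between two GPUs, but when you create your own stream with torch.cuda.Stream(), you must handle synchronization yourself.

The following example from the official documentation shows incorrect usage: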

cuda = torch.device('cuda')
# Create a new stream
s = torch.cuda.Stream()
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)
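One way to repair the incorrect example above is to make the side stream wait for the work already queued on the current stream. This is a sketch rather than the only valid fix; it relies on torch.cuda.Stream.wait_stream and torch.cuda.current_stream, and it falls back to a plain CPU computation when no GPU is present.

```python
import torch

if torch.cuda.is_available():
    cuda = torch.device("cuda")
    s = torch.cuda.Stream()  # a new side stream
    A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)

    # Make the side stream wait until normal_() has finished on the current stream.
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        B = torch.sum(A)

    # Make the current stream wait for the side stream before B is used on it.
    torch.cuda.current_stream().wait_stream(s)
else:
    # No GPU means no streams; compute the same quantity on the CPU.
    A = torch.empty((100, 100)).normal_(0.0, 1.0)
    B = torch.sum(A)

print(B.item())
```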

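To make full use of multiple GPUs, you can:

Use the GPUs for different tasks or applications;
With multiple models, give each GPU its own model together with its own copy of the fully preprocessed data;
Give each GPU a slice of the input and a copy of the model, so that each GPU computes its result separately and sends it to a single GPU for further computation.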

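5. Data parallelism

Data parallelism splits the data into several parts and sends them to multiple GPUs for parallel computation.

In PyTorch, data parallelism is implemented with torch.nn.DataParallel.

Below is a simple example. The first way to implement data parallelism is to use several functions in nn.parallel, which do the following:

Replicate: copy the model to multiple GPUs;
Scatter: split the input along its first dimension (usually the batch size) and send the parts to multiple GPUs;
Gather: collect the data sent back from the GPUs and concatenate it again;
parallel_apply: apply the distributed inputs obtained from Scatter to the model copies obtained from Replicate.

The code is as follows: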

import torch.nn as nn

# module, input, device_ids and output_device are assumed to be defined elsewhere
# Replicate module to devices in device_ids
replicas = nn.parallel.replicate(module, device_ids)
# Distribute input to devices in device_ids
inputs = nn.parallel.scatter(input, device_ids)
# Apply the models to corresponding inputs
outputs = nn.parallel.parallel_apply(replicas, inputs)
# Gather result from all devices to output_device
result = nn.parallel.gather(outputs, output_device)
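The scatter and gather steps can be illustrated on the CPU without any GPU. The sketch below is a stand-in, not the nn.parallel implementation: torch.chunk plays the role of scatter, a plain loop replaces parallel execution, and torch.cat plays the role of gather. The batch shape and the doubling operation are arbitrary choices for illustration.

```python
import torch

# A batch of 8 samples with 3 features each.
batch = torch.arange(24, dtype=torch.float32).reshape(8, 3)

# "Scatter": split along dimension 0, one chunk per hypothetical device.
chunks = torch.chunk(batch, 4, dim=0)
print([tuple(c.shape) for c in chunks])  # [(2, 3), (2, 3), (2, 3), (2, 3)]

# "parallel_apply": the same operation on every chunk (sequential here).
outputs = [c * 2 for c in chunks]

# "Gather": concatenate the partial results along dimension 0.
result = torch.cat(outputs, dim=0)
print(torch.equal(result, batch * 2))  # True
```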

In practice there is a simpler and more commonly used approach that wraps the model in a single line of code:

model = nn.DataParallel(model, device_ids=device_ids)
result = model(input)
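A self-contained version of the DataParallel approach might look like the sketch below. It is an illustration under stated assumptions: the nn.Linear model and the batch of 16 are arbitrary, and the model is wrapped only when more than one GPU is visible, so the same code also runs on a single GPU or on the CPU.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(20, 4)

# Wrap the model only when there is more than one GPU to split the batch over.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # uses all visible GPUs by default
model = model.to(device)

batch = torch.randn(16, 20, device=device)
out = model(batch)  # the batch is split along dimension 0 across the GPUs
print(out.shape)    # torch.Size([16, 4])
```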

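6. Comparison of data parallelism

The article medium.com/@iliakarman… and the GitHub repository github.com/ilkarman/De… compare the speed of different frameworks on a single GPU and on 4 GPUs; the results are shown below:

The figure shows that, despite the communication cost between GPUs, data parallelism gives a clear speedup. PyTorch is second only to Chainer in speed, but its data parallelism is very simple to use: one line of code is enough.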

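7. torch.multiprocessing

torch.multiprocessing is a wrapper around Python's multiprocessing module and is 100% compatible with it, so you can use the original module's Queue, Pipe, Array and so on. To speed things up, it also adds a new method, share_memory_(), which moves data into a special state in which any process can use it directly without copying.

With this method you can share Tensors and model parameters, on either the CPU or the GPU.

Below is an example of training a model with multiple processes: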

# Training a model using multiple processes:
# (data_loader, optimizer, loss_fn, n_in, n_h1 and n_out are assumed to be defined elsewhere)
import torch.nn as nn
import torch.multiprocessing as mp
def train(model):
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters
model = nn.Sequential(nn.Linear(n_in, n_h1),
                      nn.ReLU(),
                      nn.Linear(n_h1, n_out))
model.share_memory() # Required for 'fork' method to work
processes = []
for i in range(4): # No. of processes
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()
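The effect of share_memory_() can be seen in a smaller, self-contained sketch. This is an illustration rather than code from the original article: make_shared, write_rank and run are hypothetical helpers, and each worker writes to its own slot of a shared CPU tensor so that the processes never touch the same element.

```python
import torch
import torch.multiprocessing as mp

def make_shared(n):
    """Create a CPU tensor of n zeros and move it into shared memory."""
    t = torch.zeros(n)
    t.share_memory_()  # in-place; afterwards t.is_shared() is True
    return t

def write_rank(shared, rank):
    """Worker: write into this process's own slot of the shared tensor."""
    shared[rank] = rank + 1

def run(num_workers=4):
    shared = make_shared(num_workers)
    procs = [mp.Process(target=write_rank, args=(shared, rank))
             for rank in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return shared  # the parent sees the values written by the workers

if __name__ == "__main__":
    print(run())
```

Because the tensor lives in shared memory, the workers modify the same storage that the parent process reads after join(), without any copy being sent back.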

More usage examples can be found in the official documentation:

pytorch.org/docs/stable…
