1804.03235-Large scale distributed neural network training through online distillation.md

时间 2019-12-10

标签 1804.03235 large scale distributed neural network training online distillation.md distillation 栏目 CSS 繁體版

原文原文链接

现有分布式模型训练的模式

分布式SGD
- 并行SGD：大规模训练中，一次的最长时间取决于最慢的机器
- 异步SGD：不一样步的数据，有可能致使权重更新向着未知方向
并行多模型：多个集群训练不一样的模型，再组合最终模型，可是会消耗inference运行时
蒸馏：流程复杂
- student训练数据集的选择
  - unlabeled的数据
  - 原始数据
  - 留出来的数据

协同蒸馏

using the same architecture for all the models;
using the same dataset to train all the models; and
using the distillation loss during training before any model has fully converged.

特色
- 就算thacher和student是彻底相同的模型设置，只要其内容足够不一样，也是可以得到有效的提高的
- 便是模型未收敛，收益也是有的
- 丢掉teacher和student的区分，互相训练，也是有好处的
- 不是同步的模型也是能够的。算法

算法简单易懂，并且步骤看上去不是很复杂。app

使用out of state模型权重的解释：框架

every change in weights leads to a change in gradients, but as training progresses towards convergence, weight updates should substantially change only the predictions on a small subset of the training data;
weights (and gradients) are not statistically identifiable as different copies of the weights might have arbitrary scaling differences, permuted hidden units, or otherwise rotated or transformed hidden layer feature space so that averaging gradients does not make sense unless models are extremely similar;
sufficiently out-of-sync copies of the weights will have completely arbitrary differences that change the meaning of individual directions in feature space that are not distinguishable by measuring the loss on the training set;
in contrast, output units have a clear and consistent meaning enforced by the loss function and the training data.

因此这里彷佛是说，随机性的好处？less

一种指导性的实用框架设计：异步

Each worker trains an independent version of the model on a locally available subset of the training data.
Occasionally, workers checkpoint their parameters.
Once this happens, other workers can load the freshest available checkpoints into memory and perform codistillation.
再加上，能够在小一些的集群上使用分布式SGD。

另外论文中提到，这种方式，比起每次直接发送梯度和权重，只须要偶尔载入checkpoint，并且各个模型集群在运算上是彻底相互独立的。这个却是确实能减小一些问题。
可是，若是某个模型垮掉了，彻底没收敛呢？分布式

另外，没看出来这种框架哪里简单了，管理模型和checkpoint不是一个简单的事情。ide

实验结论

20TB的数据，有钱任性性能

论文中提到，并非机器越多，最终模型效果越好，彷佛32-128是比较合适的，更多了，模型收敛速度和性能不会更好，有时反而会有降低。
ui

论文中的实验结果2a，最好的仍是双模型并行，其次是协同蒸馏，最差的是unigram的smooth0.9，label smooth 0.99跟直接训练表现差很少，毕竟只是一个随机噪声。
另外，经过对比相同数据的协同蒸馏2b，和随机数据的协同整理，实验发现，随机数据实际上让模型有更好的表现
3在imagenet上的实验，出现了跟2a差很少的结果。
4中虽然不用非得用最新的模型，可是，协同蒸馏，使用过久远的checkpoint仍是会显著下降训练效率的。this

欠拟合的模型是有用的，可是过拟合的模型在蒸馏中可能不太有价值。
协同蒸馏比双步蒸馏能更快的收敛，并且更有效率。

3.5中介绍的，也是不少时候面临的问题，由于初始化，训练过程的参数不同等问题，可能致使两次训练出来的模型的输出有很大区别。例如分类模型，可能上次训练的在某些分类上准确，而此次训练的，在这些分类上就不许确了。模型平均或者蒸馏法能有效避免这个问题。

总结

balabalabala
实验只尝试了两个模型，多个模型的多种拓扑结构也值得尝试。

很值得一读的一个论文。