Paper Notes: Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

Date: 2019-11-18


2019-03-18 14:45:44

Paper: https://arxiv.org/pdf/1901.02985

Official TensorFlow Code: https://github.com/tensorflow/models/blob/master/research/deeplab/core/nas_network.py

PyTorch Code: https://github.com/Dawars/auto_deeplab-pytorch

Video Tutorial (Korean): https://www.youtube.com/watch?v=ltlhQXHGzgE

Author homepage (Liang-Chieh Chen): http://liangchiehchen.com/

Another work that applies NAS to semantic segmentation: Nekrasov, Vladimir, Hao Chen, Chunhua Shen, and Ian Reid. "Fast Neural Architecture Search of Compact Semantic Segmentation Models via Auxiliary Cells." arXiv preprint arXiv:1810.10804 (2018).

This paper is among the first to bring Neural Architecture Search (NAS) to semantic segmentation: the network architecture used for segmentation is found by automatic search.

3. Architecture Search Space:

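This section describes the two-level hierarchical architecture search space. For the inner cell level, the authors reuse the search space of prior work to stay consistent with it. For the outer network level, they propose a new search space after surveying and summarizing many existing designs.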

3.1 Cell Level Search Space:

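The authors define a cell as a small fully convolutional module, typically repeated many times to form the entire neural network. More specifically, a cell is a directed acyclic graph consisting of B blocks.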

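Each block is a two-branch structure that maps 2 input tensors to 1 output tensor. Block i in cell l can be specified by a five-tuple (I1, I2, O1, O2, C), where I1 and I2 are the choices of input tensors, O1 and O2 are the choices of layer types applied to the corresponding inputs, and C is the method used to combine the individual outputs of the two branches into the block's output tensor $H_i^l$. The cell's output tensor $H^l$ is simply the concatenation of the blocks' output tensors $\{H_1^l, \ldots, H_B^l\}$.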

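The set of possible input tensors, $I_i^l$, consists of the output of the previous cell $H^{l-1}$, the output of the previous-previous cell $H^{l-2}$, and the outputs of the preceding blocks in the current cell $\{H_1^l, \ldots, H_i^l\}$. Therefore, the more blocks are added to a cell, the more input sources the next block can choose from.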

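The set of possible layer types, O, consists of the following 8 operators, all prevalent in modern CNNs: 3×3 depthwise-separable convolution; 5×5 depthwise-separable convolution; 3×3 atrous convolution with rate 2; 5×5 atrous convolution with rate 2; 3×3 average pooling; 3×3 max pooling; skip connection; no connection (zero).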

For the set of possible combination operators, C, the authors use element-wise addition only.
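As a rough illustration of the five-tuple, the sketch below evaluates one toy cell with NumPy. The `OPS` table, the `run_cell` helper, and the two placeholder operators are invented for this example (the real operators are convolutions and poolings). Only the structure is meant to mirror the description: two inputs, two operators, element-wise addition, and concatenation of the block outputs as in the paper.

```python
import numpy as np

# Toy stand-ins for layer types; the real search space uses convolutions and poolings.
OPS = {
    "skip": lambda x: x,
    "double": lambda x: 2.0 * x,   # hypothetical placeholder operator
}

def run_cell(h_prev, h_prev_prev, blocks):
    """Evaluate a cell given as a list of (I1, I2, O1, O2) tuples.

    I1/I2 index into the candidate inputs [H^{l-2}, H^{l-1}, H_1^l, ...];
    the combination C is fixed to element-wise addition.
    """
    candidates = [h_prev_prev, h_prev]
    block_outputs = []
    for i1, i2, o1, o2 in blocks:
        out = OPS[o1](candidates[i1]) + OPS[o2](candidates[i2])
        block_outputs.append(out)
        candidates.append(out)     # later blocks may read this output
    # Cell output: concatenate block outputs along the channel axis.
    return np.concatenate(block_outputs, axis=0)

h_prev_prev = np.ones((1, 2, 2))       # (channels, height, width)
h_prev = np.full((1, 2, 2), 3.0)
cell = [(0, 1, "skip", "double"),      # block 1: H^{l-2} + 2 * H^{l-1}
        (1, 2, "skip", "skip")]        # block 2: H^{l-1} + H_1^l
out = run_cell(h_prev, h_prev_prev, cell)
```

Because each block output is appended to `candidates`, a later block has one more input source than the block before it.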

3.2 Network Level Search Space:

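In NAS frameworks for image classification, once a cell structure is found, the entire network is built from it using a predefined pattern. The network level was therefore not part of the architecture search, and its search space has never been explored.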

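This predefined pattern is simple and straightforward: a number of "normal cells" (cells that keep the spatial resolution of the feature tensor) are separated equally by inserting "reduction cells" (cells that divide the spatial resolution by 2 and multiply the number of filters by 2). This keep-downsampling strategy is reasonable for image classification. In dense image prediction, however, keeping high spatial resolution is also important, which leads to many more network level variations.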

Among the various network architectures for dense image prediction, the authors observe two principles that hold consistently:

1. the spatial resolution of the next layer is either twice as large, twice as small, or the same;

2. the smallest spatial resolution is downsampled by 32.

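Following these common practices, the authors propose the network level search space below. The network begins with a two-layer "stem" structure, each layer of which reduces the spatial resolution by a factor of 2. After that, there are L layers with unknown spatial resolutions, the largest being downsampled by 4 and the smallest by 32. Since consecutive layers may differ in spatial resolution by at most a factor of 2, the first layer after the stem can only be downsampled by 4 or 8. Fig. 1 illustrates this network level search space. The goal is to find a good path through these L layers.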
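The rules of this search space can be written as a small validity check. The function name `is_valid_path` and the example paths are assumptions made for illustration. A path is simply the list of downsample rates chosen for the L layers after the stem.

```python
ALLOWED_RATES = (4, 8, 16, 32)

def is_valid_path(path):
    """Check a list of per-layer downsample rates against the search space rules."""
    if not path or path[0] not in (4, 8):          # the stem output is at rate 4
        return False
    if any(s not in ALLOWED_RATES for s in path):  # rates stay between 4 and 32
        return False
    # Consecutive layers may differ by at most a factor of 2.
    return all(b in (a // 2, a, a * 2) for a, b in zip(path, path[1:]))

encoder_like = [4, 8, 16, 16, 16]           # downsample, then hold the resolution
hourglass_like = [4, 8, 16, 32, 16, 8, 4]   # downsample, then upsample again
skips_a_level = [4, 16, 16]                 # jumps by a factor of 4: not allowed
```

Both hand-written shapes above are valid paths under these rules, which is the sense in which the search space covers encoder-style and hourglass-style designs.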

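Fig. 2 shows that the proposed search space is general enough to cover many popular network designs. In future work, the authors plan to relax this search space further so that it also includes U-Net style architectures.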

Because this work searches the network level architecture in addition to the cell level architecture, the search task is more challenging and more general than in prior work.

4. Methods:

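The authors first introduce a continuous relaxation of the discrete architecture, then describe how to perform architecture search via optimization, and finally explain how to decode a discrete architecture once the search ends.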

4.1 Continuous Relaxation of Architecture:

4.1.1 Cell Architecture:

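The authors reuse the continuous relaxation proposed in prior work. The output tensor of each block, $H_i^l$, is connected to all hidden states in $I_i^l$:

$$H_i^l = \sum_{H_j^l \in I_i^l} O_{j \to i}(H_j^l) \tag{1}$$

In addition, each $O_{j \to i}$ is approximated by its continuous relaxation $\hat{O}_{j \to i}$, defined as:

$$\hat{O}_{j \to i}(H_j^l) = \sum_{O^k \in \mathcal{O}} \alpha_{j \to i}^k \, O^k(H_j^l) \tag{2}$$

where

$$\sum_{k=1}^{|\mathcal{O}|} \alpha_{j \to i}^k = 1 \quad \forall i, j \tag{3}$$

$$\alpha_{j \to i}^k \ge 0 \quad \forall i, j, k \tag{4}$$

In other words, $\alpha_{j \to i}^k$ are normalized scalars associated with each operator $O^k \in \mathcal{O}$, which are easy to implement with a softmax.

Recalling from Section 3.1 that $H^{l-1}$ and $H^{l-2}$ are always included in $I_i^l$ and that $H^l$ is the concatenation of $H_1^l, \ldots, H_B^l$, the cell level update can be summarized as:

$$H^l = \mathrm{Cell}(H^{l-1}, H^{l-2}; \alpha) \tag{5}$$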
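A minimal NumPy sketch of the relaxed operator may help here: instead of picking one operator per connection, all candidates are applied and mixed with softmax weights. The three toy operators and the helper names `softmax` and `mixed_op` are assumptions for this example, not the operators used in the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract the max for numerical stability
    return e / e.sum()

def mixed_op(h, ops, alpha_logits):
    """Continuous relaxation: softmax-weighted sum of all candidate operators."""
    alpha = softmax(alpha_logits)  # non-negative and sums to 1
    return sum(a * op(h) for a, op in zip(alpha, ops))

# Toy operators: identity, "no connection", and a scaling placeholder.
ops = [lambda x: x, lambda x: np.zeros_like(x), lambda x: 2.0 * x]
h = np.ones((2, 2))
uniform = mixed_op(h, ops, np.zeros(3))                 # equal weights: (1 + 0 + 2) / 3
peaked = mixed_op(h, ops, np.array([0.0, 0.0, 50.0]))   # almost the third operator alone
```

As the logits become peaked, the mixture approaches a single discrete operator, which is what makes the later argmax decoding reasonable.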

4.1.2 Network Architecture:

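Within a cell, all tensors have the same spatial size, which makes the weighted sums in Eq. (1) and Eq. (2) possible. However, as Fig. 1 shows, tensors at the network level may have different sizes. To set up the continuous relaxation, each layer l therefore has at most 4 hidden states $\{{}^{4}H^l, {}^{8}H^l, {}^{16}H^l, {}^{32}H^l\}$, with the upper-left superscript indicating the spatial resolution.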

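The network level continuous relaxation is designed to match the search space exactly. A scalar is associated with each gray arrow in Fig. 1, and the network level update is:

$${}^{s}H^l = \beta_{\frac{s}{2} \to s}^l \, \mathrm{Cell}({}^{\frac{s}{2}}H^{l-1}, {}^{s}H^{l-2}; \alpha) + \beta_{s \to s}^l \, \mathrm{Cell}({}^{s}H^{l-1}, {}^{s}H^{l-2}; \alpha) + \beta_{2s \to s}^l \, \mathrm{Cell}({}^{2s}H^{l-1}, {}^{s}H^{l-2}; \alpha) \tag{6}$$

where s = 4, 8, 16, 32 and l = 1, 2, ..., L. The scalars $\beta$ are normalized such that:

$$\beta_{s \to \frac{s}{2}}^l + \beta_{s \to s}^l + \beta_{s \to 2s}^l = 1 \quad \forall s, l \tag{7}$$

$$\beta_{s \to \frac{s}{2}}^l \ge 0, \quad \beta_{s \to s}^l \ge 0, \quad \beta_{s \to 2s}^l \ge 0 \quad \forall s, l \tag{8}$$

which is again implemented with a softmax.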
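The sketch below illustrates the two ingredients of the network level relaxation on scalar toy states: a softmax over the outgoing arrows of each node, and an Eq. (6)-style weighted sum over the incoming arrows. The names `outgoing_betas`, `network_update`, and `toy_cell` are hypothetical. The toy cell is a plain addition that ignores $\alpha$ and all spatial resizing, so this shows the bookkeeping only, not a usable layer.

```python
import numpy as np

RATES = (4, 8, 16, 32)

def outgoing_betas(logits):
    """Softmax over the outgoing edges of one node (dst rate -> raw score)."""
    z = np.array(list(logits.values()), dtype=float)
    p = np.exp(z - z.max())
    return dict(zip(logits.keys(), p / p.sum()))

# One beta per gray arrow; all-zero logits give uniform transition weights.
beta = {}
for src in RATES:
    targets = [d for d in (src // 2, src, src * 2) if d in RATES]
    for dst, weight in outgoing_betas({d: 0.0 for d in targets}).items():
        beta[(src, dst)] = weight

def network_update(h_prev, h_prev_prev, cell):
    """Eq. (6)-style update: weighted sum over the up-to-three incoming arrows."""
    h_next = {}
    for dst in RATES:
        h_next[dst] = sum(beta[(src, dst)] * cell(h_prev[src], h_prev_prev[dst])
                          for src in (dst // 2, dst, dst * 2) if (src, dst) in beta)
    return h_next

toy_cell = lambda a, b: a + b            # placeholder; ignores resizing and alpha
ones = {s: 1.0 for s in RATES}
zeros = {s: 0.0 for s in RATES}
h_next = network_update(ones, zeros, toy_cell)
```

Nodes at rates 4 and 32 have only two outgoing arrows, so their uniform weights are 1/2 rather than 1/3. The normalization is per source node, not per target node.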

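Eq. (6) shows how the continuous relaxations of the two-level hierarchy are woven together. In particular, $\beta$ controls the outer network level and hence depends on the spatial size and the layer index. Each scalar in $\beta$ governs an entire set of $\alpha$, yet $\alpha$ specifies the same architecture, which depends on neither spatial size nor layer index.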

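As Fig. 1 illustrates, Atrous Spatial Pyramid Pooling (ASPP) modules are attached to each spatial resolution at the L-th layer (with atrous rates adjusted accordingly). Their outputs are bilinearly upsampled to the original resolution and then summed to produce the prediction.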

4.2 Optimization:

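The advantage of introducing this continuous relaxation is that the scalars controlling the connection strength between different hidden states are now part of the differentiable computation graph, so they can be optimized efficiently with gradient descent. The authors adopt the first-order approximation and partition the training data into two disjoint sets, trainA and trainB. The optimization alternates between:

1. updating the network weights w by $\nabla_w \mathcal{L}_{trainA}(w, \alpha, \beta)$;

2. updating the architecture parameters $\alpha, \beta$ by $\nabla_{\alpha, \beta} \mathcal{L}_{trainB}(w, \alpha, \beta)$;

where the loss function $\mathcal{L}$ is the cross entropy computed on the semantic segmentation mini-batch.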
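The alternating schedule can be sketched on a deliberately tiny problem. Everything here is a toy assumption: the model `pred = arch * w * x`, the squared-error loss (the paper uses cross entropy), the data, and the learning rate. The scalar `arch` only stands in for the architecture parameters. The point is the loop: one gradient step on the weights using trainA, then one step on the architecture parameter using trainB.

```python
import numpy as np

def loss_and_grads(w, arch, x, y):
    """Squared error of the toy model pred = arch * w * x, with analytic gradients."""
    err = arch * w * x - y
    loss = np.mean(err ** 2)
    grad_w = np.mean(2.0 * err * arch * x)
    grad_arch = np.mean(2.0 * err * w * x)
    return loss, grad_w, grad_arch

x_a, x_b = np.array([1.0, 2.0]), np.array([1.5, 0.5])   # disjoint trainA / trainB
y_a, y_b = 2.0 * x_a, 2.0 * x_b                          # toy target: y = 2x

w, arch, lr = 1.0, 1.0, 0.05
initial_loss = loss_and_grads(w, arch, x_b, y_b)[0]
for _ in range(200):
    _, grad_w, _ = loss_and_grads(w, arch, x_a, y_a)     # step 1: weights on trainA
    w -= lr * grad_w
    _, _, grad_arch = loss_and_grads(w, arch, x_b, y_b)  # step 2: architecture on trainB
    arch -= lr * grad_arch
final_loss = loss_and_grads(w, arch, x_b, y_b)[0]
```

Each step treats the other group of parameters as fixed, which is what the first-order approximation amounts to in this setting.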

4.3 Decoding Discrete Architecture:

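Cell Architecture: The authors decode the discrete cell architecture by first retaining the 2 strongest predecessors for each block, and then choosing the most likely operator for each retained connection by taking the argmax.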
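A possible NumPy sketch of this decoding step for a single block is shown below. One detail is an assumption borrowed from DARTS-style decoding rather than stated in these notes: the strength of a predecessor is taken to be its largest weight among operators other than "no connection". The function name `decode_block` and the example weights are made up.

```python
import numpy as np

def decode_block(alpha, zero_index):
    """alpha: (num_predecessors, num_ops) weights for one block.

    Edge strength is the largest weight among non-'zero' operators (assumption);
    keep the 2 strongest predecessors and pick each one's operator by argmax.
    """
    masked = alpha.copy()
    masked[:, zero_index] = -np.inf            # 'no connection' can never be selected
    strength = masked.max(axis=1)
    keep = np.argsort(-strength)[:2]           # indices of the 2 strongest predecessors
    return [(int(j), int(masked[j].argmax())) for j in sorted(keep)]

# Rows: 3 candidate predecessors; columns: [zero, op_1, op_2].
alpha = np.array([[0.1, 0.6, 0.3],
                  [0.7, 0.2, 0.1],
                  [0.2, 0.3, 0.5]])
decoded = decode_block(alpha, zero_index=0)    # list of (predecessor, operator) pairs
```

The second predecessor is dropped even though its largest raw weight is 0.7, because that weight belongs to the "no connection" operator.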

Network Architecture: Eq. (7) essentially states that the "outgoing probability" at each of the blue nodes in Fig. 1 sums to 1. In fact, the $\beta$ values can be interpreted as "transition probabilities" between different "states" (spatial resolutions) across different "time steps" (layer numbers). Intuitively, the goal is to find the path with the maximum probability from start to end. This path can be decoded efficiently with the classic Viterbi algorithm.
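The path decoding can be sketched with a small Viterbi routine over log-probabilities. The data layout (one dict per layer, keyed by `(src_rate, dst_rate)`), the function name `viterbi_path`, and the transition values are assumptions for this example. In practice the values would come from the learned $\beta$.

```python
import math

def viterbi_path(betas, start_rate=4):
    """betas: one dict per layer mapping (src_rate, dst_rate) -> probability.

    Returns the maximum-probability sequence of downsample rates, one per layer.
    """
    score = {start_rate: 0.0}          # best log-probability of reaching each rate
    back = []                          # back-pointers, one dict per layer
    for layer in betas:
        new_score, pointers = {}, {}
        for (src, dst), p in layer.items():
            if src not in score or p <= 0.0:
                continue
            cand = score[src] + math.log(p)
            if dst not in new_score or cand > new_score[dst]:
                new_score[dst], pointers[dst] = cand, src
        score = new_score
        back.append(pointers)
    rate = max(score, key=score.get)   # best final state
    path = [rate]
    for pointers in reversed(back[1:]):
        rate = pointers[rate]          # follow back-pointers towards layer 1
        path.append(rate)
    return path[::-1]

first = {(4, 4): 0.3, (4, 8): 0.7}     # arrows leaving the stem output (rate 4)
later = {(4, 4): 0.5, (4, 8): 0.5,
         (8, 4): 0.1, (8, 8): 0.2, (8, 16): 0.7,
         (16, 8): 0.2, (16, 16): 0.6, (16, 32): 0.2}
best = viterbi_path([first, later, later])
```

In this example the best path goes 8, 16, 16 with probability 0.7 × 0.7 × 0.6. A greedy choice per layer happens to agree here, but Viterbi guarantees the global maximum in general.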

5. Experimental Results:

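This section first reports the implementation details of the architecture search and its results, and then presents semantic segmentation results on several benchmark datasets.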

5.1 Architecture Search Implementation Details:

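The authors consider a network with L = 12 layers and B = 5 blocks per cell. The network level search space has $2.9 \times 10^4$ unique paths, and the number of cell structures is $5.6 \times 10^{14}$, so the joint hierarchical search space is on the order of $10^{19}$.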
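The cell count can be reproduced with a short calculation under one counting convention, which is an assumption here: block i chooses each of its two inputs independently from i + 1 candidates (2 previous cells plus i - 1 earlier blocks) and each of its two operators from 8 options. The network level figure is taken from the text as given and is not recomputed.

```python
B, NUM_OPS = 5, 8

cell_count = 1
for i in range(1, B + 1):
    num_inputs = i + 1                 # 2 previous cells + (i - 1) earlier blocks
    cell_count *= (num_inputs ** 2) * (NUM_OPS ** 2)   # choices for I1, I2, O1, O2

network_paths = 2.9e4                  # figure quoted in the text, not recomputed here
joint = cell_count * network_paths
```

Under this convention `cell_count` is 518400 × 8^10, roughly 5.6 × 10^14, and `joint` is roughly 1.6 × 10^19, consistent with the quoted order of magnitude.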

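The authors follow the common practice of doubling the number of filters whenever the height and width of the feature tensor are halved. Every blue node in Fig. 1 with downsample rate s has B × F × s/4 output filters, where F is the filter multiplier controlling the model capacity. F = 8 is used during the architecture search. A convolution with stride 2 is used for all s/2 → s connections, both to reduce the spatial size and to double the number of filters. Bilinear upsampling followed by a 1×1 convolution is used for all 2s → s connections, both to increase the spatial size and to halve the number of filters.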

The ASPP module has 5 branches: one 1×1 convolution, three 3×3 convolutions with different atrous rates, and pooled image features. During the search, ASPP is simplified to 3 branches by using only one 3×3 convolution, with atrous rate 96/s. Each ASPP branch still produces B × F × s/4 filters.
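For concreteness, the per-resolution numbers can be tabulated with two small helpers. This assumes the filter count B × F × s/4 given in the paper together with the search-time settings B = 5 and F = 8. The function names are made up for this example.

```python
def output_filters(s, B=5, F=8):
    """Output filters of a node at downsample rate s: B * F * s / 4."""
    return B * F * s // 4

def search_aspp_rate(s):
    """Atrous rate of the single 3x3 ASPP branch used during search: 96 / s."""
    return 96 // s

# Maps downsample rate -> (output filters, ASPP atrous rate).
table = {s: (output_filters(s), search_aspp_rate(s)) for s in (4, 8, 16, 32)}
```

The filter count doubles with each halving of resolution (40, 80, 160, 320), while the atrous rate halves (24, 12, 6, 3), so the ASPP branch covers a similar extent of the input image at every resolution.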

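The architecture search for semantic segmentation is conducted on the Cityscapes dataset. More specifically, the authors take random 321×321 crops from 512×1024 images. Half of the images in train_fine are randomly selected as trainA and the other half as trainB. A highlight of the paper is the search cost: the whole architecture search takes about 3 days on one P100 GPU. The authors also tried searching for longer, but observed no significant benefit. Fig. 4 shows that the validation accuracy improves steadily during the search.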
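A minimal sketch of the data preparation, assuming square 321×321 crops and a uniformly random half/half split. The helper names, the fixed seed, and the blank placeholder image are illustrative only. 2975 is the number of images in Cityscapes train_fine.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(image, size=321):
    """Take a random size x size crop from an (H, W, C) image."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return image[top:top + size, left:left + size]

def split_half(num_images):
    """Randomly split image indices into two disjoint halves (trainA, trainB)."""
    order = rng.permutation(num_images)
    return order[: num_images // 2], order[num_images // 2:]

crop = random_crop(np.zeros((512, 1024, 3)))   # placeholder for a 512x1024 image
train_a, train_b = split_half(2975)            # 2975 images in Cityscapes train_fine
```

Keeping trainA and trainB disjoint matters because the architecture parameters are meant to be updated on data that the weights were not fitted to.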

5.2 Semantic Segmentation Results: