[paper reading] DenseNet

GitHub: Notes of Classic Detection Papers

I originally wanted to host these notes on GitHub, but GitHub does not render the formulas.
So I had no choice but to post them on CSDN, where the formatting is also somewhat messy.
I strongly recommend downloading the source files from GitHub to read and study; that gives the best reading experience!
And if you find them useful, a star would be appreciated.

| topic | motivation | technique | key element | use yourself | relativity |
| --- | --- | --- | --- | --- | --- |
| DenseNet | Problem to Solve, Modifications | DenseNet Architecture, Advantages | Dense Block, Transition Layers, Growth Rate, Bottleneck Structure | Bottleneck Structure, Feature Reuse, Transition Layers | blogs, articles |

Motivation

Problem to Solve

DenseNet builds on ResNet and improves it.

In ResNet, the identity function and the output of the weight layers are combined by summation, which can impede the flow of information through the network.

My own understanding: concatenation along the channel dimension better preserves the independence of the information carried by different paths (whereas ResNet's addition can mix features together).

Modifications

Feature Concatenation

  • How features are combined: element-wise summation ==> channel-wise concatenation (see the short comparison below)
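
A minimal PyTorch comparison of the two combination schemes (the tensors and their shapes are purely illustrative):

```python
import torch

x  = torch.randn(1, 64, 32, 32)         # identity / earlier feature map
fx = torch.randn(1, 64, 32, 32)         # output of the weight layers

res_out   = x + fx                      # ResNet: element-wise summation, channels stay 64
dense_out = torch.cat([x, fx], dim=1)   # DenseNet: channel-wise concatenation, channels become 128
```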

Skip Connection

  • DenseNet greatly extends the use of skip connections, which brings a series of advantages

    Training a fully dense network and then pruning it on top is the best approach, as UNet++ argues.

Technique

DenseNet Architecture


The forward propagation formula of DenseNet:

Note: strictly speaking, this formula describes the layers within a single dense block, since concatenation requires feature maps of the same spatial size; between blocks, features are passed through the transition layers.

$$\mathbf{x}_{\ell}=H_{\ell}\left(\left[\mathbf{x}_{0}, \mathbf{x}_{1}, \ldots, \mathbf{x}_{\ell-1}\right]\right)$$

  • $\left[\mathbf{x}_{0}, \mathbf{x}_{1}, \ldots, \mathbf{x}_{\ell-1}\right]$

    The concatenation of the feature maps produced by layers $0, \ldots, \ell-1$.

  • $H_{\ell}$

    A composite function, consisting of three operations in order:

    • Batch Normalization
    • ReLU
    • 3x3 Conv (one)

    In DenseNet-B, $H_{\ell}(\cdot)$ also includes a dimension-compression step, to improve computational efficiency and learn a compact feature representation; see [bottleneck layers](#bottleneck layers). A small code sketch of $H_{\ell}$ follows below.
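
The sketch below is a minimal PyTorch rendering of a single $H_{\ell}$ (BN, ReLU, 3x3 Conv applied to the channel-wise concatenation of all earlier feature maps). The class name `DenseLayer` and its constructor arguments are my own illustrative choices, not the reference implementation.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One H_l: BN -> ReLU -> 3x3 Conv on the concatenation of all earlier feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        # the 3x3 conv always emits exactly `growth_rate` (k) feature maps
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, features):
        # `features` is the list [x_0, x_1, ..., x_{l-1}]; concatenate along the channel dimension
        x = torch.cat(features, dim=1)
        return self.conv(self.relu(self.norm(x)))
```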

Advantages

Parameter Efficiency & Model Compactness

  • parameter efficiency ==> less overfitting (the more efficient use of parameters tends to reduce overfitting to some extent)

    One positive side-effect of the more efficient use of parameters is a tendency of DenseNets to be less prone to overfitting.

  • feature reuse ==> model compactness

This is achieved in two ways:

  • bottleneck structure
  • compression of transition layers

The DenseNet-BC with bottleneck structure and dimension reduction at transition layers is particularly parameter-efficient.

Feature Reuse & Collective Knowledge

  • Collective Knowledge

    Every layer can access the feature maps of all the preceding layers; together, these feature maps from different layers form the network's collective knowledge.

    One explanation for this is that each layer has access to all the preceding feature-maps in its block and, therefore, to the network's “collective knowledge”.

  • Feature Reuse

    An $L$-layer DenseNet has $\frac{L(L+1)}{2}$ connections (layer $\ell$ receives $\ell$ inputs, counting the original input, and summing over all $L$ layers gives $\frac{L(L+1)}{2}$); these connections realize feature reuse.

    • Layers within the same block directly reuse the feature maps of earlier layers through shortcut connections
    • Layers in different blocks reuse the (dimension-reduced) feature maps of earlier layers through the transition layers

    [Figure from the paper: heat map of the average weights that each layer assigns to the feature maps of preceding layers]

    • Deep layers within a block directly use shallow-layer features

      Every layer in a dense block spreads its weights over many inputs from the same block.

      All layers spread their weights over many inputs within the same block. This indicates that features extracted by very early layers are, indeed, directly used by deep layers throughout the same dense block.

    • Transition layers enable indirect feature reuse

      The transition layers also spread their weights over the layers of the preceding dense block.

      indicating information flow from the first to the last layers of the DenseNet through few indirections.

    • Transition-layer outputs are redundant

      The layers in the second and third dense blocks assign the lowest weights to the outputs of the transition layer, indicating that the transition layer outputs many redundant features (consistent with DenseNet-BC, where exactly these outputs are compressed).

      The layers within the second and third dense block consistently assign the least weight to the outputs of the transition layer (the top row of the triangles), indicating that the transition layer outputs many redundant features (with low weight on average). This is in keeping with the strong results of DenseNet-BC where exactly these outputs are compressed.

    • High-level features are still produced in the deep layers

      The final classification layer spreads its weights over all of the last block's outputs, but with a clear concentration on the final feature maps, suggesting that the deep layers of the network still produce high-level features.

      Although the final classification layer, shown on the very right, also uses weights across the entire dense block, there seems to be a concentration towards final feature-maps, suggesting that there may be some more high-level features produced late in the network.

Implicit Deep Supervision

The classifier can directly supervise all layers through short paths (through at most 2 to 3 transition layers), which amounts to an implicit form of deep supervision.

One explanation for the improved accuracy of dense convolutional networks may be that individual layers receive additional supervision from the loss function through the shorter connections.

DenseNets perform a similar deep supervision in an implicit fashion: a single classifier on top of the network provides direct supervision to all layers through at most two or three transition layers.

In fact, ResNet also embodies the idea of deep supervision, i.e., a deep classifier directly supervising the shallow layers; see the paper Residual Networks Behave Like Ensembles of Relatively Shallow Networks, which is discussed in detail in [ResNet](./[paper reading] ResNet.md).

Diversified Depth

  • DenseNet has a close connection to ResNets with stochastic depth

    there is a small probability for any two layers, between the same pooling layers, to be directly connected—if all intermediate layers are randomly dropped.

  • The ensemble-like behavior observed for ResNet also applies to DenseNet

    because the basis of that ensemble-like behavior, a "collection of paths of different lengths", still holds in DenseNet.

Key Element

Dense Block

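A dense block simply stacks such layers, feeding every layer the concatenation of all earlier feature maps and returning the full concatenation as the block output. A minimal sketch, reusing the illustrative `DenseLayer` from the Technique section:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Stack of dense layers; every layer sees the concatenation of all previous outputs."""
    def __init__(self, num_layers, in_channels, growth_rate):
        super().__init__()
        # layer i receives in_channels + i * growth_rate input channels
        self.layers = nn.ModuleList([
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]                        # [x_0]
        for layer in self.layers:
            features.append(layer(features))  # x_l = H_l([x_0, ..., x_{l-1}])
        return torch.cat(features, dim=1)     # block output: all feature maps concatenated
```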

Transition Layers

==> performs the downsampling between dense blocks

Components

A transition layer consists of the following three parts, in order:

  • Batch Normalization
  • 1x1 Conv
  • 2x2 average pooling

Compression

If a dense block produces $m$ feature maps, the following transition layer generates $\lfloor \theta m \rfloor$ feature maps, where the compression factor $\theta$ satisfies $0<\theta \leqslant 1$.

If a dense block contains $m$ feature-maps, we let the following transition layer generate $\lfloor \theta m \rfloor$ output feature-maps, where $0<\theta \leq 1$ is referred to as the compression factor.

In the experiments, $\theta$ is set to 0.5 (a model that uses both $\theta<1$ and the bottleneck structure is called DenseNet-BC).
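
A minimal PyTorch sketch of such a transition layer, combining the three components above with compression (the default `theta=0.5` follows the paper's experiments; the class name is my own):

```python
import math
import torch.nn as nn

class Transition(nn.Module):
    """Between dense blocks: BN -> 1x1 Conv (compression) -> 2x2 average pooling."""
    def __init__(self, in_channels, theta=0.5):
        super().__init__()
        out_channels = math.floor(theta * in_channels)  # generate floor(theta * m) feature maps
        self.norm = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(self.norm(x)))
```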

Growth Rate

  • Practical meaning

    How much new information each layer contributes to the global state (each layer produces $k$ feature maps of its own).

    The growth rate regulates how much new information each layer contributes to the global state.

    • Number of input channels of the $\ell^{th}$ layer ==> the feature maps of all preceding layers stacked along the channel dimension (a small numeric example follows this list):
      $k_0 + k \times (\ell - 1)$

    • Number of output channels of the $\ell^{th}$ layer ==> the fixed growth rate:
      $k$

      Each layer's 3x3 Conv always has $k$ output filters, which is what keeps the output fixed at the growth rate $k$; the bottleneck additionally compresses the layer's input, see [bottleneck layers](#bottleneck layers).
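
A quick numeric check of this bookkeeping (the values $k_0 = 64$ and $k = 32$ are just example numbers chosen here, not a prescription from the paper):

```python
# Channel counts inside one dense block (illustrative values)
k0, k = 64, 32                  # block input channels, growth rate
for l in range(1, 7):           # layers 1..6
    in_ch = k0 + k * (l - 1)    # input channels of layer l
    print(f"layer {l}: in={in_ch}, out={k}")
# layer 1: in=64,  out=32
# ...
# layer 6: in=224, out=32
```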

Bottleneck Structure

  • Reason and advantages

    • Reason

      Without any dimension reduction inside $H_{\ell}$, i.e. if each layer's convolution simply preserved the channel count of its concatenated input instead of always emitting a fixed number of feature maps, the number of channels would grow exponentially with depth.

      As an example, let the block input have $k_0$ channels and let each layer output as many channels as it receives. Then:

      • Channels of the layer-1 feature map:
        $c_1 = k_0$

      • Channels of the layer-2 feature map:
        $c_2 = k_0 + k_0 = 2k_0$

      • Channels of the layer-3 feature map:
        $c_3 = k_0 + k_0 + 2k_0 = 4k_0$

      • ……

      • Channels of the layer-$\ell$ feature map:
        $c_{\ell} = 2^{\ell-1} \cdot k_0$

      This exponential growth cannot be allowed: so many channels would blow up the parameter count and drastically slow the network down. DenseNet therefore fixes every layer's output at $k$ channels (the growth rate), so the input to layer $\ell$ grows only linearly as $k_0 + k(\ell-1)$, and the 1x1 bottleneck convolution further reduces that input before the 3x3 convolution.

    • Advantages

      1. Improves computational efficiency
      2. Learns a compact feature representation
  • How it works

    Note: the 1x1 Conv is placed **before** the 3x3 Conv (the regular operation), so that the input feature maps are reduced in dimension first.

    Otherwise there would be no gain in computational efficiency.

    A 1x1 Conv is added in front of every 3x3 Conv to compress the channel dimension.

    a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency.

    BN-ReLU-Conv(1×1) ==> BN-ReLU-Conv(3×3)

  • Parameter setting

    In the paper, each 1x1 Conv produces $4k$ feature maps (the corresponding architecture is called DenseNet-B); see the sketch below.
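
A minimal PyTorch sketch of a DenseNet-B layer, i.e. BN-ReLU-Conv(1x1) compressing the input to $4k$ channels followed by BN-ReLU-Conv(3x3) producing the $k$ new feature maps (the class name and arguments are illustrative):

```python
import torch
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """DenseNet-B layer: 1x1 conv reduces the concatenated input to 4k channels,
    then the 3x3 conv produces the k new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter_channels = 4 * growth_rate  # 4k, as set in the paper
        self.norm1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False)
        self.norm2 = nn.BatchNorm2d(inter_channels)
        self.conv2 = nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, features):
        x = torch.cat(features, dim=1)               # [x_0, ..., x_{l-1}]
        x = self.conv1(self.relu(self.norm1(x)))     # 1x1 bottleneck: -> 4k channels
        return self.conv2(self.relu(self.norm2(x)))  # 3x3 conv: -> k channels
```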

Math

This paper does not contain many mathematical formulas, so the math has been distributed across the sections above.

Use Yourself

[Bottleneck Structure](#Bottleneck Structure)

The bottleneck structure works at the layer level (inside each dense layer) and helps with:

  • controlling the channel dimension
  • improving parameter efficiency
  • improving computational efficiency

[Transition Layers](#Transition Layers)

Transition layers work at the block level (between dense blocks); their advantages are similar to those of the bottleneck structure:

  • controlling the channel dimension
  • improving parameter efficiency
  • improving computational efficiency

[Feature Reuse](#Feature Reuse)

Feature reuse has the following advantages:

  • multi-level: both low-level and high-level features can be exploited at the same time
  • multi-scale: low-level features usually have higher spatial resolution, while high-level features usually have lower spatial resolution
  • model compactness: avoids learning redundant features repeatedly

Articles

Blogs