Paper Notes: Deeper and Wider Siamese Networks for Real-Time Visual Tracking

Date: 2019-12-06

Updated on 2019-04-01 16:10:37

Paper (arXiv V3): https://arxiv.org/pdf/1901.01660.pdf

Code: https://github.com/researchmm/SiamDW (training and testing for SiamFC; testing only for SiamRPN)

1. Background and Motivation:

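This paper addresses a peculiar phenomenon in visual tracking: when the commonly used backbone (e.g., AlexNet) is replaced with a deeper or wider modern network such as ResNet or Inception, tracking accuracy drops instead of improving. The original post illustrates this with a figure.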

The authors find that the following factors have a very large effect on tracking results:

* the receptive field size of neurons;

* network stride;

* feature padding.

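Specifically, the receptive field determines the image region used to compute a feature. A larger receptive field provides more image context, whereas a smaller one may fail to capture the structure of the target.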

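The network stride affects localization precision, especially for small targets. It also controls the size of the output feature map, which in turn affects feature discriminability and detection accuracy.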

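In addition, in a fully convolutional architecture, feature padding introduces a potential position bias during training, so accurate prediction becomes difficult when a target moves close to the boundary of the search region. Together, these three factors prevent Siamese trackers from benefiting from more advanced models.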
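The receptive field and total stride discussed above can be computed layer by layer. The following is a minimal sketch, not the authors' code; the layer list is an assumption (a SiamFC-style AlexNet with unpadded layers, recalled from the SiamFC design rather than taken from these notes), and `receptive_field_and_stride` is a hypothetical helper name.

```python
from typing import List, Tuple

# (kernel_size, stride) per layer. ASSUMPTION: SiamFC-style AlexNet,
# no padding anywhere; this configuration is not listed in the notes.
ALEXNET_LAYERS: List[Tuple[int, int]] = [
    (11, 2),  # conv1
    (3, 2),   # pool1
    (5, 1),   # conv2
    (3, 2),   # pool2
    (3, 1),   # conv3
    (3, 1),   # conv4
    (3, 1),   # conv5
]


def receptive_field_and_stride(layers: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (receptive field, total stride) of the last layer's neurons."""
    rf = 1    # receptive field of one input pixel
    jump = 1  # distance in input pixels between adjacent features
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= stride             # strides multiply along the stack
    return rf, jump


rf, total_stride = receptive_field_and_stride(ALEXNET_LAYERS)
print(f"receptive field = {rf}, total stride = {total_stride}")
```

Under these assumed layers the function gives a receptive field of 87 pixels and a total stride of 8, which sits in the "mid-level feature" regime the notes describe.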

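In this paper, the authors address these problems by designing new network architectures, so that Siamese networks achieve better tracking performance. The main contributions are: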

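1. Based on the "bottleneck" residual block, the authors propose a set of cropping-inside residual (CIR) units. These units eliminate the effect of padding and thereby prevent the convolutional filters from learning a position bias;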

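2. The authors design two kinds of network architectures, deeper and wider networks, by stacking CIR units. In these networks, the stride and the receptive field size are chosen to improve localization accuracy;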

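3. The authors plug the designed backbones into SiamFC and SiamRPN. Their experiments show substantial improvements on multiple datasets. A further advantage is that the proposed networks are lightweight, which lets the trackers run in real time.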

2. Background on Siamese Tracking:

For background on Siamese-network trackers, see the original papers.

3. Analysis of Performance Degradation:

3.1 Performance Analysis:

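Comparing different backbone architectures, the authors find that the internal factors (stride (STR), padding (PAD), receptive field (RF) of neurons in the last layers, and output feature size (OFS)) affect tracking results differently, and that some of them degrade the results severely, as shown in the table in the original post (Table 2 of the paper):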

The authors draw the following conclusions:

1). This illustrates that Siamese trackers prefer mid-level features (stride 4 or 8), which are more precise in object localization than high-level features (stride ≥ 16).

2). For the maximum size of receptive field (RF), the optimum lies in a small range. In the cases of AlexNet, VGG-10 and ResNet-17, the optimal receptive field size is about 60%∼80% of the input exemplar image z size (e.g. 91 vs 127). It illustrates that the size of RF is crucial for feature embedding in a Siamese framework.

3). Only RF in a certain size range allows the feature to abstract the characteristics of the object, and its ideal size is closely related to the size of the exemplar image.

4). For the output feature size, it is observed that a small size (OFS ≤ 3) does not benefit tracking accuracy.

5). Network padding has a highly negative impact on the final performance.
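To make the RF ratio and output feature size (OFS) in these conclusions concrete, here is a small sketch of the size arithmetic. The 91-vs-127 figures come from the notes; the layer list (a SiamFC-style AlexNet without padding) and the 255-pixel search image are assumptions, and `output_size` is a hypothetical helper.

```python
from typing import List, Tuple

# (kernel_size, stride) per layer. ASSUMPTION: SiamFC-style AlexNet, no padding.
LAYERS: List[Tuple[int, int]] = [
    (11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1),
]


def output_size(input_size: int, layers: List[Tuple[int, int]]) -> int:
    """Spatial size of the output feature map for unpadded layers."""
    size = input_size
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size


exemplar_ofs = output_size(127, LAYERS)  # exemplar image z is 127 pixels wide (from the notes)
search_ofs = output_size(255, LAYERS)    # ASSUMPTION: 255-pixel search image
response_size = search_ofs - exemplar_ofs + 1  # size of the cross-correlation map

rf_ratio = 91 / 127  # example from the notes: RF 91 vs exemplar size 127

print(f"exemplar OFS = {exemplar_ofs}, search OFS = {search_ofs}, response = {response_size}")
print(f"RF / exemplar size = {rf_ratio:.2f}")
```

With these assumed layers the exemplar OFS is 6, which is above the OFS ≤ 3 range that the notes report as harmful, and the RF ratio of about 0.72 falls inside the 60% to 80% band.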

Table 2 above shows that AlexNet and VGG-10 contain no padding, whereas Inception and ResNet do.

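The authors find that such padding causes the following problem: it leads to inconsistency between the embeddings of a target object appearing at different positions in search images, and therefore the matching similarity comparison degrades. When an object moves to the image border, the response peak no longer reflects the target's location accurately. Once the tracker fails to localize the target accurately in one frame, it usually starts to drift.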
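The inconsistency can be reproduced in a one-dimensional toy example. This is an illustrative sketch only: the 3-tap kernel, the target pattern, and the background value are made up, and `conv1d_same` and `place` are hypothetical helpers. The same target is embedded once in the middle of the signal and once at the border, using a zero-padded ("same") convolution.

```python
from typing import List


def conv1d_same(signal: List[float], kernel: List[float]) -> List[float]:
    """Correlation with zero padding so the output length equals the input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def place(pattern: List[float], position: int, length: int,
          background: float = 1.0) -> List[float]:
    """Put the target pattern at `position` on a constant background."""
    signal = [background] * length
    signal[position:position + len(pattern)] = pattern
    return signal


KERNEL = [1.0, 1.0, 1.0]   # toy filter (assumption)
PATTERN = [5.0, 7.0, 5.0]  # toy target (assumption)

# Features at the target's own positions, for two target locations.
center = conv1d_same(place(PATTERN, 4, 11), KERNEL)[4:7]
border = conv1d_same(place(PATTERN, 0, 11), KERNEL)[0:3]

print("target in the middle:", center)
print("target at the border:", border)
```

In the middle, every feature sees real background next to the target; at the border, the first feature sees a padded zero instead, so the two embeddings of the same target differ. This is the kind of position-dependent mismatch the notes attribute to padding.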

3.2 Guidelines:

Based on the experiments and observations above, the authors give four basic guidelines for reducing the negative effect of these factors:

* Siamese trackers prefer a relatively small network stride.

* The receptive field of output features should be set based on its ratio to the size of the exemplar image.

* Network stride, receptive field and output feature size should be considered as a whole when designing a network architecture.

* For a fully convolutional Siamese matching network, it is critical to handle the problem of perceptual inconsistency between the two network streams.

4. Deeper and Wider Siamese Networks:

4.1 Cropping-Inside Residual (CIR) Units:

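CIR Unit. The original residual unit contains padding, and the earlier observations indicate that padding introduces position bias in Siamese trackers. The padding effect therefore has to be removed to adapt the unit to Siamese trackers. To this end, the authors augment the residual unit with a cropping operation: a crop is applied after the feature addition (marked in light blue in the figure in the original post). The cropping operation removes the features that are affected by zero-padding signals. Since the padding size in the bottleneck layer is 1, only the outermost features on the border of the feature map are removed. This simple operation removes the padding-affected features of the residual unit.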
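Why a crop of one feature per side is enough can be checked on a single 3x3 convolution. The sketch below is pure Python and covers only the convolution path, not the full CIR unit with its 1x1 layers and residual addition; the image and kernel values are arbitrary, and `conv2d` and `crop_border` are hypothetical helpers. It shows that a zero-padded convolution followed by the crop equals an unpadded convolution, so no remaining feature has seen a padded zero.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix, padding: int = 0) -> Matrix:
    """Stride-1 2D correlation with optional zero padding."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = h + 2 * padding - k + 1
    out_w = w + 2 * padding - k + 1
    return [
        [
            sum(kernel[i][j] * padded[r + i][c + j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def crop_border(feature: Matrix, margin: int = 1) -> Matrix:
    """Drop `margin` rows and columns on every side (the 'cropping-inside' step)."""
    return [row[margin:-margin] for row in feature[margin:-margin]]


image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]  # arbitrary 5x5 input
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]   # arbitrary 3x3 filter

padded_out = conv2d(image, kernel, padding=1)  # 5x5, border features touch zeros
cropped = crop_border(padded_out, margin=1)    # 3x3 after the crop
valid = conv2d(image, kernel, padding=0)       # 3x3, never sees padding

print(cropped == valid)
```

The padded convolution keeps the block's internal shapes unchanged, which is what lets the residual addition work, while the crop afterwards discards exactly the ring of features that depended on padded zeros.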

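Downsampling CIR (CIR-D) Unit. The downsampling residual unit is another key building block in network design. It reduces the spatial size of the feature maps while doubling the number of feature channels. Since this unit also contains padding, the crop operation is applied here as well. The authors change the convolutional stride from 2 to 1 and, as described in the paper, leave spatial downsampling to a max-pooling step after the crop. The key point of these changes is to ensure that only the features affected by padding are removed, while the internal structure of the block stays unchanged.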
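The spatial bookkeeping of the two units can be traced with a few lines of arithmetic. This is a rough sketch under stated assumptions: the paper follows the crop in CIR-D with max-pooling for downsampling, but the 2x2 window with stride 2 used here is an assumption, and both function names are hypothetical.

```python
def cir_output_size(size: int) -> int:
    """CIR unit: the padded 3x3 conv keeps the size, the crop removes 1 per side."""
    after_conv = size            # 3x3 conv, stride 1, padding 1
    return after_conv - 2        # crop the outermost ring of features


def cir_d_output_size(size: int, pool_kernel: int = 2, pool_stride: int = 2) -> int:
    """CIR-D unit: stride-1 conv, crop, then pooling does the downsampling.

    ASSUMPTION: the pooling window and stride are illustrative defaults.
    """
    after_crop = cir_output_size(size)
    return (after_crop - pool_kernel) // pool_stride + 1


size = 30
print("after CIR:  ", cir_output_size(size))
print("after CIR-D:", cir_d_output_size(size))
```

Because the convolution stride stays at 1, the crop removes exactly one ring of padding-affected features before any downsampling happens; the pooling step then reduces the resolution on features that are already free of padding.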

CIR-Inception and CIR-NeXt Units. The authors also apply the same design to multi-branch structures, so that wide networks can be built.

4.2 Network Architecture:

By stacking the units above, the authors build several backbone variants; Table 3 shows four structures of different depths (16, 19, 22 and 43).

In addition, the authors design two wide network structures, namely CIResInception-22 and CIResNeXt-22 in Table 3.

5. Experiments: