Paper Notes: SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks

Date: 2019-12-04

2019-04-02 12:44:36

Paper: https://arxiv.org/pdf/1812.11703.pdf

Project: https://lb1100.github.io/SiamRPN++

Official Code: https://github.com/STVIR/pysot

Unofficial PyTorch Implementation: https://github.com/PengBoXiangShang/SiamRPN_plus_plus_PyTorch (supports multi-GPU training and LMDB data preprocessing)

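1. Background and Motivation:

Like another CVPR 2019 paper, Deeper and Wider Siamese Networks for Real-Time Visual Tracking, this paper addresses the problem that Siamese trackers cannot take advantage of deep backbone networks. The authors find experimentally that deeper networks such as ResNet fail to improve tracking accuracy because they destroy strict translation invariance. Since the target may appear anywhere in the search region, the learned feature representation of the target template should remain spatially invariant, and the authors find that, among the many available networks, only AlexNet satisfies this constraint. The authors also propose a layer-wise feature aggregation structure for the cross-correlation operation, which helps the tracker predict the similarity map from features at multiple levels.

In addition, by analyzing the Siamese network, the authors find that the two network branches are highly imbalanced in terms of parameter count. They therefore propose a depth-wise separable correlation structure, which not only greatly reduces the number of parameters in the target template branch but also stabilizes the training of the whole model. Another interesting observation is that objects in the same category have high responses on the same channels, while responses on the remaining channels are suppressed. This orthogonal property may help improve tracking performance.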

2. Analysis on Siamese Networks for Tracking:

Various experiments illustrate how stride and padding affect deep networks.

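3. ResNet-driven Siamese Tracking:

To reduce the impact of the factors above on tracking, the authors modify the original ResNet. The original residual network has a total stride of 32, which strongly affects tracking. The authors therefore reduce the effective strides of the last two blocks from 16 and 32 to 8 and use dilated convolutions to enlarge the receptive field. A 1×1 convolution then reduces the channel dimension to 256. However, the padding parameters are left unchanged, so the spatial size of the template feature map grows to 15. This makes the correlation operation expensive and slows down tracking. The authors therefore crop the central 7×7 region as the template feature, where each feature cell can still capture the entire target region. The authors also find that carefully fine-tuning ResNet further improves performance: setting the learning rate of the ResNet extractor to 1/10 of that of the RPN makes the features better suited to the tracking task.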
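The cropping step can be illustrated with a minimal sketch. It assumes the 7×7 window is taken from the center of the 15×15 template feature map and shows a single channel stored as plain Python lists; the function name `center_crop` and the dummy feature map are hypothetical and are not taken from the official code.

```python
def center_crop(feature, size):
    """Crop a size x size window from the center of a 2-D feature map.

    `feature` is a list of rows (H x W). Only one channel is shown;
    a real feature map would apply the same crop to every channel.
    """
    height, width = len(feature), len(feature[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in feature[top:top + size]]


# Dummy 15x15 template feature map whose cells store their own (row, col),
# so the crop location is easy to inspect.
template_feature = [[(r, c) for c in range(15)] for r in range(15)]
cropped = center_crop(template_feature, 7)
```

With a 15×15 input and a 7×7 window, the crop starts at offset (15 - 7) // 2 = 4, so it covers rows and columns 4 through 10. The correlation kernel shrinks from 225 to 49 cells per channel, which is where the speed-up comes from.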

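4. Layer-wise Aggregation:

The idea here is to aggregate features from multiple layers to strengthen the feature representation and thereby improve tracking. The authors take the outputs of the last three residual blocks, F3(z), F4(z), and F5(z). Because the outputs of the multiple RPN modules have the same spatial resolution, they can be combined directly by a weighted sum, which can be expressed as:

$$S_{all} = \sum_{l=3}^{5} \alpha_l \, S_l, \qquad B_{all} = \sum_{l=3}^{5} \beta_l \, B_l$$

where S_l and B_l are the classification and regression outputs of the RPN module attached to layer l, and α_l and β_l are the corresponding combination weights.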
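The weighted fusion described above can be sketched in a few lines. This is only an illustration: the three toy 2×2 maps stand in for the per-layer RPN outputs, and the weights in `alpha` are made-up constants (in practice such weights would be learned together with the network). The helper name `fuse` is hypothetical.

```python
def fuse(outputs, weights):
    """Element-wise weighted sum of equally sized 2-D maps."""
    rows, cols = len(outputs[0]), len(outputs[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for output, weight in zip(outputs, weights):
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += weight * output[r][c]
    return fused


# Toy stand-ins for the RPN outputs computed from three backbone layers.
s3 = [[1.0, 2.0], [3.0, 4.0]]
s4 = [[0.0, 2.0], [2.0, 0.0]]
s5 = [[4.0, 4.0], [4.0, 4.0]]

# Illustrative (not learned) combination weights, one per layer.
alpha = [0.25, 0.25, 0.5]
s_all = fuse([s3, s4, s5], alpha)
```

The same function would be called a second time with a separate weight list for the regression outputs, since classification and regression do not have to share their combination weights.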

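5. Depthwise Cross Correlation:

The cross-correlation module is the core operation for embedding the information of the two branches. SiamFC uses a cross-correlation layer to obtain a single-channel response map for target localization. SiamRPN extends cross-correlation to embed higher-level information such as anchors by adding a huge convolutional layer that scales up the channels (UP-XCorr). This heavy up-channel module makes the parameter distribution severely imbalanced (the RPN module contains 20M parameters, whereas the feature extractor contains only 4M), which makes SiamRPN very hard to train. The authors therefore propose a lightweight cross-correlation layer, called Depthwise Cross Correlation (DW-XCorr), to achieve more efficient information association. The DW-XCorr layer has 10 times fewer parameters than the UP-XCorr used in the RPN, while performance stays on par.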
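To see why the up-channel layer is so heavy, a rough parameter count helps. The sketch below is a back-of-envelope estimate under assumptions that do not appear in these notes: 256 feature channels, 3×3 kernels, k = 5 anchors per location, and biases and batch-norm parameters ignored. It counts only the template-branch convolutions, so the results are not expected to reproduce the paper's totals exactly.

```python
# ASSUMPTIONS (illustrative, not stated in the notes):
CHANNELS = 256      # feature channels after the 1x1 reduction
KERNEL = 3          # spatial size of the adjustment convolutions
NUM_ANCHORS = 5     # anchors per location (k)


def conv_params(in_channels, out_channels, kernel_size):
    """Weight count of a standard 2-D convolution (no bias)."""
    return in_channels * out_channels * kernel_size * kernel_size


# UP-XCorr: the template feature is lifted to 2k*C channels (classification)
# and 4k*C channels (regression) so it can act as a stack of kernels.
up_cls = conv_params(CHANNELS, 2 * NUM_ANCHORS * CHANNELS, KERNEL)
up_reg = conv_params(CHANNELS, 4 * NUM_ANCHORS * CHANNELS, KERNEL)
up_template_total = up_cls + up_reg

# DW-XCorr: the template feature keeps C channels in both heads.
dw_template_total = 2 * conv_params(CHANNELS, CHANNELS, KERNEL)

print(f"UP-XCorr template convs: {up_template_total / 1e6:.1f}M")
print(f"DW-XCorr template convs: {dw_template_total / 1e6:.1f}M")
```

Under these assumptions the up-channel convolutions alone take roughly 17.7M weights, against roughly 1.2M for the channel-preserving version, which is consistent in order of magnitude with the imbalance described above.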

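To achieve this, the authors use a conv-bn block to adjust the features so that they fit the tracking task. Bounding box prediction and anchor-based classification are both asymmetrical. To encode this difference, the template branch and the search branch each pass through their own non-shared convolutional layer. The two resulting feature maps have the same number of channels and are then correlated channel by channel. Another conv-bn-relu block fuses the outputs of the different channels, and a final convolutional layer produces the classification or regression output.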
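The channel-by-channel correlation step can be sketched as follows. This is a minimal, dependency-free illustration using nested Python lists, not the official implementation; a real tracker would typically express the same operation as a grouped convolution in a deep learning framework. The function names and the toy inputs are hypothetical.

```python
def xcorr2d(search, kernel):
    """Valid 2-D cross-correlation of one search channel with one template channel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(search) - kh + 1
    out_w = len(search[0]) - kw + 1
    return [
        [
            sum(search[i + u][j + v] * kernel[u][v]
                for u in range(kh) for v in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def depthwise_xcorr(search_feat, template_feat):
    """Correlate channel c of the search feature only with channel c of the template."""
    if len(search_feat) != len(template_feat):
        raise ValueError("both features must have the same number of channels")
    return [xcorr2d(s, k) for s, k in zip(search_feat, template_feat)]


# Toy example: 2 channels, 3x3 search feature, 2x2 template feature.
search_feat = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
template_feat = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
response = depthwise_xcorr(search_feat, template_feat)
```

The output keeps one response map per channel, each of spatial size (search size - template size + 1). No channel mixing happens inside the correlation itself, which is why the subsequent conv-bn-relu block is needed to fuse the channels.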

Replacing cross-correlation with depthwise correlation greatly reduces the computational cost and memory usage. It also balances the parameter counts of the template and search branches, which makes training more stable.

Another interesting phenomenon is that objects in the same category have high responses on the same channels, while responses on the remaining channels are suppressed, as the figure in the original post shows.

6. Experimental Results: