LIFT Paper Explained

LIFT

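The authors provide Theano and TensorFlow implementations. The paper is a relatively early exploration of using CNNs to learn local features. It comes from the CVLab at EPFL (Swiss Federal Institute of Technology in Lausanne), a group that has done a lot of earlier work on deep features and 3D vision, so it is well worth studying.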

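Main Idea

The paper uses CNNs for keypoint detection, orientation estimation, and descriptor computation, and learns all three sub-tasks within a single unified framework.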

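Basic Workflow

Overall pipeline

The overall pipeline of the unified framework, LIFT, is shown in the figure below:

It has three main components: Detector, Orientation Estimator, and Descriptor. Each sub-task is a separate CNN. Earlier work (TILDE, Learn Orientation, and DeepDesc) has already shown that a CNN can handle each task well on its own; this paper puts them into one framework and learns them together, and the whole architecture is end-to-end differentiable.

To combine the three tasks, the outputs of the Detector and the Orientation Estimator are passed through Spatial Transformer layers to produce the patch that serves as input to the Descriptor.

The traditional non-local maximum suppression (NMS) step is replaced with a soft argmax. The benefit is that the whole pipeline becomes differentiable and can be trained jointly with back-propagation. According to the paper, no similar work existed before and this is the first attempt of its kind.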

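Network architecture

The overall network architecture of LIFT is shown in the figure below:

The network takes image patches as input rather than whole images, mainly because most regions of an image contain no keypoints. These patches are extracted around keypoints produced by SIFT-based SfM, as discussed in detail later. One question: patches can be produced this way for training, but how are they obtained at test time? In addition, the image patches are chosen to be as small as possible, so that at the given scale each patch contains only one dominant keypoint, which reduces the time spent searching for other keypoints in the patch. This seems a bit odd: can this still be called a keypoint Detector?

The overall architecture has four branches, and each branch contains the three CNNs: Detector, Orientation Estimator, and Descriptor.

During training, the input is a quadruplet of image patches. The quadruplet contains two matching patches, $\mathbf{P}^1$ and $\mathbf{P}^2$ (the same 3D point seen from different views); a third patch $\mathbf{P}^3$ that corresponds to a different 3D point; and a fourth patch $\mathbf{P}^4$ that corresponds to no 3D point and contains no keypoint. The four patches feed the four branches of the network. The fourth branch is questionable, though: a patch may lack a 3D point in SfM for many reasons, and that does not necessarily mean it contains no keypoint. Doesn't this design introduce label noise?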

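To make the whole thing differentiable end to end, the tasks in each branch are connected as follows:

Given an input image patch $\mathbf{P}$, the Detector outputs a score map $\mathbf{S}$;
A soft argmax over the score map $\mathbf{S}$ gives the keypoint location $\mathbf{x}$ (what if no keypoint exists?);
A Spatial Transformer layer crops a smaller patch $\mathbf{p}$ centered at $\mathbf{x}$, which is the input to the Orientation Estimator;
The Orientation Estimator predicts the orientation $\theta$ of $\mathbf{p}$;
A second Spatial Transformer layer rotates $\mathbf{p}$ by $\theta$ to produce $\mathbf{p}_{\theta}$;
$\mathbf{p}_{\theta}$ is fed to the Descriptor network, which outputs the final feature vector $\mathbf{d}$.

The soft argmax makes argmax differentiable and converts the score map into concrete coordinates; the exact formula is worth checking (it appears in the Detector section).

The Spatial Transformer layers used here have no learnable parameters; they exist only to keep the pipeline differentiable. The parameters they need, $\mathbf{x}$ and $\theta$, have already been computed by the preceding CNNs, so the layers simply Crop and Rot the image patches according to these two values.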
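To make the parameter-free Crop and Rot step more concrete, here is a minimal NumPy sketch. It is not the authors' code: the function name `crop_rot`, the rotation sign convention, the border clamping, and the use of bilinear sampling are all assumptions made for illustration. It samples an $s \times s$ patch centered at $\mathbf{x}$ and rotated by $\theta$; with $\theta = 0$ it reduces to a plain crop.

```python
import numpy as np

def crop_rot(P, center, theta=0.0, s=64):
    """Sample an s x s patch from P around `center` = (x, y), rotated by theta.

    Hypothetical helper: no learnable parameters, only bilinear sampling.
    """
    H, W = P.shape
    half = (s - 1) / 2.0
    v, u = np.mgrid[0:s, 0:s]          # output pixel grid (row, col)
    u = u - half                       # center the grid on zero
    v = v - half
    c, si = np.cos(theta), np.sin(theta)
    # Rotate the output grid, then shift it to the keypoint location.
    xs = center[0] + c * u - si * v
    ys = center[1] + si * u + c * v
    # Clamp to the image so that border samples stay valid (an assumption).
    xs = np.clip(xs, 0, W - 1)
    ys = np.clip(ys, 0, H - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = xs - x0
    wy = ys - y0
    # Bilinear interpolation between the four neighbouring pixels.
    top = (1 - wx) * P[y0, x0] + wx * P[y0, x1]
    bottom = (1 - wx) * P[y1, x0] + wx * P[y1, x1]
    return (1 - wy) * top + wy * bottom

P = np.arange(128 * 128, dtype=float).reshape(128, 128)
p = crop_rot(P, center=(63.5, 63.5), theta=0.0, s=64)
```

Because the output is a smooth function of `center` and `theta`, gradients can flow back to the networks that predicted them, which is the point of using Spatial Transformer layers here.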

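Training the whole network jointly from scratch is hard to converge, so the paper designs a task-specific learning procedure: first learn the Descriptor's parameters, then learn the Orientation Estimator's parameters given the learned Descriptor, and finally learn the Detector's parameters given the other two. Training the parts from back to front in this way lets gradients propagate correctly through the stages that are already trained.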

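Building the training dataset

Two of the 13 datasets provided by 1DSfM are selected: Piccadilly Circus and Roman Forum. They are reconstructed with VisualSFM (based on SIFT keypoints) to obtain 3D points. After reconstruction, the datasets contain:

Piccadilly: 3384 images and $59k$ 3D points, with each 3D point observed in 6.5 images on average;
Roman-Forum: 1658 images and $51k$ 3D points, with each 3D point observed in 5.2 images on average.

Some example images from the datasets:

The images on the left come from Piccadilly, those on the right from Roman-Forum. Keypoints that survive the SfM reconstruction are shown in blue; the rest are in red.

The collected data is split into a training set and a validation set. If a 3D point in the training set is also observed in the validation set, the views that observe it are removed from the validation set; likewise, training views that observe validation 3D points are removed. I still find the exact splitting procedure somewhat unclear.

Positive matching patch pairs are built only from features that survive the SfM reconstruction (these points are robust and therefore suitable as keypoints). To extract patches that contain no keypoints (the input to the fourth branch described above), image regions containing no SIFT features are sampled at random; features that did not survive SfM must not be included either.

Grayscale image patches $\mathbf{P}$ are extracted from the original image according to the scale $\sigma$ of each point. A patch $\mathbf{P}$ covers a $24\sigma \times 24\sigma$ region at the given location and is then standardized to $S \times S$ pixels, where $S = 128$. The smaller patches $\mathbf{p}$ and $\mathbf{p}_{\theta}$, which are the inputs to the Orientation Estimator and the Descriptor, are both of size $s \times s$, where $s = 64$.

The paper then says that the smaller patches correspond to the SIFT descriptor support region size of $12\sigma$, which I do not fully understand.

To prevent overfitting, the patch location is randomly perturbed within a range of 20% ($4.8\sigma$). Finally, the input patches are normalized with the mean and standard deviation of the grayscale values over the whole dataset.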
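The patch geometry above is simple arithmetic, and a small sketch may help. The helper name `patch_window` is hypothetical, and reading the 20% range as a uniform shift of up to $\pm 4.8\sigma$ per axis is an assumption; the text does not say whether the range is one-sided or symmetric.

```python
import numpy as np

S = 128          # side of the standardized large patch P, in pixels
SUPPORT = 24.0   # the support region is 24 * sigma wide
JITTER = 0.2     # perturbation range: 20% of the support region (4.8 * sigma)

def patch_window(x, y, sigma, rng=None):
    """Return (x0, y0, side) of the square region to cut around a keypoint.

    If `rng` is given, the center is jittered uniformly by up to
    +/- 4.8 * sigma on each axis (an assumed reading of the 20% range).
    """
    side = SUPPORT * sigma
    if rng is not None:
        max_shift = JITTER * side
        dx, dy = rng.uniform(-max_shift, max_shift, size=2)
        x, y = x + dx, y + dy
    return x - side / 2.0, y - side / 2.0, side

x0, y0, side = patch_window(100.0, 80.0, sigma=2.0)
resize_factor = S / side   # scale applied to reach S x S pixels
```

For a keypoint with $\sigma = 2$, the region is 48 pixels wide and is upsampled by a factor of $128 / 48$ to reach $S \times S$.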

Descriptor

The paper uses the network architecture from DeepDesc to extract patch descriptors. While the Descriptor is being trained, the Detector and the Orientation Estimator are not trained. The Descriptor's input is the Orientation Estimator's output $\mathbf{p}_{\theta}$, but since the earlier networks are not yet trained and cannot produce it, the keypoint locations and orientations obtained from SfM are used to generate $\mathbf{p}_{\theta}$ as training data for the Descriptor.

Training the Descriptor minimizes the Euclidean distance between matching patch pairs and maximizes it between non-matching pairs. The loss is:

$\mathcal{L}_{\mathrm{desc}} \left( \mathbf{p}_{\theta}^{k}, \mathbf{p}_{\theta}^{l} \right) = \left\{ \begin{array}{ll} \left\| h_{\rho} \left( \mathbf{p}_{\theta}^{k} \right) - h_{\rho} \left( \mathbf{p}_{\theta}^{l} \right) \right\|_{2} & \text{for positive pairs, and} \\ \max \left( 0, C - \left\| h_{\rho} \left( \mathbf{p}_{\theta}^{k} \right) - h_{\rho} \left( \mathbf{p}_{\theta}^{l} \right) \right\|_{2} \right) & \text{for negative pairs} \end{array} \right.$

where $\mathbf{d} = h_{\rho} \left( \mathbf{p}_{\theta} \right)$, $h(.)$ is the Descriptor network, and $\rho$ denotes its CNN parameters.

$C = 4$ is the margin for non-matching pairs: once a negative pair is farther apart than $C$, its loss is zero and it no longer contributes.
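The two cases of $\mathcal{L}_{\mathrm{desc}}$ can be written in a few lines of NumPy. This is only a sketch that operates on descriptor vectors that are already computed; the function name `desc_loss` and the batch layout are assumptions.

```python
import numpy as np

def desc_loss(d_k, d_l, is_positive, margin=4.0):
    """Per-pair descriptor loss.

    d_k, d_l:     (batch, dim) descriptor vectors h_rho(p_theta).
    is_positive:  (batch,) boolean, True for matching pairs.
    margin:       C in the formula (C = 4).
    """
    dist = np.linalg.norm(d_k - d_l, axis=-1)
    hinge = np.maximum(0.0, margin - dist)
    # Positive pairs are pulled together; negative pairs are pushed
    # apart only while they are closer than the margin.
    return np.where(is_positive, dist, hinge)

d_k = np.zeros((3, 2))
d_l = np.array([[3.0, 4.0], [3.0, 4.0], [1.0, 0.0]])
losses = desc_loss(d_k, d_l, np.array([True, False, False]))
```

In this toy batch the first pair is positive with distance 5, the second is a negative pair already beyond the margin, and the third is a negative pair at distance 1 that is still penalized.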

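The Descriptor is trained with hard mining, the same strategy used in DeepDesc, where it proved important for the final descriptor performance. With this strategy, $K_f$ pairs are fed forward and only the $K_b$ pairs with the highest loss are back-propagated; $r = K_f / K_b \geq 1$ is the mining ratio. In DeepDesc, the network was pre-trained without mining and fine-tuned with $r = 8$. This paper uses an incremental mining strategy instead: training starts with $r = 1$, and $r$ doubles every 5000 batches. Each batch contains 128 positive pairs and 128 negative pairs.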
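A minimal sketch of the mining schedule and selection step follows. The names `mining_ratio` and `mined_loss` are hypothetical, and two details are assumptions because the text does not specify them: the ratio is left uncapped, and the selected losses are averaged rather than summed.

```python
import numpy as np

def mining_ratio(batch_index, doubling_period=5000):
    """r starts at 1 and doubles every `doubling_period` batches (no cap assumed)."""
    return 2 ** (batch_index // doubling_period)

def mined_loss(losses, r):
    """Keep only the K_b = K_f / r hardest pairs and average their loss."""
    k_f = len(losses)
    k_b = max(1, k_f // r)
    hardest = np.sort(losses)[-k_b:]   # the K_b largest losses
    return hardest.mean()

pair_losses = np.array([1.0, 5.0, 2.0, 8.0])
loss = mined_loss(pair_losses, r=2)
```

In a real training loop only the selected pairs would receive gradients; here the selection is shown on plain numbers.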

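Orientation Estimator

The approach to orientation estimation is similar to Learn Orientation. That method, however, needs to pre-compute descriptor vectors for multiple orientations and then compute the Jacobian with respect to the orientation. (What exactly does the Jacobian mean here?) In this paper the Detector's output is not handled explicitly but implicitly, as part of the whole pipeline, so pre-computing descriptor vectors is not possible.

For this reason, the paper uses Spatial Transformers to learn the orientation. The region located by the Detector gives the patch $\mathbf{p}$, and the Orientation Estimator then predicts an orientation: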

$\theta = g _ { \phi } ( \mathbf { p } )$

where $g(.)$ is the Orientation Estimator network and $\phi$ denotes its CNN parameters.

Given the original image patch $\mathbf{P}$, the Detector's output location $\mathbf{x}$, and the estimated orientation $\theta$, the second Spatial Transformer layer $\operatorname{Rot}(.)$ produces the Descriptor's input $\mathbf{p}_{\theta} = \operatorname{Rot}(\mathbf{P}, \mathbf{x}, \theta)$.

The Orientation Estimator can therefore be trained by minimizing the distance between the descriptors of the same 3D point seen from different views, so the loss is ultimately computed on the Descriptor's output. During this stage the parameters of the already-trained Descriptor are kept fixed, and the Detector is still not trained; the keypoint locations from SfM are used instead. The loss is:

$\mathcal { L } _ { \text { orientation } } \left( \mathbf { P } ^ { 1 } , \mathbf { x } ^ { 1 } , \mathbf { P } ^ { 2 } , \mathbf { x } ^ { 2 } \right) = \left\| h _ { \rho } \left( G \left( \mathbf { P } ^ { 1 } , \mathbf { x } ^ { 1 } \right) \right) - h _ { \rho } \left( G \left( \mathbf { P } ^ { 2 } , \mathbf { x } ^ { 2 } \right) \right) \right\| _ { 2 }$

Put simply, it minimizes the Euclidean distance between the descriptors of matching pairs.

Here $G(\mathbf{P}, \mathbf{x}) = \operatorname{Rot} \left( \mathbf{P}, \mathbf{x}, g_{\phi} ( \operatorname{Crop} ( \mathbf{P}, \mathbf{x} ) ) \right)$ denotes the crop and rotate operations described earlier.

$\left( \mathbf{P}^{1}, \mathbf{P}^{2} \right)$ are the image patches corresponding to projections of the same 3D point, and $\mathbf{x}^{1}$ and $\mathbf{x}^{2}$ are the respective projection locations.

Detector

The Detector takes an image patch and returns a score map. As in TILDE, a convolution layer is followed by a piecewise-linear activation function:

$\mathbf{S} = f_{\mu}(\mathbf{P}) = \sum_{n=1}^{N} \delta_{n} \max_{m=1}^{M} \left( \mathbf{W}_{mn} * \mathbf{P} + \mathbf{b}_{mn} \right)$

where $f_{\mu}(\mathbf{P})$ is the Detector network and $\mu$ denotes its CNN parameters.

$\delta_{n} = \left\{ \begin{array}{ll} +1 & \text{if } n \text{ is odd, and} \\ -1 & \text{otherwise} \end{array} \right.$

$N$ and $M$ are hyperparameters that control the complexity of the piecewise-linear activation function.

I do not fully understand this formula.
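One way to read the formula: the convolution layer produces $N \times M$ response maps, each group of $M$ maps is reduced with a pixel-wise max, and the $N$ resulting maps are summed with alternating signs. The NumPy sketch below follows that reading. The function name, the array layout, and the use of "valid" cross-correlation in place of true convolution are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def detector_score_map(P, filters, biases):
    """Piecewise-linear detector response.

    P:        (H, W) image patch.
    filters:  (N, M, k, k) filter bank W_mn.
    biases:   (N, M) biases b_mn.
    Returns a (H - k + 1, W - k + 1) score map S.
    """
    N, M, k, _ = filters.shape
    windows = sliding_window_view(P, (k, k))   # (H', W', k, k)
    # responses[n, m] = W_mn * P + b_mn, computed as cross-correlation.
    responses = np.einsum('ijkl,nmkl->nmij', windows, filters)
    responses = responses + biases[:, :, None, None]
    # delta_n = +1 for odd n, -1 for even n (n counted from 1).
    delta = np.where(np.arange(1, N + 1) % 2 == 1, 1.0, -1.0)
    # Max over m inside each group, then signed sum over n.
    return np.einsum('n,nij->ij', delta, responses.max(axis=1))

P = np.ones((5, 5))
filters = np.zeros((2, 2, 3, 3))
filters[0, 0] = 1.0 / 9.0   # group n=1 responds with 1 on a constant patch
filters[1, 0] = 2.0 / 9.0   # group n=2 responds with 2
S = detector_score_map(P, filters, np.zeros((2, 2)))
```

Each max over $m$ is a convex piecewise-linear function of the input, and the signed sum over $n$ gives a difference of such functions, which is why $N$ and $M$ control how complex the activation can be.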

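The main difference from TILDE is that this paper represents the location implicitly through the maximum of the score map, rather than directly regressing the fixed keypoint locations obtained from SfM as TILDE does. The experiments found that regressing the location directly degrades performance.

Given the score map $\mathbf{S}$, the keypoint location is: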

$\mathbf { x } = \text { softargmax } ( \mathbf { S } )$

The softargmax function essentially computes the center of mass of the score map:

$\operatorname { softargmax } ( \mathbf { S } ) = \frac { \sum _ { \mathbf { y } } \exp ( \beta \mathbf { S } ( \mathbf { y } ) ) \mathbf { y } } { \sum _ { \mathbf { y } } \exp ( \beta \mathbf { S } ( \mathbf { y } ) ) }$

where $\mathbf{y}$ ranges over the locations in the score map $\mathbf{S}$, and $\beta = 10$ is a hyperparameter that controls the smoothness of softargmax. softargmax can also be seen as a differentiable version of non-maximum suppression (NMS). $\mathbf{x}$ and the image patch $\mathbf{P}$ are passed to the first Spatial Transformer layer $\operatorname{Crop}(.)$, and $\mathbf{p} = \operatorname{Crop}(\mathbf{P}, \mathbf{x})$ becomes the input to the Orientation Estimator.
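The softargmax formula translates almost directly into NumPy. The sketch below is an illustration, not the authors' implementation; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged, and returning the location as $(x, y)$ is an assumed convention.

```python
import numpy as np

def softargmax(S, beta=10.0):
    """Center of mass of exp(beta * S), returned as (x, y)."""
    H, W = S.shape
    weights = np.exp(beta * (S - S.max()))   # shift for numerical stability
    weights /= weights.sum()
    ys, xs = np.mgrid[0:H, 0:W]
    return np.array([(weights * xs).sum(), (weights * ys).sum()])

S = np.zeros((5, 5))
S[1, 3] = 5.0                        # a single strong peak at x=3, y=1
peak_xy = softargmax(S)
flat_xy = softargmax(np.zeros((5, 5)))
```

Note that the function always returns a location. With a clear peak it lands close to the argmax, while on a completely flat score map it falls back to the patch center, so deciding whether a keypoint exists at all has to come from elsewhere.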

Since the Orientation Estimator and the Descriptor have already been trained, their parameters can be fixed while the Detector is trained through the whole pipeline. As mentioned above, the input is a training quadruplet $\left( \mathbf{P}^{1}, \mathbf{P}^{2}, \mathbf{P}^{3}, \mathbf{P}^{4} \right)$, and the sum of two losses is minimized:

$\mathcal { L } _ { \text { detector } } \left( \mathbf { P } ^ { 1 } , \mathbf { P } ^ { 2 } , \mathbf { P } ^ { 3 } , \mathbf { P } ^ { 4 } \right) = \gamma \mathcal { L } _ { c l a s s } \left( \mathbf { P } ^ { 1 } , \mathbf { P } ^ { 2 } , \mathbf { P } ^ { 3 } , \mathbf { P } ^ { 4 } \right) + \mathcal { L } _ { p a i r } \left( \mathbf { P } ^ { 1 } , \mathbf { P } ^ { 2 } \right)$

where $\gamma$ is a hyperparameter that balances the two losses.

First, each input image patch has to be classified according to whether it contains a keypoint:

$\mathcal{L}_{\mathrm{class}} \left( \mathbf{P}^{1}, \mathbf{P}^{2}, \mathbf{P}^{3}, \mathbf{P}^{4} \right) = \sum_{i=1}^{4} \alpha_{i} \max \left( 0, \left( 1 - \operatorname{softmax} \left( f_{\mu} \left( \mathbf{P}^{i} \right) \right) y_{i} \right) \right)^{2}$

where $\left\{ \begin{array}{ll} y_{i} = -1 \text{ and } \alpha_{i} = 3/6 & \text{for } i = 4, \text{ and} \\ y_{i} = +1 \text{ and } \alpha_{i} = 1/6 & \text{otherwise} \end{array} \right.$ distinguish positive from negative samples, that is, whether the patch is a keypoint patch, and balance their weights. The classification here should be binary (the patch either contains a keypoint or it does not), but the input is a score map, so how is a two-class softmax computed on it?
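The structure of $\mathcal{L}_{\mathrm{class}}$ can be sketched once the score map has been reduced to a scalar. How that reduction works is exactly the open question above, so the sketch makes a loudly labeled assumption: it uses a log-mean-exp pooling as the "softmax", which is one common smooth approximation of the maximum. The names `smooth_max` and `class_loss` and the value of `beta` are hypothetical.

```python
import numpy as np

def smooth_max(S, beta=10.0):
    """ASSUMPTION: log-mean-exp pooling of a score map into one scalar."""
    m = S.max()
    return m + np.log(np.mean(np.exp(beta * (S - m)))) / beta

def class_loss(score_maps):
    """Squared hinge loss over a quadruplet of score maps.

    score_maps: four score maps f_mu(P^1), ..., f_mu(P^4).
    The first three patches are positives (y = +1, alpha = 1/6),
    the fourth is the negative (y = -1, alpha = 3/6).
    """
    y = np.array([1.0, 1.0, 1.0, -1.0])
    alpha = np.array([1 / 6, 1 / 6, 1 / 6, 3 / 6])
    s = np.array([smooth_max(S) for S in score_maps])
    return float(np.sum(alpha * np.maximum(0.0, 1.0 - s * y) ** 2))

confident = [np.full((4, 4), 2.0)] * 3 + [np.full((4, 4), -2.0)]
undecided = [np.zeros((4, 4))] * 4
loss_confident = class_loss(confident)
loss_undecided = class_loss(undecided)
```

The weights $\alpha_i$ sum to one, with the single negative patch weighted as much as the three positives together.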

Next, the keypoint location has to be determined. The assumption is that the locations learned for a matching patch pair should make the descriptors computed at those locations as close as possible. This could have a side effect: the location with the closest descriptor is not necessarily the correct matching location. The formula is:
$\mathcal{L}_{\mathrm{pair}} \left( \mathbf{P}^{1}, \mathbf{P}^{2} \right) = \left\| h_{\rho} \left( G \left( \mathbf{P}^{1}, \operatorname{softargmax} \left( f_{\mu} \left( \mathbf{P}^{1} \right) \right) \right) \right) - h_{\rho} \left( G \left( \mathbf{P}^{2}, \operatorname{softargmax} \left( f_{\mu} \left( \mathbf{P}^{2} \right) \right) \right) \right) \right\|_{2}$

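Here all three components work together to train the Detector, and the Descriptor's mining ratio is set to $r = 8$.

The paper also notes that the Descriptor has already learned some invariance during its own training (to translation, that is, to location), which makes it hard for the Detector to learn anything further from this loss. To push the Detector toward the right regions, pre-training constrains the patches cropped at the learned locations of a matching pair to overlap completely (does this mean the ground-truth locations are actually identical within the image patches?). The constraint is lifted in the subsequent training.

During pre-training, the pair loss is replaced with: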

$\tilde { \mathcal { L } } _ { \mathrm { pair } } \left( \mathbf { P } ^ { 1 } , \mathbf { P } ^ { 2 } \right) = 1 - \frac { \mathbf { p } ^ { 1 } \cap \mathbf { p } ^ { 2 } } { \mathbf { p } ^ { 1 } \cup \mathbf { p } ^ { 2 } } + \frac { \max \left( 0 , \left\| \mathbf { x } ^ { 1 } - \mathbf { x } ^ { 2 } \right\| _ { 1 } - 2 s \right) } { \sqrt { \mathbf { p } ^ { 1 } \cup \mathbf { p } ^ { 2 } } }$

$\tilde{\mathcal{L}}_{\mathrm{pair}} = 0$ means that the two patches overlap completely.

Here $\mathbf{x}^{j} = \operatorname{softargmax} \left( f_{\mu} \left( \mathbf{P}^{j} \right) \right)$, $\mathbf{p}^{j} = \operatorname{Crop} \left( \mathbf{P}^{j}, \mathbf{x}^{j} \right)$, and $\| \cdot \|_{1}$ is the $\ell_{1}$ norm.

$s = 64$ is the width and height of $\mathbf{p}$.
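The pre-training loss is an intersection-over-union term plus a distance penalty that only kicks in when the two patches are far apart. A minimal sketch, assuming both crops are axis-aligned $s \times s$ squares whose centers are expressed in a common coordinate frame (the function name is hypothetical):

```python
import numpy as np

def pretrain_pair_loss(x1, x2, s=64):
    """1 - IoU of two s x s squares centered at x1 and x2, plus a far penalty."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    offset = np.abs(x1 - x2)
    overlap = np.maximum(0.0, s - offset)        # overlap length per axis
    inter = overlap[0] * overlap[1]
    union = 2.0 * s * s - inter
    # The second term is non-zero only when the L1 distance exceeds 2s,
    # where the IoU is already zero and would give no gradient.
    far_penalty = max(0.0, offset.sum() - 2 * s) / np.sqrt(union)
    return 1.0 - inter / union + far_penalty

same = pretrain_pair_loss((10.0, 10.0), (10.0, 10.0))
half = pretrain_pair_loss((0.0, 0.0), (32.0, 0.0))
far = pretrain_pair_loss((0.0, 0.0), (200.0, 0.0))
```

Identical locations give zero loss, a 32-pixel shift along one axis gives $1 - 2048/6144 = 2/3$, and patches that do not overlap at all are still pulled together by the second term.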

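Pipeline

The overall runtime pipeline is shown in the figure below:

Although the method is trained on image patches, the test-time input is a whole image. Patches could be selected by running a sliding window over the whole image, but that would be too expensive. Fortunately, the Orientation Estimator and the Descriptor only need to run at local maxima, not on every window. The Detector is therefore taken out and run on the whole image on its own (the red box in the figure above), at multiple scales. Merging the resulting score maps back onto the original image gives a score pyramid, and NMS (the same as used in SIFT) replaces the network's softargmax to obtain the final keypoint locations. The rest of the pipeline is the same as in training.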
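As a rough illustration of the test-time selection step, the sketch below picks strict local maxima of a single 2D score map in a 3x3 neighborhood. This is a simplification and an assumption: the post says the NMS matches the one in SIFT and runs over a multi-scale score pyramid, which this single-scale sketch does not cover. The function name and the threshold parameter are hypothetical.

```python
import numpy as np

def local_maxima(score, threshold=0.0):
    """Return (x, y) of pixels strictly greater than all 8 neighbours."""
    score = np.asarray(score, dtype=float)
    H, W = score.shape
    # Pad with -inf so that border pixels can still be maxima.
    padded = np.pad(score, 1, mode='constant', constant_values=-np.inf)
    is_max = score > threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
            is_max &= score > neighbour
    ys, xs = np.nonzero(is_max)
    return list(zip(xs.tolist(), ys.tolist()))

score = np.zeros((5, 5))
score[1, 1] = 3.0
score[3, 3] = 2.0
keypoints = local_maxima(score)
```

Only the surviving locations would then be cropped and passed to the Orientation Estimator and the Descriptor, which is what keeps the test-time cost manageable.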

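Experiments

Datasets and experimental setup

Three standard datasets are used:

Strecha dataset: 19 images of 2 scenes with viewpoint changes.
DTU dataset: 60 sequences of 60 objects, with viewpoint and illumination changes. It is used to evaluate the method under different viewpoints.
Webcam dataset: 710 images of 6 scenes with illumination changes (same viewpoint). It is used to evaluate the method under different illumination.

For Strecha and DTU, the ground truth provided by the dataset authors is used to establish correspondences. At most 1000 keypoints are used per image, and evaluation follows the protocol proposed in "A performance evaluation of local descriptors", with the following metrics:

Repeatability (Rep.): measures how repeatable the keypoints are, expressed as a ratio. It mainly evaluates the Detector: the fraction of keypoints that are found again within the ground-truth region.
Nearest Neighbor mean Average Precision (NN mAP): mainly evaluates how discriminative the Descriptor is: the Area Under Curve (AUC) of the Precision-Recall curve obtained at different descriptor distance thresholds, using the NN matching strategy.
Matching Score (M. Score): measures the performance of the whole pipeline: the fraction of ground-truth correspondences that are recovered.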

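Qualitative comparison

A comparison with SIFT is shown in the figure below:

The left column shows SIFT results and the right column shows LIFT results. From top to bottom, the test data come from Strecha, Webcam, DTU scene 7, and DTU scene 19. LIFT obtains more correct matches.

Training used the Piccadilly dataset, which differs considerably from the test sets, yet the results are still good, which suggests strong generalization.

Quantitative evaluation of the whole pipeline

The figure below compares the average matching score on the three test datasets:

LIFT (pic) was trained on Piccadilly, and LIFT (rf) was trained on Roman-Forum.

The results also show that SIFT outperforms deep learning methods such as VGG, DeepDesc, and PN-Net.

Moreover, a method whose individual part (for example the Detector or the Descriptor) performs well does not necessarily keep that advantage when the whole pipeline is evaluated. This suggests that the whole pipeline should be learned together, as this paper does, and that evaluation should also consider the pipeline as a whole.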

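Performance of individual components

Fine-tuning the Detector

The Detector is first pre-trained and then fine-tuned, as discussed above. The two stages are compared in the figure below:

The test is run on the Strecha dataset. Pre-training alone already performs fairly well, and fine-tuning improves performance somewhat further.

Training on Piccadilly also performs somewhat better than training on Roman-Forum. The main reason is that Roman-Forum does not have many non-keypoint regions, so the Detector lacks negative samples and overfits easily during training.

Performance of each component

The components are compared in the figure below:

The test is run on the Strecha dataset.

Replacing any part of SIFT with the corresponding part of LIFT brings an improvement. In particular, replacing SIFT's detection with the LIFT Detector improves not only Rep. but also NN mAP and M. Score, which indicates that the method not only locates keypoints correctly but also finds locations that are better suited to descriptor matching. It also indicates that training the whole pipeline together is the best option.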

This article is also published on the blog "无比机智的永哥" (CSDN).