Excellent! HKU, Tongji, and UC Berkeley Propose Sparse R-CNN: A New Paradigm for Object Detection

Author: Peize Sun

Reposted from Zhihu with permission; please do not repost further.

https://zhuanlan.zhihu.com/p/310058362

This article introduces one of our recent works:

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

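Following the Dense and Dense-to-Sparse frameworks in object detection, Sparse R-CNN establishes a purely Sparse framework. It does away with concepts such as anchor boxes, reference points, and the Region Proposal Network (RPN), and needs no Non-Maximum Suppression (NMS) post-processing. On the standard COCO benchmark, a single ResNet-50 FPN model trained with the standard 3x schedule reaches 44.5 AP at 22 FPS.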

Paper: https://msc.berkeley.edu/research/autonomous-vehicle/sparse_rcnn.pdf
Project: https://github.com/PeizeSun/SparseR-CNN

01 Motivation

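Let us first briefly review the two mainstream families of object detection methods.

The first family is the dense detector, widely used since the pre-deep-learning era, for example DPM, YOLO, RetinaNet, and FCOS. In a dense detector, a large number of object candidates such as sliding windows, anchor boxes, or reference points are preset on the image grid or the feature-map grid, and the detector directly predicts the scaling/offset from each candidate to the ground truth (gt) together with the object category.
The second family is the dense-to-sparse detector, for example the R-CNN family. These methods predict regression and classification for a sparse set of candidates, and that sparse set is itself produced by a dense detector.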
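To make the scale of "dense" concrete, here is a minimal sketch that counts the preset candidates of an anchor-based detector on a feature pyramid. The image size, the strides, and the nine anchors per location are assumed values chosen for illustration and are not taken from the article.

```python
import math


def count_dense_candidates(height, width, strides, anchors_per_location):
    """Total number of preset candidates summed over all pyramid levels."""
    total = 0
    for stride in strides:
        rows = math.ceil(height / stride)  # feature-map height at this level
        cols = math.ceil(width / stride)   # feature-map width at this level
        total += rows * cols * anchors_per_location
    return total


# Assumed setup: an 800 x 1333 input, five pyramid levels, 9 anchors per cell.
dense = count_dense_candidates(
    800, 1333, strides=(8, 16, 32, 64, 128), anchors_per_location=9
)
print(dense)
```

Under these assumptions the count is on the order of two hundred thousand candidates per image, every one of which has to be scored.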

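These two kinds of frameworks have driven academic research and industrial applications across the whole field. Object detection may look saturated, yet some inherent limitations of the dense property remain unsatisfying:

NMS post-processing
many-to-one positive/negative sample assignment
the design of prior candidates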
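For readers who have not implemented it, the NMS post-processing step can be sketched as a greedy loop. This is a generic textbook version with an assumed IoU threshold of 0.5, not code from any particular detector.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest score down."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

The second box overlaps the first heavily and is suppressed. The threshold is a hand-tuned hyperparameter, which is part of why a detector that needs no such step is attractive.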

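A natural direction, then, is to ask: can we design a purely sparse framework? Recently, DETR offered one sparse design.

Its candidates are a sparse set of learnable object queries, its positive/negative sample assignment is a one-to-one optimal bipartite matching, and it outputs the final detections directly without NMS.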
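The one-to-one optimal bipartite matching can be illustrated with a brute-force sketch that tries every assignment of predictions to ground-truth boxes. The cost used here (L1 box distance minus the predicted probability of the true class) is an assumption made for illustration; the article does not spell out the cost terms, and the function names are hypothetical.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes with the same coordinate layout."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match(pred_boxes, pred_probs, gt_boxes, gt_labels):
    """Return a tuple whose entry j is the prediction index assigned to gt j.

    Assumes there are at least as many predictions as ground-truth boxes.
    """
    def cost(i, j):
        # Assumed cost: box distance minus confidence in the correct class.
        return l1(pred_boxes[i], gt_boxes[j]) - pred_probs[i][gt_labels[j]]

    return min(
        permutations(range(len(pred_boxes)), len(gt_boxes)),
        key=lambda assign: sum(cost(i, j) for j, i in enumerate(assign)),
    )


pred_boxes = [(0.1, 0.1, 0.3, 0.3), (0.5, 0.5, 0.9, 0.9), (0.12, 0.1, 0.3, 0.32)]
pred_probs = [[0.2, 0.8], [0.9, 0.1], [0.6, 0.4]]  # per-class probabilities
gt_boxes = [(0.5, 0.5, 0.9, 0.9), (0.1, 0.1, 0.3, 0.3)]
gt_labels = [0, 1]
assignment = match(pred_boxes, pred_probs, gt_boxes, gt_labels)
print(assignment)
```

Each ground-truth box receives exactly one prediction, and unmatched predictions are treated as background. Enumerating permutations only works for toy sizes; practical code typically uses the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.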

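However, every object query in DETR interacts with the global feature map through attention, which is still dense in essence.

We believe a sparse detection framework should be sparse in two respects: sparse candidates and sparse feature interaction. On this basis we propose Sparse R-CNN.

Sparse R-CNN discards dense concepts such as anchor boxes and reference points and starts directly from a sparse set of learnable proposals, with no NMS post-processing. The whole network is exceptionally clean and simple and can be viewed as a new detection paradigm.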

02 Sparse R-CNN

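The object candidates of Sparse R-CNN are a set of learnable parameters of shape N*4, where N is the number of object candidates, typically 100 to 300, and 4 stands for the four boundaries of a box. These parameters are trained and optimized together with the other parameters of the network.

That's it: there is none of the enumeration of thousands upon thousands of candidates found in dense detectors. This sparse set of object candidates serves as the proposal boxes used to extract Region of Interest (RoI) features and to predict regression and classification.

The learned proposal boxes can be understood as statistics of the locations where objects are likely to appear in images. RoI features extracted from such a coarse representation are clearly insufficient to localize and classify objects precisely.

We therefore introduce a feature-level candidate, the proposal features, which is also a set of learnable parameters, of shape N*d. N is the number of object candidates, in one-to-one correspondence with the proposal boxes, and d is the feature dimension, typically 256.

Each proposal feature interacts one-to-one with the RoI feature extracted by its proposal box, which makes the RoI feature more useful for localizing and classifying the object.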
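The one-to-one interaction can be sketched as follows. The text above only states that proposal feature k interacts with RoI feature k; the concrete form used here (a shared linear generator that turns each proposal feature into a d x d kernel, applied at every spatial position and followed by ReLU) is an assumption for illustration, and all names and the tiny sizes are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def make_kernel(proposal_feature, generator):
    """Map one d-vector to a d x d kernel via an assumed (d*d) x d generator."""
    d = len(proposal_feature)
    flat = matvec(generator, proposal_feature)
    return [flat[r * d:(r + 1) * d] for r in range(d)]


def interact(roi_feature, proposal_feature, generator):
    """Filter every position of ONE RoI feature with ITS proposal's kernel."""
    kernel = make_kernel(proposal_feature, generator)
    return [[max(0.0, x) for x in matvec(kernel, position)]
            for position in roi_feature]


random.seed(0)
N, S, d = 3, 4, 2  # proposals, spatial positions per RoI, feature dimension
generator = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d * d)]
proposal_features = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
roi_features = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(S)]
                for _ in range(N)]

# Proposal k only ever touches RoI k: there is no global feature interaction.
outputs = [interact(roi, prop, generator)
           for roi, prop in zip(roi_features, proposal_features)]
```

The point of the sketch is the pairing in the last statement: the cost grows with N rather than with the size of the feature map, in contrast to attention over all spatial positions.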

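In contrast to the original 2-fc head, we call our design the Dynamic Instance Interactive Head.

The two distinguishing characteristics of Sparse R-CNN are sparse object candidates and sparse feature interaction: there are neither thousands of dense candidates nor any dense global feature interaction. Sparse R-CNN can be seen as extending object detection frameworks along the direction from dense to dense-to-sparse to sparse.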

Architecture Design

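The network design of Sparse R-CNN is modeled on the R-CNN family.

The backbone is a ResNet-based FPN.
The head is a stack of iterative Dynamic Instance Interactive Heads: the output features and output boxes of one head serve as the proposal features and proposal boxes of the next. Proposal features go through self-attention before interacting with the RoI features.
The training loss is a set prediction loss based on optimal bipartite matching.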
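The iterative structure of the head can be sketched as a simple loop in which each stage consumes the boxes and features produced by the previous one. The RoI extractor and the head below are toy stand-ins with hypothetical names, written only so the data flow can run end to end; the self-attention step and the real interaction are omitted.

```python
def run_iterative_heads(heads, proposal_boxes, proposal_features, extract_roi):
    """Chain heads: each stage's outputs become the next stage's proposals."""
    boxes, features = proposal_boxes, proposal_features
    for head in heads:
        roi_features = [extract_roi(box) for box in boxes]
        boxes, features = head(roi_features, boxes, features)
    return boxes, features


def toy_extract_roi(box):
    """Stand-in for RoI pooling: pretend the RoI feature is the box center."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2]


def toy_head(roi_features, boxes, features):
    """Stand-in head: add the RoI feature to the proposal feature, shrink the box."""
    new_features = [[f + r for f, r in zip(feat, roi)]
                    for feat, roi in zip(features, roi_features)]
    new_boxes = [(x1 + 1.0, y1 + 1.0, x2 - 1.0, y2 - 1.0)
                 for x1, y1, x2, y2 in boxes]
    return new_boxes, new_features


final_boxes, final_features = run_iterative_heads(
    heads=[toy_head] * 3,
    proposal_boxes=[(0.0, 0.0, 10.0, 10.0)],
    proposal_features=[[0.0, 0.0]],
    extract_roi=toy_extract_roi,
)
print(final_boxes, final_features)
```

Because every stage re-extracts RoI features from the refined boxes, a coarse initial proposal can be corrected step by step.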

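Starting from Faster R-CNN (40.2 AP), directly replacing the RPN with a sparse set of learnable proposal boxes drops AP to 18.5; introducing the iterative structure raises AP to 32.2; introducing dynamic instance interaction finally raises it to 42.3 AP.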

Performance

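We follow the 3x training schedule of Detectron2 and therefore compare Sparse R-CNN with the detectors in Detectron2 (many methods do not report 3x results, so they are not listed).

We also list the performance of DETR and Deformable DETR, which likewise need no NMS post-processing. Sparse R-CNN shows quite competitive performance in detection accuracy, inference time, and training convergence speed.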

Conclusion

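For some time after R-CNN and Fast R-CNN appeared, an important research direction in object detection was to propose more efficient region proposal generators. Faster R-CNN and RPN, the standouts among them, have shown broad and lasting influence.

Sparse R-CNN shows for the first time that a simple set of learnable parameters used as proposal boxes can achieve comparable performance. We hope our work offers some inspiration for end-to-end object detection.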

This article was shared from the WeChat official account 我爱计算机视觉 (aicvml).