You Only Look Once: Unified, Real-Time Object Detection (YOLO Paper Translation)


Original English paper: https://arxiv.org/pdf/1506.02640.pdf


You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

  • University of Washington
  • Allen Institute for AI
  • Facebook AI Research

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on art-work, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and realtime speeds while maintaining high average precision.

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
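
In code, the "responsible cell" rule amounts to flooring the normalized center coordinates; a hypothetical Python helper (not from the paper's Darknet implementation):

```python
def responsible_cell(x_center, y_center, S=7):
    """Map a box center (x, y normalized to [0, 1] by image size) to the
    grid cell responsible for detecting that object."""
    col = min(int(x_center * S), S - 1)  # clamp centers that lie exactly on the far edge
    row = min(int(y_center * S), S - 1)
    return row, col
```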

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. Formally we define confidence as $Pr(Object) \times IOU^{truth}_{pred}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
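
For reference, IOU between two boxes can be computed as follows; this is a standard sketch (boxes as corner coordinates), not code from the paper:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```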

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each grid cell also predicts C conditional class probabilities, $Pr(Class_i|Object)$. These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$Pr(Class_i|Object) * Pr(Object) * IOU^{truth}_{pred} = Pr(Class_i) * IOU^{truth}_{pred} \tag{1}$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
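
To make the tensor layout concrete, here is a minimal NumPy sketch that splits a 7 × 7 × 30 prediction and applies Eq. 1; the channel ordering (B boxes of [x, y, w, h, confidence] followed by C class probabilities) is an assumption about the layout, not something the paper fixes:

```python
import numpy as np

S, B, C = 7, 2, 20  # PASCAL VOC settings from the paper

def class_scores(pred):
    """Combine per-box confidence with conditional class probabilities (Eq. 1)."""
    boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # x, y, w, h, confidence
    class_probs = pred[..., B * 5:]                 # Pr(Class_i | Object), (S, S, C)
    confidence = boxes[..., 4]                      # Pr(Object) * IOU,     (S, S, B)
    # class-specific confidence per box, shape (S, S, B, C)
    return confidence[..., :, None] * class_probs[..., None, :]

scores = class_scores(np.random.rand(S, S, B * 5 + C))  # -> shape (7, 7, 2, 20)
```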

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

The final output of our network is the 7 × 7 × 30 tensor of predictions.

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
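
A hypothetical helper illustrating this target parameterization for one ground-truth box (names and rounding details are assumptions, not the paper's code):

```python
def encode_box(x, y, w, h, S=7):
    """Encode a ground-truth box (center x, y and width w, height h, all
    already normalized by image size) into YOLO's targets: (x, y) become
    offsets within the responsible grid cell, (w, h) stay image-relative."""
    col = min(int(x * S), S - 1)
    row = min(int(y * S), S - 1)
    x_offset = x * S - col  # in [0, 1): position of the center inside the cell
    y_offset = y * S - row
    return row, col, (x_offset, y_offset, w, h)
```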

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x)= \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \tag{2}$$
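
In code, Eq. 2 is a one-liner; a NumPy sketch:

```python
import numpy as np

def leaky_relu(x):
    """Eq. 2: identity for positive inputs, slope 0.1 otherwise."""
    return np.where(x > 0, x, 0.1 * x)
```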

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$, to accomplish this. We set $\lambda_{coord} = 5$ and $\lambda_{noobj} = .5$.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following, multi-part loss function:

$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned} \tag{3}
$$

where $\mathbb{1}_i^{obj}$ denotes if an object appears in cell $i$ and $\mathbb{1}_{ij}^{obj}$ denotes that the $j$th bounding box predictor in cell $i$ is “responsible” for that prediction.

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
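
A simplified NumPy sketch of this loss, assuming the responsibility masks have already been computed from the IOUs; the shapes and variable names are assumptions, not the paper's implementation:

```python
import numpy as np

def yolo_loss(pred_boxes, pred_conf, pred_class,
              true_boxes, true_conf, true_class,
              resp_mask, obj_cell_mask,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Assumed shapes: pred_boxes/true_boxes (S, S, B, 4) holding (x, y, w, h);
    pred_conf/true_conf (S, S, B), with true_conf = 0 where no object;
    pred_class/true_class (S, S, C);
    resp_mask (S, S, B): 1 where predictor j in cell i is responsible;
    obj_cell_mask (S, S): 1 where the cell contains an object."""
    noobj_mask = 1.0 - resp_mask

    # localization: center offsets, plus sqrt of width/height so small
    # deviations in large boxes cost less than in small boxes
    xy_err = np.sum(resp_mask[..., None] * (pred_boxes[..., :2] - true_boxes[..., :2]) ** 2)
    wh_err = np.sum(resp_mask[..., None] *
                    (np.sqrt(pred_boxes[..., 2:]) - np.sqrt(true_boxes[..., 2:])) ** 2)

    # confidence: responsible predictors regress to the true IOU; all other
    # predictors are pushed toward zero, down-weighted by lambda_noobj
    conf_obj = np.sum(resp_mask * (pred_conf - true_conf) ** 2)
    conf_noobj = np.sum(noobj_mask * (pred_conf - true_conf) ** 2)

    # classification: only for cells that contain an object
    class_err = np.sum(obj_cell_mask[..., None] * (pred_class - true_class) ** 2)

    return (lambda_coord * (xy_err + wh_err)
            + conf_obj + lambda_noobj * conf_noobj
            + class_err)
```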

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.
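
As a step function of the epoch number, the schedule could be sketched as follows; the shape of the initial ramp is an assumption (the paper only says the rate is raised slowly):

```python
def learning_rate(epoch, warmup_epochs=1):
    """Paper's schedule: warm up 1e-3 -> 1e-2, then 1e-2 for 75 epochs,
    1e-3 for 30 epochs, and 1e-4 for the final 30 epochs."""
    if epoch < warmup_epochs:  # linear warm-up to avoid divergence
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:
        return 1e-3
    return 1e-4
```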

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
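
A rough sketch of these augmentations using OpenCV; the exact sampling distributions are assumptions, and the matching adjustment of the ground-truth boxes is omitted for brevity:

```python
import numpy as np
import cv2

def augment(image, jitter=0.2, hsv_factor=1.5):
    """Random scaling/translation up to `jitter` of the image size, plus
    random exposure (V) and saturation (S) scaling in HSV space.
    Ground-truth boxes must be transformed the same way (not shown)."""
    h, w = image.shape[:2]
    scale = np.random.uniform(1 - jitter, 1 + jitter)
    tx = np.random.uniform(-jitter, jitter) * w
    ty = np.random.uniform(-jitter, jitter) * h
    M = np.float32([[scale, 0, tx], [0, scale, ty]])
    image = cv2.warpAffine(image, M, (w, h))

    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    for c in (1, 2):  # saturation and value (exposure) channels
        f = np.random.uniform(1.0, hsv_factor)
        if np.random.rand() < 0.5:  # scale down as often as up
            f = 1.0 / f
        hsv[..., c] = np.clip(hsv[..., c] * f, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```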

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2–3% in mAP.
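
Greedy non-maximal suppression is standard; a self-contained NumPy version for reference (the 0.5 overlap threshold here is an assumption, not the paper's setting):

```python
import numpy as np

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop remaining boxes that overlap it
    by more than `thresh` IOU, and repeat. Boxes are (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IOU of the top box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= thresh]
    return keep
```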

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast. [5] [38] [31] [14] [17] [28] However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don’t reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Table 1: Real-Time Systems on PASCAL VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8] In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.

4.2. VOC 2007 Error Analysis