YOLO: You Only Look Once: Unified, Real-Time Object Detection

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

  • First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

  • Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

  • Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
At test time we multiply the conditional class probabilities and the individual box confidence predictions,
Pr(Class_i | Object) ∗ Pr(Object) ∗ IOU^truth_pred = Pr(Class_i) ∗ IOU^truth_pred        (1)
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
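As a minimal numpy sketch of Eq. (1) (the array names are illustrative, and it assumes the per-box confidences and per-cell conditional class probabilities have already been pulled out of the network output; S, B, and C take the PASCAL VOC values given below):

```python
import numpy as np

S, B, C = 7, 2, 20                     # grid size, boxes per cell, classes
box_conf = np.random.rand(S, S, B)     # Pr(Object) * IOU, one per box
class_prob = np.random.rand(S, S, C)   # Pr(Class_i | Object), one set per cell

# Eq. (1): one score per (cell, box, class), shape (S, S, B, C).
class_scores = box_conf[..., None] * class_prob[:, :, None, :]
```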

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
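To make the tensor shape concrete, here is a sketch of slicing one 7 × 7 × 30 prediction into its pieces. The channel ordering (B blocks of [x, y, w, h, confidence] followed by the C class probabilities) is an illustrative choice; the paper only fixes the total size B ∗ 5 + C:

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)           # one image's 7 x 7 x 30 output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)    # per box: x, y, w, h, conf
xywh, box_conf = boxes[..., :4], boxes[..., 4]   # (S, S, B, 4) and (S, S, B)
class_prob = pred[..., B * 5:]                   # (S, S, C)
```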
2.1. Network Design
We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [22]. The full network is shown in Figure 3.
We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.
The final output of our network is the 7 × 7 × 30 tensor of predictions.
2.2. Training
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].
We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
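A small sketch of that parametrization (boxes as (center x, center y, width, height) in image-relative [0, 1] coordinates; the function names are ours):

```python
def encode_box(cx, cy, w, h, S=7):
    """Absolute (center, size) in [0, 1] image coords -> YOLO targets."""
    col = min(int(cx * S), S - 1)    # grid cell containing the box center
    row = min(int(cy * S), S - 1)
    x = cx * S - col                 # offsets within the cell, in [0, 1]
    y = cy * S - row
    return row, col, (x, y, w, h)    # w, h stay image-relative

def decode_box(row, col, x, y, w, h, S=7):
    """Inverse mapping back to absolute image coordinates."""
    return ((col + x) / S, (row + y) / S, w, h)
```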
We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

φ(x) = x,      if x > 0
φ(x) = 0.1x,   otherwise        (2)
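As a one-line numpy sketch of Eq. (2):

```python
import numpy as np

def leaky_relu(x):
    """Eq. (2): identity for positive inputs, slope 0.1 otherwise."""
    return np.where(x > 0, x, 0.1 * x)
```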
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, λ_coord and λ_noobj, to accomplish this. We set λ_coord = 5 and λ_noobj = .5.
Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
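A sketch of that assignment rule (a standard corner-format IOU helper plus an argmax; the names are ours, and both boxes are assumed decoded to a common image frame):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) corner boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def responsible_predictor(pred_boxes, truth_box):
    """Index of the predictor with the highest current IOU with the truth."""
    return max(range(len(pred_boxes)),
               key=lambda j: iou(pred_boxes[j], truth_box))
```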
During training we optimize the following, multi-part loss function:
λ_coord Σ_{i=0..S²} Σ_{j=0..B} 1^obj_ij [(x_i − x̂_i)² + (y_i − ŷ_i)²]
  + λ_coord Σ_{i=0..S²} Σ_{j=0..B} 1^obj_ij [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
  + Σ_{i=0..S²} Σ_{j=0..B} 1^obj_ij (C_i − Ĉ_i)²
  + λ_noobj Σ_{i=0..S²} Σ_{j=0..B} 1^noobj_ij (C_i − Ĉ_i)²
  + Σ_{i=0..S²} 1^obj_i Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²        (3)
where 1^obj_i denotes if object appears in cell i and 1^obj_ij denotes that the jth bounding box predictor in cell i is "responsible" for that prediction.
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
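Putting the pieces together, a minimal per-cell sketch of this loss (assuming at most one ground truth per cell, boxes decoded to a common frame for the IOU, and the responsible box's confidence target taken as its current IOU, per the definition in Section 2; the names and per-cell framing are our simplifications):

```python
import numpy as np

L_COORD, L_NOOBJ = 5.0, 0.5   # lambda_coord and lambda_noobj from the text

def iou_xywh(a, b):
    """IOU for (center x, center y, w, h) boxes in a common image frame."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2)
             - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2)
             - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def cell_loss(pred_boxes, pred_conf, pred_classes,
              truth_box=None, truth_class=None):
    """Loss for one grid cell: pred_boxes (B, 4), pred_conf (B,),
    pred_classes (C,) as float arrays; truth_box is None when no
    object center falls in this cell."""
    if truth_box is None:
        # Only the no-object confidence term applies.
        return L_NOOBJ * float(np.sum(np.square(pred_conf)))
    ious = [iou_xywh(b, truth_box) for b in pred_boxes]
    j = int(np.argmax(ious))                   # the "responsible" predictor
    x, y, w, h = pred_boxes[j]
    tx, ty, tw, th = truth_box
    loss = L_COORD * ((x - tx) ** 2 + (y - ty) ** 2)
    loss += L_COORD * ((np.sqrt(w) - np.sqrt(tw)) ** 2
                       + (np.sqrt(h) - np.sqrt(th)) ** 2)
    loss += (pred_conf[j] - ious[j]) ** 2      # responsible box targets its IOU
    loss += L_NOOBJ * sum(pred_conf[k] ** 2    # the others target zero
                          for k in range(len(pred_conf)) if k != j)
    one_hot = np.zeros_like(pred_classes)
    one_hot[truth_class] = 1.0
    loss += float(np.sum(np.square(pred_classes - one_hot)))
    return float(loss)
```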
We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.
Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10⁻³ to 10⁻². If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10⁻² for 75 epochs, then 10⁻³ for 30 epochs, and finally 10⁻⁴ for 30 epochs.
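As a sketch, this schedule is a simple piecewise function of the epoch; the length of the initial ramp is an assumption, since the text only says "the first epochs":

```python
def learning_rate(epoch, warmup=5):
    """Piecewise learning rate schedule from the text (warmup length assumed)."""
    if epoch < warmup:                        # slowly ramp 1e-3 -> 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup
    if epoch < warmup + 75:                   # 1e-2 for 75 epochs
        return 1e-2
    if epoch < warmup + 105:                  # then 1e-3 for 30 epochs
        return 1e-3
    return 1e-4                               # finally 1e-4 for 30 epochs
```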
To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
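A sketch of the color half of this augmentation, operating on an already-converted HSV image in [0, 1] (the RGB-to-HSV conversion and the random scaling/translation, which must also remap the box labels, are left out):

```python
import random
import numpy as np

def jitter_hsv(img_hsv):
    """Scale saturation (channel 1) and exposure/value (channel 2) by up to
    a factor of 1.5, in either direction, as described above."""
    out = img_hsv.copy()
    for ch in (1, 2):
        f = random.uniform(1.0, 1.5)
        if random.random() < 0.5:
            f = 1.0 / f               # scale down as often as up (assumption)
        out[..., ch] = np.clip(out[..., ch] * f, 0.0, 1.0)
    return out
```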
2.3. Inference
Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image (7 × 7 grid cells × 2 boxes each) and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls in to and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
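A sketch of greedy non-maximal suppression as it is conventionally implemented (the 0.5 overlap threshold is a common default, not a value from the paper):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop overlapping ones, repeat."""
    def iou(a, b):   # corner-format (x1, y1, x2, y2) IOU
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```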
2.4. Limitations of YOLO
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.
3. Comparison to Other Detection Systems
Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.
Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network. The network performs feature extraction, bounding box prediction, nonmaximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.
R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].
YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.
Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.
Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.
Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.
Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.
Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.
OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.
MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn’t have to estimate the size, location, or boundaries of the object or predict it’s class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.
4. Experiments
First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.
4.1. Comparison to Other Real-Time Systems
Many research efforts in object detection focus on making standard detection pipelines fast [5] [38] [31] [14] [17] [28]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don't reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.
Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.
We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.
Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM’s relatively low accuracy on detection compared to neural network approaches.
R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Table 1: Real-Time Systems on PASCAL VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.
Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from realtime.
The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8] In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.
4.2. VOC 2007 Error Analysis
To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.
We use the methodology and tools of Hoiem et al. [19] For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:
• Correct: correct class and IOU > .5
• Localization: correct class, .1 < IOU < .5
• Similar: class is similar, IOU > .1
• Other: class is wrong, IOU > .1
• Background: IOU < .1 for any object

Figure 4 shows the breakdown of each error type averaged across all 20 classes.

Figure 4: Error Analysis: Fast R-CNN vs. YOLO These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).
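A sketch of bucketing a single detection by these rules, checked in priority order (the boolean class tests and the "similar" grouping come from the methodology of Hoiem et al. [19], which this paper adopts):

```python
def error_type(correct_class, similar_class, iou_value):
    """Assign one detection to an error bucket per the taxonomy above."""
    if correct_class and iou_value > 0.5:
        return "Correct"
    if correct_class and 0.1 < iou_value < 0.5:
        return "Localization"
    if similar_class and iou_value > 0.1:
        return "Similar"
    if iou_value > 0.1:
        return "Other"
    return "Background"
```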
YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes far fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.
4.3. Combining Fast R-CNN and YOLO
YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.
The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.
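A sketch of that rescoring as we read it: the paper specifies the inputs (YOLO's predicted probability and the overlap between the two boxes) but not the exact boost formula, so the additive form below is an assumption:

```python
def rescore(rcnn_boxes, rcnn_scores, yolo_boxes, yolo_probs,
            box_iou, min_overlap=0.5):
    """Boost Fast R-CNN detections that YOLO also predicts. box_iou is any
    corner-format IOU function (e.g. the one in the NMS sketch above);
    min_overlap is likewise an assumed threshold."""
    boosted = []
    for box, score in zip(rcnn_boxes, rcnn_scores):
        overlaps = [box_iou(box, yb) for yb in yolo_boxes]
        best = max(overlaps, default=0.0)
        if best > min_overlap:
            j = overlaps.index(best)
            # Boost grows with YOLO's probability and the boxes' overlap.
            score = score + yolo_probs[j] * best
        boosted.append(score)
    return boosted
```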

Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.

Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.
The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN’s performance.
Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.
4.4. VOC 2012 Results
On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor, YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.
Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.
4.5. Generalizability: Person Detection in Artwork
Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.
Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.
R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.
DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn’t degrade as much as R-CNN, it starts from a lower AP.
YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.
5. Real-Time Detection In The Wild
YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.

Figure 5: Generalization results on Picasso and People-Art datasets. (a) Picasso Dataset precision-recall curves. (b) Quantitative results on the VOC 2007, Picasso, and People-Art Datasets. The Picasso Dataset evaluates on both AP and best F1 score.

Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate although it does think one person is an airplane.
The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.
6. Conclusion
We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.
Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.
Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS-1338054, and The Allen Distinguished Investigator Award.
References
[1] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In Computer Vision–ECCV 2008, pages 2–15. Springer, 2008.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision (ICCV), 2009.
[3] H. Cai, Q. Wu, T. Corradi, and P. Hall. The cross-depiction problem: Computer vision algorithms for recognising objects in artwork and in photographs. arXiv preprint arXiv:1505.00110, 2015.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[5] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, J. Yagnik, et al. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814–1821. IEEE, 2013.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[7] J. Dong, Q. Chen, S. Yan, and A. Yuille. Towards unified object detection and semantic segmentation. In Computer Vision–ECCV 2014, pages 299–314. Springer, 2014.
[8] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155–2162. IEEE, 2014.
[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[11] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. CoRR, abs/1505.01749, 2015.
[12] S. Ginosar, D. Haas, T. Brown, and J. Malik. Detecting people in cubist art. In Computer Vision–ECCV 2014 Workshops, pages 101–116. Springer, 2014.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
[14] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[15] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In Advances in neural information processing systems, pages 655–663, 2009.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In Computer Vision–ECCV 2014, pages 297–312. Springer, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv preprint arXiv:1406.4729, 2014.
[18] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[19] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In Computer Vision–ECCV 2012, pages 340–353. Springer, 2012.
[20] K. Lenc and A. Vedaldi. R-cnn minus r. arXiv preprint arXiv:1506.06981, 2015.
[21] R. Lienhart and J. Maydt. An extended set of haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I–900. IEEE, 2002.
[22] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[23] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.
[24] D. Mishkin. Models accuracy on imagenet 2012 val. https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val. Accessed: 2015-10-2.
[25] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Computer vision, 1998. sixth international conference on, pages 555–562. IEEE, 1998.
[26] J. Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.
[27] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. CoRR, abs/1412.3128, 2014.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
[29] S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
[31] M. A. Sadeghi and D. Forsyth. 30hz object detection with dpm v5. In Computer Vision–ECCV 2014, pages 65–79. Springer, 2014.
[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[33] Z. Shen and X. Xue. Do more dropouts in pool5 feature maps for better object detection. arXiv preprint arXiv:1409.6911, 2014.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[35] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
[36] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 4:34–47, 2001.
[37] P. Viola and M. J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
[38] J. Yan, Z. Lei, L. Wen, and S. Z. Li. The fastest deformable part model for object detection. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2497–2504. IEEE, 2014.
[39] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014, pages 391–405. Springer, 2014.