You Only Look Once: Unified, Real-Time Object Detection (Detailed Paper Notes)

As a beginner I still find reading research papers somewhat laborious, so I am recording my reading process here to make it easier to revisit later.

You Only Look Once: Unified, Real-Time Object Detection

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform
detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and
associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

In short, a single neural network predicts bounding box locations and class probabilities at the same time, which is what gives YOLO its end-to-end quality.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO,
processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict
false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms
all other detection methods, including DPM and RCNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.

YOLO is very fast; the concrete numbers are quoted above. In some situations YOLO makes a few more localization errors, but it rarely predicts false detections where no object actually exists.

1. Introduction

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image
and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding box, eliminate duplicate detections, and rescore the box based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

Compared with R-CNN: R-CNN first extracts candidate boxes and then runs a classifier on them, followed by post-processing to refine the results. This is why R-CNN is slow.
[Figure 1]

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes
the input image to 448 * 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

(1) First the image is resized to 448*448; (2) it is then fed into a single convolutional network; (3) finally the detections are filtered by a confidence threshold.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple
bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection
performance. This unified model has several benefits over traditional methods of object detection.

The advantages are discussed next:

First, YOLO is extremely fast. Since we frame detection as a regression problem we don’t need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our (anonymous) YouTube channel: https://goo.gl/bEs6Cj.

As mentioned before, speed is the biggest advantage.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Because YOLO learns from the entire image, its probability of misidentifying background as an object is lower than that of other algorithms.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected input.

It can also be applied without breaking down to complex cases such as natural images (my understanding is that natural scenes contain rather cluttered colors, so they are hard to recognize) and artwork.

2. Unified Detection

Our system divides the input image into an SxS grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

First the whole image is divided into an S*S grid; the grid cell in which an object's center falls is responsible for detecting that object.
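To make this concrete, here is a minimal Python sketch of locating the responsible cell; the function name and the edge clamping are my own, not from the paper:

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell containing the object centre."""
    col = min(int(cx / img_w * S), S - 1)  # clamp in case the centre sits on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp in case the centre sits on the bottom edge
    return row, col
```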

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as $\Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

(Formula 1:)

$$\text{confidence} = \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$$

Each grid cell predicts B bounding boxes and their confidence scores. Confidence reflects both the probability that the box contains an object and how accurate the box is; the formula is given above. If there is no object in the cell, the confidence should be 0; otherwise we want the confidence to equal the IOU between the predicted box and the ground-truth box.
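Below is a small self-contained sketch of this confidence definition; the corner-format (x1, y1, x2, y2) boxes and the helper names are my assumptions:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def confidence_target(has_object, pred_box, gt_box):
    """Pr(Object) * IOU: zero if the cell holds no object, else the IOU."""
    return iou(pred_box, gt_box) if has_object else 0.0
```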

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each bounding box description consists of five values: x and y are the coordinates of the box center relative to its grid cell, w and h are relative to the whole image, and confidence represents the IOU between the predicted box and the ground-truth box.
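Here is a hedged sketch of decoding one such prediction back into pixel coordinates; the 448x448 defaults follow the paper, but the function itself is my own:

```python
def decode_box(pred, row, col, S=7, img_w=448, img_h=448):
    """Turn one (x, y, w, h) prediction into absolute pixel corners.

    x, y are the centre offsets inside cell (row, col); w, h are fractions
    of the whole image, as described above.
    """
    x, y, w, h = pred
    cx = (col + x) / S * img_w          # absolute centre x
    cy = (row + y) / S * img_h          # absolute centre y
    bw, bh = w * img_w, h * img_h       # absolute width and height
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2
```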

Each grid cell also predicts C conditional class probabilities, Pr(Classi|Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

Each grid cell also predicts which of the C classes the object belongs to. This is independent of the B boxes: regardless of B, only one set of class probabilities is predicted per cell.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,
Formula 2:

$$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$$
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

Multiplying the two formulas gives a score that simultaneously measures the probability that a given class is in the box and how well the predicted box fits the object.
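A minimal NumPy sketch of this test-time multiplication, with random arrays standing in for real network outputs:

```python
import numpy as np

S, B, C = 7, 2, 20
class_probs = np.random.rand(S, S, C)  # Pr(Class_i | Object), one set per cell
box_conf = np.random.rand(S, S, B)     # Pr(Object) * IOU, one per box

# Broadcast to (S, S, B, C): every box inherits its cell's class
# probabilities, scaled by that box's own confidence.
class_scores = box_conf[:, :, :, None] * class_probs[:, :, None, :]
```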

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7x7x30 tensor.
[Figure 2]
Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an even grid and simultaneously predicts bounding boxes, confidence in those boxes, and class probabilities. These predictions are encoded as an SxSx(B*5 + C) tensor.

In this paper the image is divided into 7x7 grid cells, each cell predicts two bounding boxes, and classification is over 20 classes, so the output tensor is 7x7x(2*5+20) = 7x7x30.
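As a sanity check on the shapes, here is a sketch of unpacking the output; the ordering of the 30 channels is my assumption, since the paper does not fix a layout:

```python
import numpy as np

S, B, C = 7, 2, 20
net_output = np.zeros(S * S * (B * 5 + C))     # 1470 raw values from the network
pred = net_output.reshape(S, S, B * 5 + C)     # the 7x7x30 prediction tensor
boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # per-box (x, y, w, h, conf)
class_probs = pred[..., B * 5:]                # per-cell class probabilities
```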

2.1. Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

The PASCAL VOC detection dataset is used; the initial convolutional layers extract features from the image, while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [33]. Our network has 24 convolutional layers followed by 2 fully connected layers.
However, instead of the inception modules used by GoogLeNet we simply use 1x1 reduction layers followed by 3x3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.

The network architecture is inspired by GoogLeNet's image-classification model: 24 convolutional layers followed by two fully connected layers. Unlike GoogLeNet, however, it does not use inception modules; instead it uses 1x1 reduction layers (1x1 convolutions that shrink the channel dimension, and with it the computation, before the next layer, as in Network in Network by Lin et al.) followed by 3x3 convolutional layers.
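A PyTorch sketch of one such reduction block; the channel sizes are illustrative only (the paper's Figure 3 lists the real ones):

```python
import torch.nn as nn

# The 1x1 conv shrinks the channel count before the more expensive 3x3 conv.
block = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1),             # 1x1 reduction layer
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),  # 3x3 convolution
    nn.LeakyReLU(0.1),
)
```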

[Figure 3: the full network architecture]

This figure lays out the design of each layer, so I will not repeat it here.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a
neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
The final output of our network is the 7x7x30 tensor of predictions.

Deliberate modifications were made in pursuit of even higher speed, giving rise to Fast YOLO.

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [29]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe’s Model Zoo [24].
We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers
to pretrained networks can improve performance [28]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224x224 to 448x448.

For training, the authors did not train the full architecture from scratch. They first pretrained a network of 20 convolutional layers followed by an average-pooling layer and a fully connected layer, then added four convolutional layers and two fully connected layers with randomly initialized weights. The input resolution was also increased, from 224x224 to 448x448.
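Here is a hedged PyTorch sketch of the added detection head; the channel sizes follow Figure 3 of the paper, but the exact module layout is my reconstruction, not the authors' code:

```python
import torch.nn as nn

S, B, C = 7, 2, 20
head = nn.Sequential(
    # Four extra conv layers, randomly initialised; the stride-2 layer
    # brings a 14x14 feature map down to 7x7.
    nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 1024, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
    nn.Flatten(),
    # Two fully connected layers producing the S*S*(B*5+C) = 1470 outputs.
    nn.Linear(1024 * S * S, 4096), nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (B * 5 + C)),
)
```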

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

This describes the normalization: w and h are normalized to the 0-1 range by dividing by the image width and height, and x and y are parametrized as offsets within a grid cell, so they are also bounded between 0 and 1.
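A sketch of building such a normalized training target from an absolute box; the names and the clamping at the image edge are mine:

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Normalise an absolute box (centre cx, cy and size bw, bh in pixels)
    into the (row, col, x, y, w, h) training target described above."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    x = cx / img_w * S - col   # centre offset within the cell, in [0, 1]
    y = cy / img_h * S - row
    return row, col, (x, y, bw / img_w, bh / img_h)
```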

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

The final layer uses a linear activation function; all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$, to accomplish this. We set $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$.
Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width
and height directly.
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

Below is an analysis of the training loss. YOLO treats detection as a regression problem, so it uses a sum-squared-error loss, but with different weights for different terms. First it separates localization error from classification error: the bounding-box coordinate error gets a larger weight, $\lambda_{\text{coord}} = 5$. It then separates the confidence of boxes that contain no object from those that do, giving the former a smaller weight, $\lambda_{\text{noobj}} = 0.5$; all other weights are 1. Plain sum-squared error also weights large and small boxes equally, yet the same coordinate error should matter more in a small box than in a large one. To partially account for this, the network predicts the square roots of the box width and height, so a prediction becomes (x, y, sqrt(w), sqrt(h)).
One more point: each cell predicts several bounding boxes but only one set of class probabilities. During training, if an object really is present in a cell, only the box with the highest IOU against the ground truth is chosen to be responsible for it; the other boxes are treated as containing no object. This makes the box predictors specialize: each becomes better at particular sizes and aspect ratios, improving overall performance. You may wonder what happens if several objects fall into one cell; YOLO can then only pick one of them to train on, which is one of its weaknesses. Note also that for boxes not responsible for any object, only the confidence term enters the loss; their coordinate error cannot be computed. Likewise the classification term is only computed when a cell actually contains an object.
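A sketch of the highest-IOU assignment, reusing the iou() helper from the earlier sketch; pred_boxes is assumed to be a list of B corner-format boxes for one cell:

```python
def responsible_predictor(pred_boxes, gt_box):
    """Index of the predictor (out of B) with the highest IOU with gt_box."""
    return max(range(len(pred_boxes)), key=lambda j: iou(pred_boxes[j], gt_box))
```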
The loss function is computed as follows:

$$
\begin{aligned}
& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

where $\mathbb{1}_{i}^{\text{obj}}$ denotes that an object appears in cell $i$ and $\mathbb{1}_{ij}^{\text{obj}}$ denotes that the $j$-th box predictor in cell $i$ is responsible for that prediction; see the notes above.
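Below is a simplified NumPy sketch of this loss for a single image. It assumes target assignment (the responsibility masks) has already been done and that widths and heights are non-negative; array names and shapes are illustrative, not the authors' code:

```python
import numpy as np

def yolo_loss(pred_boxes, true_boxes, pred_cls, true_cls,
              resp, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss for one image.

    pred_boxes, true_boxes: (S, S, B, 5) arrays of (x, y, w, h, confidence).
    pred_cls, true_cls:     (S, S, C) class probabilities per cell.
    resp:                   (S, S, B), 1.0 where a predictor is "responsible"
                            for an object, 0.0 elsewhere.
    """
    noobj = 1.0 - resp
    cell_has_obj = resp.any(axis=-1)  # (S, S) cells that contain an object

    xy_err = ((pred_boxes[..., :2] - true_boxes[..., :2]) ** 2).sum(-1)
    wh_err = ((np.sqrt(pred_boxes[..., 2:4]) -
               np.sqrt(true_boxes[..., 2:4])) ** 2).sum(-1)
    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    cls_err = ((pred_cls - true_cls) ** 2).sum(-1)

    return (lambda_coord * (resp * (xy_err + wh_err)).sum()
            + (resp * conf_err).sum()
            + lambda_noobj * (noobj * conf_err).sum()
            + (cell_has_obj * cls_err).sum())
```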

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

The loss penalizes classification error only when an object is present in the grid cell, and penalizes coordinate error only for the predictor responsible for the ground-truth box.

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

The above lists some training hyperparameters.

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^-2 for 75 epochs, then decrease to 10^-3 for 30 epochs, and finally decrease again to 10^-4 for 30 epochs.
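As a sketch, the schedule could be coded like this; the paper does not state how long the initial warm-up lasts ("the first epochs"), so 5 epochs is an assumption:

```python
def learning_rate(epoch, warmup=5):
    """Piecewise learning-rate schedule from the paper."""
    if epoch < warmup:                 # slow raise from 1e-3 to 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup
    if epoch < warmup + 75:            # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup + 75 + 30:       # 30 epochs at 1e-3
        return 1e-3
    return 1e-4                        # final 30 epochs at 1e-4
```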

The above describes how the learning rate is varied as training progresses.

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = 0.5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

Dropout is used to avoid overfitting by randomly dropping some nodes. For data augmentation, random scaling and translations of up to 20% of the original image size are introduced, and the exposure and saturation of the image are randomly adjusted by up to a factor of 1.5 in the HSV color space.
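A rough sketch of this augmentation, assuming OpenCV (cv2) for the warp and color conversion, which the paper does not prescribe:

```python
import random
import numpy as np
import cv2  # assumed available; any image library would do

def augment(img):
    """Random scale/translate up to 20% of the image size, plus
    exposure/saturation jitter up to 1.5x in HSV color space."""
    h, w = img.shape[:2]
    scale = random.uniform(0.8, 1.2)
    tx = random.uniform(-0.2, 0.2) * w
    ty = random.uniform(-0.2, 0.2) * h
    M = np.float32([[scale, 0, tx], [0, scale, ty]])
    img = cv2.warpAffine(img, M, (w, h))

    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= random.uniform(1 / 1.5, 1.5)  # saturation
    hsv[..., 2] *= random.uniform(1 / 1.5, 1.5)  # exposure (value)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```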

2.3. Inference

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Because each cell predicts only two bounding boxes (and one class), the number of nearby objects YOLO can detect is limited.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Bounding box predictions are relatively coarse, since the features come after several downsampling layers.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

For small boxes, even a small error has a large effect on IOU; the main source of error is incorrect localization.

3. Comparison to Other Detection Systems

This section compares YOLO with other detection algorithms; I will not go into detail.

4. Experiments

This section gives comparisons with other real-time detection algorithms and experimental results on specific datasets; not repeated here.

5. Real-Time Detection In The Wild

A demo of the system can be found on our YouTube channel:
https://goo.gl/bEs6Cj.

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds
to detection performance and the entire model is trained jointly.
Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.

References