pytorch实现yolov3(1) yolov3基本原理

时间 2019-12-13

标签 pytorch 实现 yolov3 yolov 基本原理繁體版

原文原文链接

理解一个算法最好的就是实现它,对深度学习也同样,准备跟着https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/一点点地实现yolov3.达到熟悉yolov3和pytorch的目的.算法

这篇做为第一篇,讲yolov3基本原理.ide

卷积后的输出

通过basenet(darknet-53)不断的卷积之后获得一个feature map. 咱们就用这个feature map来作预测.
比方说原始输入是416*416*3,一通卷积之后获得一个13*13*depth的feature map.
这个feature map的每个cell都有其对应的感觉野.(简单滴说:即当前cell的值受到原始图像的哪些pixel的影响).因此如今咱们假设每一个cell能够预测出一个boundingbox.boudingbox所框出的object的正中心落于当前cell.函数

You expect each cell of the feature map to predict an object through one of it's bounding boxes if the center of the object falls in the receptive field of that cell. (Receptive field is the region of the input image visible to the cell. Refer to the link on convolutional neural networks for further clarification).学习

好比上图的红色cell负责预测狗这个object.spa

feature map的size为N*N*Depth,其中Depth=(B x (5 + C))

B指每一个cell预测几个boundingbox. 5=4+1. 4表明用于预测boudingbox的四个值,1表明object score,表明这个boundingbox包含目标的几率,C表明要预测的类别个数..net

如何计算predicted box的坐标

Anchor Boxes

anchor box是事先聚类出来的一组值.能够理解为最接近现实的object的宽,高.
yolov3中feature map的每个cell都预测出3个bounding box.可是只选用与ground truth box的IOU最大的作预测.blog

预测

bx, by, bw, bh are the x,y center co-ordinates, width and height of our prediction. tx, ty, tw, th is what the network outputs. cx and cy are the top-left co-ordinates of the grid. pw and ph are anchors dimensions for the box.排序

bx by bw bh是预测值表明预测的bouding box的中心点坐标宽高
tx, ty, tw, th 是卷积获得的feature map在depth方向的值
cx,cy是当前cell左上角坐标
pw,ph是事先聚类获得的anchors值

上图中的σ(tx)是sigmoid函数,以确保它的值在0-1之间.这样才能确保预测出来的坐标坐落在当前cell内.好比cell左上角是(6,6),center算出来是(0.4,0.7),那么预测的boudingbox的中心就是(6.4,6.7),若是算出来center是(1.2,0.7),那boundingbox的中心就落到了(7.2,6.7)了,就再也不是当前cell了,这与咱们的假设是相悖的.(咱们假设当前cell是它负责预测的object的中心).
get

objectness score

这个也是由sigmoid限制到0-1之间,表示包含一个object的几率.input

Class Confidences

表示当前object属于某一个class的几率. yolov3再也不使用softmax获得.由于softmax默认是排他的.即一个object属于class1,就不可能属于class2. 但实际上一个object可能既属于women又属于person.

多尺度检测

yolov3借鉴了特征金字塔的概念,引入了多尺度检测,使得对小目标检测效果更好.
以416*416为例,一系列卷积之后获得13*13的feature map.这个feature map有比较丰富的语义信息,可是分辨率不行.因此经过upsample生成26*26,52*52的feature map,语义信息损失不大,分辨率又提升了,从而对小目标检测效果更好.

对416 x 416, 预测出((52 x 52) + (26 x 26) + 13 x 13)) x 3 = 10647个bounding boxes.经过object score排序,滤掉score太低的,再经过nms逐步肯定最终的bounding box.

nms解释看下这个https://blog.csdn.net/zchang81/article/details/70211851.
简单滴说就是每一轮都标记出一个score最高的,把和最高的这个box相似的box去掉,循环反复,最终就获得了最终的box.

refrence:https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/