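MXNet (33): Detecting Bananas with Single Shot Multibox Detection (SSD)

1. Banana detection dataset

Object detection has no small dataset comparable to MNIST or Fashion-MNIST. To test models quickly, we can assemble one ourselves. First, we generate 1000 banana images with different angles and sizes. Then we collect some background images and place a banana image at a random position in each of them. The finished banana detection dataset can be downloaded online.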

from d2l import mxnet as d2l
from mxnet import gluon, image, np, npx, autograd, init
from mxnet.gluon import nn
from plotly import graph_objs as go, express as px
from plotly.subplots import make_subplots
from IPython.display import Image
import plotly.io as pio
import os
pio.kaleido.scope.default_format = "svg"
npx.set_np()

d2l.DATA_HUB['bananas'] = (d2l.DATA_URL + 'bananas.zip',
                           'aadfd1c4c5d7178616799dd1801c9a234ccdaf19')

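For each training image, we use random cropping and require the cropped image to cover at least 95% of each object. Because the cropping is random, this requirement is not always met, so we cap the number of cropping attempts at 200. If none of the attempts satisfies the requirement, the image is left uncropped. To keep the output deterministic, we do not randomly crop images in the test (validation) dataset.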

def load_data_bananas(batch_size, edge_size=256):
    data_dir = d2l.download_extract('bananas')
    train_iter = image.ImageDetIter(
        path_imgrec=os.path.join(data_dir, 'train.rec'),
        path_imgidx=os.path.join(data_dir, 'train.idx'),
        batch_size=batch_size,
        data_shape=(3, edge_size, edge_size),  # shape of the output image
        shuffle=True,  # read the dataset in random order
        rand_crop=1,  # probability of applying a random crop is 1
        min_object_covered=0.95, max_attempts=200)
    val_iter = image.ImageDetIter(
        path_imgrec=os.path.join(data_dir, 'val.rec'), batch_size=batch_size,
        data_shape=(3, edge_size, edge_size), shuffle=False)
    return train_iter, val_iter
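
The retry logic described above can be illustrated without MXNet. The following is only a rough sketch, not the algorithm that image.ImageDetIter actually implements: the helper names covered_fraction and try_random_crop, the way a crop window is sampled, and the use of normalized (x0, y0, x1, y1) boxes are all assumptions made for illustration.

```python
import random

def covered_fraction(box, crop):
    """Fraction of the area of `box` that lies inside `crop`.

    Both arguments are (x0, y0, x1, y1) tuples in normalized coordinates.
    """
    iw = max(0.0, min(box[2], crop[2]) - max(box[0], crop[0]))
    ih = max(0.0, min(box[3], crop[3]) - max(box[1], crop[1]))
    area = (box[2] - box[0]) * (box[3] - box[1])
    return iw * ih / area

def try_random_crop(boxes, min_object_covered=0.95, max_attempts=200,
                    rng=random):
    """Return a crop window that covers every box well enough, or None.

    None means that no attempt met the requirement, so the caller should
    leave the image uncropped.
    """
    for _ in range(max_attempts):
        # Hypothetical sampling scheme: any window at least 0.3 wide and tall.
        x0, y0 = rng.uniform(0.0, 0.5), rng.uniform(0.0, 0.5)
        x1, y1 = rng.uniform(x0 + 0.3, 1.0), rng.uniform(y0 + 0.3, 1.0)
        crop = (x0, y0, x1, y1)
        if all(covered_fraction(b, crop) >= min_object_covered for b in boxes):
            return crop
    return None

banana = (0.4, 0.4, 0.6, 0.6)
crop = try_random_crop([banana], rng=random.Random(0))
print('uncropped' if crop is None else crop)
```

Returning None rather than raising an error mirrors the behaviour stated in the text: when all 200 attempts fail, the image is simply used as it is.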

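The image batch has the same shape as in earlier experiments: (batch size, number of channels, height, width). The label batch has shape (batch size, m, 5), where m is the largest number of bounding boxes contained in any single image of the dataset. Minibatch computation is efficient, but it requires every image to contain the same number of bounding boxes so that they can be stacked into one batch. Since images may have different numbers of bounding boxes, we pad every image that has fewer than m of them with illegal bounding boxes until it has m. Each bounding box label is an array of length 5. The first element is the class of the object in the box; a value of -1 marks an illegal box that is used only for padding. The remaining four elements are the x and y coordinates of the upper-left corner followed by the x and y coordinates of the lower-right corner, each in the range 0 to 1.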

batch_size, edge_size = 32, 256
train_iter, _ = load_data_bananas(batch_size, edge_size)
batch = train_iter.next()
batch.data[0].shape, batch.label[0].shape

# ((32, 3, 256, 256), (32, 1, 5))
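
To make the padding rule concrete, here is a small pure-Python sketch. The function name pad_labels is hypothetical, and filling the whole padding row with -1 is an assumption; the text above only states that the class entry of a padding box is -1.

```python
def pad_labels(labels_per_image, pad_value=-1.0):
    """Pad each image's list of [class, x0, y0, x1, y1] rows to m rows.

    m is the largest number of boxes found in any image of the input.
    """
    m = max(len(boxes) for boxes in labels_per_image)
    padded = []
    for boxes in labels_per_image:
        filler = [[pad_value] * 5 for _ in range(m - len(boxes))]
        padded.append([list(b) for b in boxes] + filler)
    return padded

labels = [
    [[0, 0.1, 0.2, 0.3, 0.4]],                             # one box
    [[0, 0.5, 0.5, 0.9, 0.9], [0, 0.0, 0.0, 0.2, 0.2]],    # two boxes
]
padded = pad_labels(labels)
print(len(padded[0]), padded[0][1])  # 2 [-1.0, -1.0, -1.0, -1.0, -1.0]
```

After padding, every image has the same number of rows, so the labels can be stacked into a single (batch size, m, 5) array.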

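2. Data demonstration

Below we display ten images with their bounding boxes. The angle, size, and position of the banana differ from image to image. Of course, this is a simple artificial dataset; in practice, data are usually much more complicated.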

def show_imgs(imgs, num_rows=2, num_cols=4, scale=0.8, labels=None) :
    fig = make_subplots(num_rows, num_cols)
    for i in range(num_rows):
        for j in range(num_cols):
            z = imgs[num_cols*i+j].asnumpy()
            fig.add_trace(go.Image(z=z),i+1,j+1)
            if labels is not None:
                x0, y0, x1, y1 = labels[num_cols*i+j][0][1:5] * edge_size
                fig.add_shape(type="rect",x0=x0,y0=y0,x1=x1,y1=y1,line=dict(color="white"),row=i+1, col=j+1)
            fig.update_xaxes(visible=False, row=i+1, col=j+1)
            fig.update_yaxes(visible=False, row=i+1, col=j+1)
    img_bytes = fig.to_image(format="png", scale=scale, engine="kaleido")
    return img_bytes
    
imgs = (batch.data[0][0:10].transpose(0, 2, 3, 1))
Image(show_imgs(imgs, 2, 5, scale=2, labels= batch.label[0][0:10]))

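3. Single shot multibox detection (SSD)

We now build an object detection model: single shot multibox detection (SSD). Its main components are a base network block and several multiscale feature blocks connected in series. The base network block extracts features from the original image and usually takes the form of a deep convolutional neural network. We can design the base network so that its output has a larger height and width; more anchor boxes are then generated from this feature map, which lets us detect smaller objects. Each subsequent multiscale feature block reduces the height and width of the feature map from the previous layer (for example, by half), so each element of its feature map has a larger receptive field on the input image. The closer a multiscale feature block is to the top of the network, the smaller its output feature map and the fewer anchor boxes are generated from it. At the same time, the closer a feature block is to the top, the larger the receptive field of each element in its feature map and the better suited it is to detecting larger objects. Because SSD generates different numbers of anchor boxes of different sizes from the base network block and from each multiscale feature block, and then predicts the classes and offsets of those anchor boxes (that is, the predicted bounding boxes) in order to detect objects of different sizes, SSD is a multiscale object detection model.

3.1 Class prediction layer

Let the number of object classes be $q$. Counting class 0, which denotes the background, there are $q+1$ anchor box classes. Let the height and width of the feature map be $h$ and $w$. If we generate $a$ anchor boxes centered on each element, we need to classify $hwa$ anchor boxes in total. Using a fully connected layer for the output would likely lead to too many model parameters. The class prediction layer instead reduces model complexity by using a convolutional layer that preserves the height and width of its input, so that outputs and inputs correspond one to one at each spatial coordinate of the feature map.

We now define the class prediction layer. Given $a$ and $q$, it is a $3\times 3$ convolutional layer with padding 1 and $a(q+1)$ output channels. The height and width of its input and output are the same.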

def cls_predictor(num_anchors, num_classes):
    return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3, padding=1)

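3.2 Bounding box prediction layer

The bounding box prediction layer is designed like the class prediction layer. The only difference is that here we predict 4 offsets for each anchor box rather than $q+1$ classes.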

def bbox_predictor(num_anchors):
    return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)

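3.3 Concatenating predictions for multiple scales

SSD uses feature maps at multiple scales to generate anchor boxes and predict their classes and offsets. Because the shape and number of anchor boxes centered on the same element differ between feature maps of different scales, the prediction outputs at different scales may have different shapes.

In the example below, we build feature maps at two different scales, Y1 and Y2, from the same batch of data; the height and width of Y2 are half those of Y1. Taking class prediction as an example, suppose each element of Y1 and Y2 generates five and three anchor boxes, respectively. With 10 object classes, the numbers of class prediction output channels are $5\times(10+1)=55$ and $3\times(10+1)=33$. The prediction outputs have the format (batch size, number of channels, height, width). As you can see, every dimension except the batch size differs between the two. We must therefore convert them to a consistent format and concatenate the predictions from multiple scales to simplify later computation.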

def forward(x, block):
    block.initialize()
    return block(x)

Y1 = forward(np.zeros((2, 8, 20, 20)), cls_predictor(5, 10))
Y2 = forward(np.zeros((2, 16, 10, 10)), cls_predictor(3, 10))
(Y1.shape, Y2.shape)

# ((2, 55, 20, 20), (2, 33, 10, 10))

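The channel dimension holds the predictions for all anchor boxes with the same center. We first move the channel dimension to the last position. Because the batch size is the same at every scale, we can then flatten each prediction into a 2D array of shape (batch size, height × width × number of channels), which makes concatenation along dimension 1 straightforward.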

def flatten_pred(pred):
    return npx.batch_flatten(pred.transpose(0, 2, 3, 1))

def concat_preds(preds):
    return np.concatenate([flatten_pred(p) for p in preds], axis=1)

Thus, although Y1 and Y2 have different shapes, we can still concatenate the predictions made at these two scales for the same batch.
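
As a quick check of the arithmetic, the shape of the concatenated result can be worked out from the shapes alone. This sketch uses plain Python rather than MXNet; the helper names cls_channels and flattened_size are hypothetical and only mirror the shape bookkeeping of cls_predictor and flatten_pred.

```python
def cls_channels(num_anchors, num_classes):
    """Output channels of the class prediction layer: a * (q + 1)."""
    return num_anchors * (num_classes + 1)

def flattened_size(shape):
    """Length of one example after flattening (batch, C, H, W) per example."""
    _, channels, height, width = shape
    return height * width * channels

y1_shape = (2, cls_channels(5, 10), 20, 20)   # (2, 55, 20, 20)
y2_shape = (2, cls_channels(3, 10), 10, 10)   # (2, 33, 10, 10)

total = flattened_size(y1_shape) + flattened_size(y2_shape)
print((y1_shape[0], total))  # (2, 25300)
```

The two scales contribute 55 × 20 × 20 = 22000 and 33 × 10 × 10 = 3300 values per example, so concatenating along dimension 1 gives 25300 columns.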

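3.4 Height and width downsampling block

For multiscale object detection, we define the following down_sample_blk block, which reduces the height and width by 50%. It consists of two $3\times 3$ convolutional layers with padding 1, each followed by batch normalization and a ReLU activation, and a $2\times 2$ max pooling layer with stride 2, all connected in series.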

def down_sample_blk(num_channels):
    blk = nn.Sequential()
    for _ in range(2):
        blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
                nn.BatchNorm(in_channels=num_channels),
                nn.Activation('relu'))
    blk.add(nn.MaxPool2D(2))
    return blk

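Running a forward computation through the height and width downsampling block shows that it changes the number of input channels and halves the height and width.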

forward(np.zeros((2, 3, 20, 20)), down_sample_blk(10)).shape

# (2, 10, 10, 10)
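
The halving can also be verified by hand with the usual output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. The sketch below is plain Python; out_size and down_sample_size are hypothetical helper names, and the pooling stride is taken to be 2, as described in the text.

```python
def out_size(n, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

def down_sample_size(n):
    """Height (or width) after one down_sample_blk."""
    for _ in range(2):
        n = out_size(n, kernel=3, padding=1)   # 3x3 conv, padding 1: unchanged
    return out_size(n, kernel=2, stride=2)     # 2x2 max pooling, stride 2: halved

print(down_sample_size(20))  # 10

# Chaining the block several times, starting from a 256-pixel edge.
n, trace = 256, []
for _ in range(6):
    n = down_sample_size(n)
    trace.append(n)
print(trace)  # [128, 64, 32, 16, 8, 4]
```

The convolutions keep the size because the padding of 1 offsets the 3 × 3 kernel, so only the pooling layer changes the height and width.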

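3.5 Base network block

The base network block extracts features from the original image. To keep the computation simple, we build a small base network. It consists of three height and width downsampling blocks connected in series and doubles the number of channels at each step. Given an input image of shape $256\times 256$, the base network block outputs a feature map of shape $32\times 32$.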

def base_net():
    blk = nn.Sequential()
    for num_filters in [16, 32, 64]:
        blk.add(down_sample_blk(num_filters))
    return blk

forward(np.zeros((2, 3, 256, 256)), base_net()).shape

# (2, 64, 32, 32)

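3.6 The complete model

The SSD model contains five modules in total. Each module outputs a feature map that is used to generate anchor boxes and to predict the classes and offsets of those anchor boxes. The first module is the base network block, modules two to four are height and width downsampling blocks, and the fifth module is a global max pooling layer that reduces the height and width to 1.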

def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 4:
        blk = nn.GlobalMaxPool2D()
    else:
        blk = down_sample_blk(128)
    return blk

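Next we define the forward computation for each module. Unlike the convolutional neural networks described earlier, a module here returns not only the feature map Y produced by the convolutional computation, but also the anchor boxes of the current scale generated from Y, together with their predicted classes and offsets.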

def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = npx.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)

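The closer a multiscale feature block is to the top of the network, the larger the objects it detects and the larger the anchor boxes it must generate. Here we first divide the interval from 0.2 to 1.05 into five equal parts to obtain the smaller anchor box size at each scale: 0.2, 0.37, 0.54, 0.71, and 0.88. The larger size at each scale is the geometric mean of two consecutive values: $\sqrt{0.2 \times 0.37} = 0.272$, $\sqrt{0.37 \times 0.54} = 0.447$, and so on.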

sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],[0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
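
The hard-coded sizes list can be reproduced from the rule just described. This is a small sketch in plain Python; the function name anchor_sizes and the rounding to two and three decimal places are assumptions chosen to match the values written out above.

```python
import math

def anchor_sizes(s_min=0.2, s_max=1.05, num_scales=5):
    """Return [small, large] anchor sizes for each scale.

    The small sizes split [s_min, s_max] into equal steps; each large size
    is the geometric mean of a small size and the next one.
    """
    step = (s_max - s_min) / num_scales
    small = [round(s_min + step * i, 2) for i in range(num_scales + 1)]
    return [[small[i], round(math.sqrt(small[i] * small[i + 1]), 3)]
            for i in range(num_scales)]

for pair in anchor_sizes():
    print(pair)
# [0.2, 0.272]
# [0.37, 0.447]
# [0.54, 0.619]
# [0.71, 0.79]
# [0.88, 0.961]
```

With two sizes and three aspect ratios per scale, each feature map element is the center of 2 + 3 - 1 = 4 anchor boxes, which is the value of num_anchors above.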

Now we define the complete model, TinySSD.

class TinySSD(nn.Block):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        for i in range(5):
            # Equivalent to the assignment self.blk_i = get_blk(i), done dynamically with setattr
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # getattr(self, f'blk_{i}') retrieves the block registered as self.blk_i
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
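        # Merge all scales; the reshape below keeps the batch size unchanged
        # and groups the class scores of each anchor box into the last axis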
        anchors = np.concatenate(anchors, axis=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds

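We now create an SSD model instance and use it to run a forward computation on a minibatch X of images whose height and width are 256 pixels. As verified earlier, the first module outputs a feature map of shape $32\times 32$. Because modules two to four are height and width downsampling blocks, module five is a global pooling layer, and each element of a feature map serves as the center of 4 anchor boxes, a total of $(32^2+16^2+8^2+4^2+1)\times 4=5444$ anchor boxes are generated for each image over the five scales.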

net = TinySSD(num_classes=1)
net.initialize()
X = np.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
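
The anchor count, and the sizes of the prediction tensors that follow from it, can be checked with a few lines of arithmetic. The sketch below is plain Python with a hypothetical helper total_anchors; the feature map edge lengths are the ones derived in the text.

```python
def total_anchors(feature_sizes, anchors_per_element):
    """Number of anchor boxes per image over square feature maps."""
    return sum(s * s for s in feature_sizes) * anchors_per_element

feature_sizes = [32, 16, 8, 4, 1]   # edge length of each module's feature map
num_anchors_per_element = 4
num_classes = 1

n = total_anchors(feature_sizes, num_anchors_per_element)
print(n)                       # 5444
print(n * (num_classes + 1))   # 10888 class scores per image
print(n * 4)                   # 21776 offset values per image
```

Each anchor box receives num_classes + 1 class scores and 4 offsets, which is why the class predictions are reshaped so that their last axis has length num_classes + 1 while the offset predictions stay flattened.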

4. Training

4.1 Initialization

We load the dataset, initialize the model parameters, and define the optimization algorithm.

batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)

device, net = npx.gpu(), TinySSD(num_classes=1)
net.initialize(init=init.Xavier(), ctx=device)
trainer = gluon.Trainer(net.collect_params(), 'sgd', { 'learning_rate': 0.2, 'wd': 5e-4})

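4.2 Defining the loss and evaluation functions

Object detection involves two kinds of loss. The first is the anchor box class loss, for which we can simply reuse the cross-entropy loss function from image classification. The second is the offset loss for positive anchor boxes. Offset prediction is a regression problem; here we use the L1 norm loss, which is the absolute value of the difference between the predicted and true values. In calc_loss below, bbox_masks zeroes out the anchor boxes that should not contribute to the offset loss, and the two losses are added together.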

cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
bbox_loss = gluon.loss.L1Loss()

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    cls = cls_loss(cls_preds, cls_labels)
    bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
    return cls + bbox

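We can use accuracy to evaluate the classification results. Since we use the L1 norm loss, we use the mean absolute error to evaluate the bounding box predictions.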

def cls_eval(cls_preds, cls_labels):
    # Class scores lie along the last axis, so argmax over it gives the predicted class
    return float((cls_preds.argmax(axis=-1).astype(cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((np.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())
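
The effect of the mask in the offset loss is easy to see on a toy example. The following pure-Python sketch is an illustration only: the name masked_l1 is hypothetical, and averaging over all entries (masked or not) is an assumption meant to mirror taking the mean of the element-wise absolute differences after the mask has been applied.

```python
def masked_l1(preds, labels, masks):
    """Mean absolute error in which entries with mask 0 contribute nothing."""
    total = sum(abs(p - t) * m for p, t, m in zip(preds, labels, masks))
    return total / len(preds)

# Two anchor boxes with 4 offsets each; the second box is masked out.
preds = [0.5, 0.25, 1.0, -1.0, 9.0, 9.0, 9.0, 9.0]
labels = [0.0, 0.25, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0]
masks = [1, 1, 1, 1, 0, 0, 0, 0]

print(masked_l1(preds, labels, masks))  # 0.1875
```

Even though the predictions for the masked anchor box are far from its labels, they add nothing to the loss, so only the unmasked anchor boxes drive the offset regression.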

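4.3 Training the model

During the model's forward computation, we generate multiscale anchor boxes (anchors) and predict the class (cls_preds) and offset (bbox_preds) of each anchor box. We then label the class (cls_labels) and offset (bbox_labels) of each generated anchor box using the ground-truth label information. Finally, we compute the loss function from the predicted and labeled classes and offsets. To keep the code simple, we do not evaluate on a separate test dataset here; the reported metrics are computed on the training data.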

def train(train_iter, num_epochs, loss_fn, device):
    timer = d2l.Timer()
    cls_err_lst, bbox_mae_lst =[], []
    for epoch in range(num_epochs):
        # accuracy_sum, mae_sum, num_examples, num_labels
        metric = d2l.Accumulator(4)
        train_iter.reset()  # Read data from the start.
        for batch in train_iter:
            timer.start()
            X = batch.data[0].as_in_ctx(device)
            Y = batch.label[0].as_in_ctx(device)
            with autograd.record():
                # Generate multiscale anchor boxes and predict the class and offset of each
                anchors, cls_preds, bbox_preds = net(X)
                # Label the class and offset of each anchor box from the ground truth Y
                bbox_labels, bbox_masks, cls_labels = npx.multibox_target(
                    anchors, Y, cls_preds.transpose(0, 2, 1))
                # Compute the loss from the predicted and labeled classes and offsets
                l = loss_fn(cls_preds, cls_labels, bbox_preds, bbox_labels,
                              bbox_masks)
            l.backward()
            trainer.step(batch_size)
            metric.add(cls_eval(cls_preds, cls_labels), cls_labels.size,
                       bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                       bbox_labels.size)
        cls_err_lst.append(1-metric[0]/metric[1])
        bbox_mae_lst.append(metric[2]/metric[3])
    print(f'class err {cls_err_lst[-1]:.2e}, bbox mae {bbox_mae_lst[-1]:.2e}')
    print(f'{train_iter.num_image/timer.stop():.1f} examples/sec on '
          f'{str(device)}')
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(range(1, num_epochs+1)), y=cls_err_lst, name='class error', 
                  mode='lines+markers'))
    fig.add_trace(go.Scatter(x=list(range(1, num_epochs+1)), y=bbox_mae_lst, name='bbox mae',
                  mode='lines+markers'))
    fig.update_layout(width=800, height=480, xaxis_title='epoch', xaxis_range=[1,num_epochs])
    fig.show()
    
    
num_epochs = 20
train(train_iter, num_epochs, calc_loss, device)

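5. Prediction

In the prediction stage, we want to detect all objects of interest in an image. Below, we read a test image and resize it, then convert it to the four-dimensional format required by the convolutional layers.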

img = image.imread('img/banana.jpg')
feature = image.imresize(img, 256, 256).astype('float32')
X = np.expand_dims(feature.transpose(2, 0, 1), axis=0)

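We define a function that predicts bounding boxes from the anchor boxes and their predicted offsets, and then uses non-maximum suppression to remove similar bounding boxes. Rows whose class is -1 are dropped from the result.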

def predict(X):
    anchors, cls_preds, bbox_preds = net(X.as_in_ctx(device))
    cls_probs = npx.softmax(cls_preds).transpose(0, 2, 1)
    output = npx.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)
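
Non-maximum suppression itself is done inside npx.multibox_detection, so it is not visible in the code above. The following pure-Python sketch shows the general idea only and is not MXNet's implementation; the function names iou and nms and the IoU threshold of 0.5 are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first.

    A box is kept only if it does not overlap an already kept box by more
    than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0.10, 0.10, 0.50, 0.50),   # best prediction for one banana
         (0.12, 0.10, 0.52, 0.50),   # near duplicate of the first box
         (0.60, 0.60, 0.90, 0.90)]   # a separate detection
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.9, so it is suppressed, while the third box does not overlap either of the others and is kept.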

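Finally, we display all bounding boxes whose confidence is at least the chosen threshold (0.9 in the call below) as the final output.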

def display(img, output, threshold, scale=1.5):        
    fig = go.Figure()
    fig.add_trace(go.Image(z=img.asnumpy()))
    score_lst, x, y =[], [], [] 
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[0:2]
        bbox = [row[2:6] * np.array((w, h, w, h), ctx=row.ctx)]
        x0, y0, x1, y1 = bbox[0]
        score_lst.append(f'{score:.2f}')
        x.append(float(x0)+img.shape[0]*0.04)
        y.append(float(y0)+img.shape[0]*0.02)
        fig.add_shape(type="rect",x0=x0,y0=y0,x1=x1,y1=y1,line=dict(color="white"))
    fig.add_trace(go.Scatter(mode='text', x=x, y=y, text=score_lst, textfont={ 'color':'red','size':10}))
    img_bytes = fig.to_image(format="png", scale=scale, engine="kaleido")
    return img_bytes

Image(display(img, output, threshold=0.9))
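
The filtering and rescaling done inside display can be separated from the plotting. The sketch below is plain Python and assumes, as the indexing in display suggests, that each output row has the form (class, score, x0, y0, x1, y1) with normalized coordinates; the function name to_pixel_boxes is hypothetical.

```python
def to_pixel_boxes(rows, width, height, threshold):
    """Keep confident detections and convert their boxes to pixel units."""
    result = []
    for cls_id, score, x0, y0, x1, y1 in rows:
        if cls_id == -1 or score < threshold:
            continue   # skip discarded rows and low-confidence boxes
        result.append((score, (x0 * width, y0 * height,
                               x1 * width, y1 * height)))
    return result

rows = [(0, 0.95, 0.25, 0.5, 0.75, 1.0),
        (0, 0.40, 0.10, 0.1, 0.30, 0.3),
        (-1, 0.99, 0.00, 0.0, 1.00, 1.0)]
print(to_pixel_boxes(rows, width=400, height=200, threshold=0.9))
# [(0.95, (100.0, 100.0, 300.0, 200.0))]
```

The x coordinates are multiplied by the image width and the y coordinates by the image height, which matches the (w, h, w, h) scaling used in display.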

6. References

https://d2l.ai/chapter_computer-vision/ssd.html

7. Code

github