MXNet (35): Semantic Segmentation with a Fully Convolutional Network (FCN)

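1. Transposed Convolution

A transposed convolution layer is used to increase the width and height of its input.

Consider a basic case in which the input and output channels are both 1, the padding is 0, and the stride is 1. The figure below illustrates how a transposed convolution with a 2×2 kernel computes a 3×3 output from a 2×2 input matrix.

The process above translates into the following code, where the kernel is K and the input is X: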

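# Imports assumed by the code in this post
from d2l import mxnet as d2l
from mxnet import autograd, gluon, image, init, np, npx
from mxnet.gluon import nn
import plotly.express as px
import plotly.graph_objects as go

npx.set_np()  # enable the NumPy-compatible interface

def trans_conv(X, K):
    h, w = K.shape
    # The output grows by (kernel size - 1) along each dimension
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Each input element adds a scaled copy of the kernel at offset (i, j)
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y

X = np.array([[0, 1], [2, 3]])
K = np.array([[0, 1], [2, 3]])
trans_conv(X, K)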

Gluon's nn.Conv2DTranspose gives the same result. As with nn.Conv2D, both the input and the kernel must be 4-D tensors.

X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
tconv = nn.Conv2DTranspose(1, kernel_size=2)
tconv.initialize(init.Constant(K))
tconv(X)

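1.1 Padding, Stride, and Channels

In a convolution, padding elements are applied to the input; in a transposed convolution, they are applied to the output. A 1×1 padding means that we first compute the output as usual and then remove the first and last rows and columns.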

tconv = nn.Conv2DTranspose(1, kernel_size=2, padding=1)
tconv.initialize(init.Constant(K))
tconv(X)

# array([[[[4.]]]])
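
To see why padding=1 leaves a single value, the computation can be redone by hand. The sketch below is only an illustration: it uses standard NumPy instead of mxnet.np so that it runs without MXNet, and the helper name trans_conv_padded is hypothetical. It computes the full stride-1 output and then crops `padding` rows and columns from every border.

```python
import numpy as np

def trans_conv_padded(X, K, padding=0):
    """Basic transposed convolution (stride 1), then crop `padding` from each border."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i:i + h, j:j + w] += X[i, j] * K
    if padding > 0:
        Y = Y[padding:-padding, padding:-padding]
    return Y

X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
full = trans_conv_padded(X, K)                # 3x3 output
cropped = trans_conv_padded(X, K, padding=1)  # only the centre element remains
print(full)
print(cropped)  # [[4.]]
```

The centre element of the 3×3 result is 4, which agrees with the output of the padded layer above.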

Likewise, the stride is applied to the output rather than the input:

tconv = nn.Conv2DTranspose(1, kernel_size=2, strides=2)
tconv.initialize(init.Constant(K))
tconv(X)
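
As a rough illustration of what "stride applied to the output" means, the sketch below extends the hand-written loop so that input element (i, j) writes its scaled kernel copy at offset (i * stride, j * stride). It uses standard NumPy rather than mxnet.np, and trans_conv_strided is a hypothetical helper name, not part of Gluon.

```python
import numpy as np

def trans_conv_strided(X, K, stride=1):
    """Transposed convolution where input element (i, j) writes at (i*stride, j*stride)."""
    h, w = K.shape
    out_h = (X.shape[0] - 1) * stride + h
    out_w = (X.shape[1] - 1) * stride + w
    Y = np.zeros((out_h, out_w))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i * stride:i * stride + h, j * stride:j * stride + w] += X[i, j] * K
    return Y

X = np.array([[0.0, 1.0], [2.0, 3.0]])
K = np.array([[0.0, 1.0], [2.0, 3.0]])
Y = trans_conv_strided(X, K, stride=2)
print(Y.shape)  # (4, 4)
print(Y)
```

With a 2×2 kernel and stride 2 the kernel copies no longer overlap, so the 4×4 output consists of four separate 2×2 blocks.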

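A transposed convolution can also restore the number of channels, here reducing it from 20 back to 10. The transposed convolution below changes the shape in exactly the opposite way to the convolution above it: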

X = np.random.uniform(size=(1, 10, 16, 16))
conv = nn.Conv2D(20, kernel_size=5, padding=2, strides=3)
tconv = nn.Conv2DTranspose(10, kernel_size=5, padding=2, strides=3)
conv.initialize()
tconv.initialize()
tconv(conv(X)).shape == X.shape

# True
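
The shape round trip can be checked with the usual output-size formulas. This is a small sketch that assumes no dilation and no output padding; the function names are hypothetical.

```python
def conv_out_size(n, kernel, padding=0, stride=1):
    """Output length of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def tconv_out_size(n, kernel, padding=0, stride=1):
    """Output length of a transposed convolution along one spatial dimension."""
    return (n - 1) * stride - 2 * padding + kernel

n = 16
down = conv_out_size(n, kernel=5, padding=2, stride=3)
up = tconv_out_size(down, kernel=5, padding=2, stride=3)
print(down, up)  # 6 16
```

The round trip is exact here because n + 2 * padding - kernel = 15 is divisible by the stride. For an input size such as 17 the floor division discards a remainder, and the transposed convolution would return 16 rather than 17.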

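2. Fully Convolutional Network (FCN)

A fully convolutional network uses a convolutional neural network to transform image pixels into pixel classes. Unlike the convolutional neural networks introduced earlier, an FCN uses a transposed convolution layer to transform the height and width of the intermediate feature map back to the size of the input image, so that the predictions correspond one-to-one with the pixels of the input image in the spatial dimensions (height and width). Given a position in the spatial dimensions, the output along the channel dimension is the class prediction for the pixel at that position.

2.1 Building the Model

A fully convolutional network first uses a convolutional neural network to extract image features, then transforms the number of channels into the number of classes through a 1×1 convolution layer, and finally transforms the height and width of the feature map to the size of the input image with a transposed convolution layer. The model output has the same height and width as the input image and corresponds to it one-to-one in spatial position. The final output channels contain the class predictions for the pixels at the corresponding spatial positions.

Below, a ResNet-18 model pretrained on ImageNet is fine-tuned. The last two layers of the model's features member are the global average pooling layer GlobalAvgPool2D and the flattening layer Flatten, and the output module contains the fully connected layer used for output. A fully convolutional network needs none of these layers.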

pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.features[-4:], pretrained_net.output

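Next, create the fully convolutional network instance net. It reuses every layer of pretrained_net.features except the last two, together with their pretrained parameters.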

net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
    net.add(layer)

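Given an input with a height of 320 and a width of 480, the forward computation reduces the height and width to 1/32 of the original, that is, to 10 and 15.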

X = np.random.uniform(size=(1, 3, 320, 480))
net(X).shape

# (1, 512, 10, 15)

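Next, a 1×1 convolution layer transforms the number of output channels into the number of classes in the dataset, which is 21 for Pascal VOC2012. A transposed convolution layer then enlarges the height and width by a factor of 32. Setting the stride to 32, the padding to 32/2 = 16, and the kernel size to 64×64 achieves this 32-fold magnification.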

num_classes = 21
net.add(
    nn.Conv2D(num_classes, kernel_size=1),
    nn.Conv2DTranspose(num_classes, kernel_size=64, padding=16, strides=32)
)
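
As a quick check of these settings, assume the usual output-size formula for a transposed convolution without dilation or output padding, where n is the input size, s the stride, p the padding, and k the kernel size:

```latex
n_{\text{out}} = (n - 1)\,s - 2p + k
\qquad\Longrightarrow\qquad
n_{\text{out}} = (n - 1)\cdot 32 - 2\cdot 16 + 64 = 32\,n
```

The feature map sizes 10 and 15 are therefore mapped to 320 and 480. More generally, under the same assumption, choosing k = 2s and p = s/2 gives n_out = (n - 1)s - s + 2s = s·n for any even stride s.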

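2.2 Initializing the Transposed Convolution Layer

We already know that a transposed convolution layer can enlarge a feature map. In image processing we sometimes need to enlarge an image, that is, to upsample it. There are many upsampling methods; a common one is bilinear interpolation. Put simply, to obtain the pixel of the output image at coordinates (x, y), first map those coordinates to the coordinates (x', y') in the input image. Then find the four pixels closest to (x', y') in the input image, and compute the output pixel at (x, y) from those four pixels and their relative distances to (x', y'). The function below constructs the kernel with which a transposed convolution layer performs bilinear-interpolation upsampling.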

def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (np.arange(kernel_size).reshape(-1, 1),
          np.arange(kernel_size).reshape(1, -1))
    filt = (1 - np.abs(og[0] - center) / factor) * (1 - np.abs(og[1] - center) / factor)
    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return np.array(weight)
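
To make the kernel values concrete, the following sketch recomputes only the 2-D filter for kernel_size=4, using the same formula as bilinear_kernel. It is written with standard NumPy so that it runs without MXNet, and bilinear_filter is a hypothetical helper name.

```python
import numpy as np

def bilinear_filter(kernel_size):
    """2-D bilinear weights, following the same formula as bilinear_kernel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    pos = np.arange(kernel_size)
    w1d = 1 - np.abs(pos - center) / factor   # 1-D triangular weights
    return np.outer(w1d, w1d)                 # separable 2-D filter

filt = bilinear_filter(4)
print(filt[0])     # [0.0625 0.1875 0.1875 0.0625]
print(filt.sum())  # 4.0
```

The 1-D weights are 0.25, 0.75, 0.75, 0.25, and the 2-D filter is their outer product. Its entries sum to 4, which equals the square of the stride 2 used with this kernel size, so away from the border the weights contributing to each output pixel add up to 1.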

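Now we experiment with bilinear-interpolation upsampling implemented by a transposed convolution layer. Construct a transposed convolution layer that enlarges the height and width of the input by a factor of 2, and initialize its kernel with the bilinear_kernel function.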

conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)
conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))

Read the image X and store the upsampling result as Y. To display the image, we need to move the channel dimension back to the last axis.

img = image.imread('img/catdog.jpg')
X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0)/255
Y = conv_trans(X)
out_img = Y[0].transpose(1, 2, 0)
print('input image shape:', img.shape)
print('output image shape:', out_img.shape)
px.imshow(out_img.asnumpy(), width=img.shape[1]/2, height=img.shape[0]/2)

Initialize the transposed convolution layer with the bilinear kernel and the 1×1 convolution layer with Xavier initialization:

W = bilinear_kernel(num_classes, num_classes, 64)
net[-1].initialize(init.Constant(W))
net[-2].initialize(init=init.Xavier())

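3. Training

The loss function and accuracy computation here do not differ in substance from those used in image classification. Because we use the channels of the transposed convolution layer to predict pixel classes, the axis=1 (channel dimension) option is specified in SoftmaxCrossEntropyLoss. In addition, the model computes accuracy from whether the predicted class of each pixel is correct.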
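
The per-pixel accuracy described above can be illustrated on random data. This is a minimal sketch in standard NumPy, not the training code itself; pixel_accuracy is a hypothetical name, and the layout (batch, classes, H, W) is assumed to match the network output.

```python
import numpy as np

def pixel_accuracy(scores, labels):
    """scores: (batch, classes, H, W); labels: (batch, H, W) integer class ids."""
    pred = scores.argmax(axis=1)              # predicted class per pixel
    return float((pred == labels).sum()) / labels.size

rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 21, 4, 6))
labels = scores.argmax(axis=1)                # labels that agree with every prediction
labels[0, 0, 0] = (labels[0, 0, 0] + 1) % 21  # flip one pixel to a wrong class
print(pixel_accuracy(scores, labels))         # 47 of 48 pixels are correct
```

Taking argmax over axis 1 collapses the class channel, so the prediction has the same shape as the label map and the comparison is made pixel by pixel.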

def accuracy(y_hat, y): 
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.astype(y.dtype) == y
    return float(cmp.sum())

def train_batch(net, features, labels, loss, trainer, devices, split_f=d2l.split_batch):
    X_shards, y_shards = split_f(features, labels, devices)
    with autograd.record():
        pred_shards = [net(X_shard) for X_shard in X_shards]
        ls = [loss(pred_shard, y_shard) for pred_shard, y_shard
              in zip(pred_shards, y_shards)]
    for l in ls:
        l.backward()
    # ignore_stale_grad=True skips parameters whose gradients were not
    # refreshed by backward() instead of raising an error
    trainer.step(labels.shape[0], ignore_stale_grad=True)
    train_loss_sum = sum([float(l.sum()) for l in ls])
    train_acc_sum = sum(accuracy(pred_shard, y_shard)
                        for pred_shard, y_shard in zip(pred_shards, y_shards))
    return train_loss_sum, train_acc_sum

def train(net, train_iter, test_iter, loss, trainer, num_epochs,
               devices=d2l.try_all_gpus(), split_f=d2l.split_batch):
    num_batches, timer = len(train_iter), d2l.Timer()
    epochs_lst, loss_lst, train_acc_lst, test_acc_lst = [],[],[],[]
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(4)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = train_batch(
                net, features, labels, loss, trainer, devices, split_f)
            metric.add(l, acc, labels.shape[0], labels.size)
            timer.stop()
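            # Record metrics about five times per epoch; max() avoids a
            # division by zero when there are fewer than 5 batches
            if (i + 1) % max(1, num_batches // 5) == 0: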
                epochs_lst.append(epoch + i / num_batches)
                loss_lst.append(metric[0] / metric[2])
                train_acc_lst.append(metric[1] / metric[3])
        test_acc_lst.append(d2l.evaluate_accuracy_gpus(net, test_iter, split_f))
        print(f"[epoch {epoch+1}] train loss: {metric[0] / metric[2]:.3f} train acc: {metric[1] / metric[3]:.3f}",
              f" test acc: {test_acc_lst[-1]:.3f}")
    print(f'loss {metric[0] / metric[2]:.3f}, train acc '
          f'{metric[1] / metric[3]:.3f}, test acc {test_acc_lst[-1]:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
          f'{str(devices)}')
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=epochs_lst, y=loss_lst, name='train loss'))
    fig.add_trace(go.Scatter(x=epochs_lst, y=train_acc_lst, name='train acc'))
    fig.add_trace(go.Scatter(x=list(range(1,len(test_acc_lst)+1)), y=test_acc_lst, name='test acc'))
    fig.update_layout(width=800, height=480, xaxis_title='epoch', yaxis_range=[0, 1])
    fig.show()

Load the data. This uses a lot of memory, so a batch size of 16 is chosen:

# crop_size is (height, width), matching the 320×480 input used above
batch_size, crop_size = 16, (320, 480)
train_iter, test_iter = load_data_voc(batch_size, crop_size)

The images are fairly large and are all loaded into memory. If memory is insufficient, consider reducing the amount of data.

num_epochs, lr, wd, devices = 5, 0.1, 1e-3, [npx.gpu()]
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net.collect_params().reset_ctx(devices)
trainer = gluon.Trainer(net.collect_params(), 'sgd', { 'learning_rate': lr, 'wd': wd})
train(net, train_iter, test_iter, loss, trainer, num_epochs, devices)

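4. Prediction

During prediction, we need to standardize the input image in each channel and convert it into the four-dimensional input format required by the convolutional neural network.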

def predict(img):
    X = test_iter._dataset.normalize_image(img)
    X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
    pred = net(X.as_in_ctx(devices[0])).argmax(axis=1)
    return pred.reshape(pred.shape[1], pred.shape[2])

def label2image(pred):
    colormap = VOC_COLORMAP.as_in_ctx(devices[0])
    X = pred.astype('int32')
    return colormap[X, :]
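
The indexing in label2image relies on integer array indexing: every class index in the prediction selects one RGB row of the colormap. The sketch below shows the idea in standard NumPy with a hypothetical 3-class palette, not the full VOC colormap.

```python
import numpy as np

# Hypothetical 3-class palette (RGB): background, class 1, class 2
palette = np.array([[0, 0, 0], [128, 0, 0], [0, 128, 0]], dtype=np.uint8)

pred = np.array([[0, 1, 1],
                 [2, 2, 0]])       # (H, W) class indices
rgb = palette[pred]                # (H, W, 3): one colour per pixel
print(rgb.shape)                   # (2, 3, 3)
print(rgb[0, 1].tolist())          # [128, 0, 0], the colour of class 1
```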

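Fetch the test data and run prediction. Because the model uses a transposed convolution layer with a stride of 32, the height or width of that layer's output deviates from the size of the input image whenever the input height or width is not divisible by 32. To solve this problem, we can crop several rectangular regions from the image whose heights and widths are integer multiples of 32 and run the forward computation on the pixels of each region. Taken together, these regions must completely cover the input image. When a pixel is covered by several regions, the average of the transposed convolution layer's outputs from the forward computations on those regions can be used as the input to the softmax operation to predict the class. For simplicity, the code below reads a few test images and crops a single fixed 320×480 region from the top-left corner of each.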
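
The multi-crop averaging described above is not implemented in this post. One possible sketch is given below, in standard NumPy. All names here (covering_offsets, averaged_scores, fake_scores) are hypothetical, and fake_scores only stands in for the network so that the snippet runs on its own.

```python
import numpy as np

def covering_offsets(size, crop):
    """Start offsets of windows of length `crop` that together cover [0, size)."""
    if crop > size:
        raise ValueError("crop must not exceed the image size")
    offsets = list(range(0, size - crop + 1, crop))
    if offsets[-1] + crop < size:   # add a final window flush with the border
        offsets.append(size - crop)
    return offsets

def averaged_scores(score_fn, img, crop_h, crop_w, num_classes):
    """Average per-pixel class scores over (possibly overlapping) crops.

    score_fn maps an (h, w, 3) crop to an array of shape (num_classes, h, w).
    """
    H, W = img.shape[:2]
    total = np.zeros((num_classes, H, W))
    count = np.zeros((1, H, W))
    for top in covering_offsets(H, crop_h):
        for left in covering_offsets(W, crop_w):
            crop = img[top:top + crop_h, left:left + crop_w]
            total[:, top:top + crop_h, left:left + crop_w] += score_fn(crop)
            count[:, top:top + crop_h, left:left + crop_w] += 1
    return total / count            # every pixel is covered at least once

# Stand-in for the network: constant scores of the right shape (hypothetical)
def fake_scores(crop):
    return np.ones((21, crop.shape[0], crop.shape[1]))

img = np.zeros((375, 500, 3))
scores = averaged_scores(fake_scores, img, 320, 480, num_classes=21)
print(covering_offsets(375, 320), covering_offsets(500, 480))  # [0, 55] [0, 20]
print(scores.shape)  # (21, 375, 500)
```

The crop size 320×480 is a multiple of 32 in both dimensions, so each crop passes through the network without a size mismatch, and the averaged scores can then be passed to softmax or argmax.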

test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 480, 320)
    X = image.fixed_crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
Image(show_imgs(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=1.5))

The first row shows the original images, the second row the predictions, and the third row the labels.

5. References

https://d2l.ai/chapter_computer-vision/fcn.html

6. Code

github