A Complete Walkthrough of Reproducing CVPR 2018 Relation Net with PaddlePaddle

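[PaddlePaddle Developers Speak] Tong Xingyu holds a master's degree from Beihang University and works as a machine vision algorithm engineer.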

Relation Net is a CVPR 2018 paper. Paper link:

https://arxiv.org/pdf/1711.06025.pdf

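The authors observe that in visual recognition tasks, training a model requires a large number of labeled images and many iterations of parameter updates. Whenever a new object category is added, a lot of time must be spent on labeling, yet some emerging or rare categories may simply not have many labeled images, which hurts training. Humans, by contrast, need very little exposure to achieve few-shot learning (FSL) and zero-shot learning (ZSL).

For example, a child who has seen a zebra in a single picture or book, or has merely heard a zebra described as a "striped horse", can recognize zebras without difficulty. Driven by the poor classification performance of deep models when samples are scarce, and inspired by this human ability, few-shot learning has regained some attention.

Fine-tuning can be used when samples are relatively scarce, but with only one or a few samples per class, overfitting remains a problem even with data augmentation and regularization. Other few-shot learning methods at the time relied on fairly complex inference mechanisms, so the authors proposed Relation Net, a structurally simple model that can be trained end to end.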

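In FSL tasks, the data is generally divided into a training set, a support set, and a testing set. The support set and the testing set share the same labels; the training set contains none of those labels. If the support set contains K labeled examples for each of C distinct classes, the task is called C-way K-shot. During training, a sample set and a query set are drawn from the training set to play the roles of the support set and the testing set; the procedure is described in the training strategy section of this article.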

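Relation Network consists of an embedding model and a relation model. The core idea is as follows: the embedding model first extracts feature maps from the images in the support set and the testing set; the two feature maps are then concatenated along the channel dimension to form a new feature map, which is fed into the relation model to compute a relation score representing the similarity of the two images.

The figure below shows the network structure and data flow for a single query image in the 5-way 1-shot case. The 5 images in the sample set and the 1 image in the query set each pass through the embedding model, and their features are concatenated into 5 new feature maps. These are fed into the relation model to compute relation scores, yielding a one-hot-style vector in which the highest score indicates the predicted class.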

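The training loss is also simple: mean squared error (MSE). In the formula, r_{i,j} is the similarity between images i and j, and y_i and y_j are the ground-truth labels of the two images.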
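Written out, the objective described here takes roughly the following form. This is a reconstruction from the definitions above, not a copy of the original figure: the indicator target (1 when the two labels match, 0 otherwise) and the summation bounds m and n (the numbers of sample and query images) are notation assumed for illustration.

```latex
\min \; \sum_{i=1}^{m} \sum_{j=1}^{n} \left( r_{i,j} - \mathbf{1}(y_i = y_j) \right)^2
```

In other words, matched pairs are regressed toward a relation score of 1 and mismatched pairs toward 0.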

Download and install commands

## CPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/cpu paddlepaddle

## GPU version
pip install -f https://paddlepaddle.org.cn/pip/oschina/gpu paddlepaddle-gpu

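Reproducing Relation Network with PaddlePaddle

Below I share the technical details of the reproduction with fellow developers. For the Relation Network model definition, see: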

https://github.com/txyugood/paddle_RN_FSL/blob/master/RelationNet.py

1. Building the Relation Network

The model has two parts, an embedding model and a relation model, both built mainly from [Conv+BN+ReLU] blocks. We therefore first define a BaseNet class with a conv_bn_layer method:

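import math

import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr


class BaseNet:
    def conv_bn_layer(self,
                      input,
                      num_filters,
                      filter_size,
                      stride=1,
                      groups=1,
                      padding=0,
                      act=None,
                      name=None,
                      data_format='NCHW'):
        # n is the fan-out of the kernel, used for the normal initializer's std below
        n = filter_size * filter_size * num_filters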
        conv = fluid.layers.conv2d(
            input=input,
            num_filters=num_filters,
            filter_size=filter_size,
            stride=stride,
            padding=padding,
            groups=groups,
            act=None,
            param_attr=ParamAttr(name=name + "_weights", initializer=fluid.initializer.Normal(0,math.sqrt(2. / n))),
            bias_attr=ParamAttr(name=name + "_bias",
                                initializer=fluid.initializer.Constant(0.0)),
            name=name + '.conv2d.output.1',
            data_format=data_format)

        bn_name = "bn_" + name

        return fluid.layers.batch_norm(
            input=conv,
            act=act,
            momentum=1,  # moving mean and variance are never updated
            name=bn_name + '.output.1',
            param_attr=ParamAttr(name=bn_name + '_scale',
                                 initializer=fluid.initializer.Constant(1)),
            bias_attr=ParamAttr(bn_name + '_offset',
                                initializer=fluid.initializer.Constant(0)),
            moving_mean_name=bn_name + '_mean',
            moving_variance_name=bn_name + '_variance',
            data_layout=data_format)

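PaddlePaddle supports two ways of defining networks, static graph and dynamic graph; I use the static graph here. The code above defines conv_bn, one of the most common building blocks in convolutional networks. Note that the momentum of the batch_norm layer is set to 1, which means the global (moving) mean and variance are never recorded.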
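To see why momentum=1 has this effect, here is a minimal sketch. It assumes the moving-average update documented for fluid.layers.batch_norm, moving = moving * momentum + batch * (1 - momentum); the helper name update_running_stat and the toy numbers are made up for illustration.

```python
def update_running_stat(running, batch_stat, momentum):
    """One moving-average step of the assumed form used by batch norm."""
    return momentum * running + (1.0 - momentum) * batch_stat


batch_means = [1.0, 4.0]  # toy per-batch means

frozen = 0.0   # running mean with momentum=1
tracked = 0.0  # running mean with momentum=0.9, for comparison
for batch_mean in batch_means:
    frozen = update_running_stat(frozen, batch_mean, momentum=1.0)
    tracked = update_running_stat(tracked, batch_mean, momentum=0.9)

print("momentum=1.0:", frozen)
print("momentum=0.9:", tracked)
```

With momentum=1 the batch statistic is multiplied by zero, so the running value stays at its initial value no matter how many batches are seen.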

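The parameters are:

input: the tensor to be convolved;
num_filters: number of convolution kernels (the number of output channels);
filter_size: kernel size;
stride: convolution stride;
groups: number of groups for grouped convolution;
padding: padding size; 0 here means no padding is applied;
act: activation function applied after the BN layer; if None, no activation is used;
name: name of the object in the computation graph.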

Next we define the embedding model part of Relation Network.

class EmbeddingNet(BaseNet):
    def net(self,input):
        conv = self.conv_bn_layer(
            input=input,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='embed_conv1')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='embed_conv2')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=1,
            act='relu',
            name='embed_conv3')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=1,
            act='relu',
            name='embed_conv4')
        return conv

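The code above creates an EmbeddingNet class that inherits from BaseNet and therefore gets the conv_bn_layer method. Its net method takes input, the input image tensor, and builds the static graph of the network.

The input first passes through a [Conv+BN+ReLU] block (embed_conv1) to produce a feature map, followed by max pooling, which shrinks the feature map while keeping the important features; the later convolution and pooling layers play the same role. The feature map output by embed_conv4 has shape [-1, 64, 19, 19], with 4 dimensions. The first dimension is batch_size; since it is unknown when the static graph is built, -1 indicates that it can take any value. The second dimension is the number of channels, which is 64 after the embedding model. The third and fourth dimensions are the height and width of the feature map, here 19x19.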
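The 19x19 figure can be checked with the standard output-size formulas for convolution and pooling. The sketch below assumes the 84x84 input used later in the article and floor-mode pooling; the helper names are hypothetical.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, pool=2, stride=2):
    """Spatial output size of a pooling layer (floor mode)."""
    return (size - pool) // stride + 1


def embedding_spatial_size(size=84):
    size = conv_out(size, kernel=3, padding=0)  # embed_conv1: 84 -> 82
    size = pool_out(size)                       # max pool:    82 -> 41
    size = conv_out(size, kernel=3, padding=0)  # embed_conv2: 41 -> 39
    size = pool_out(size)                       # max pool:    39 -> 19
    size = conv_out(size, kernel=3, padding=1)  # embed_conv3: 19 -> 19
    size = conv_out(size, kernel=3, padding=1)  # embed_conv4: 19 -> 19
    return size


print(embedding_spatial_size(84))
```

The last two convolutions use padding=1, which is why they leave the spatial size unchanged.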

The relation model code is as follows:

class RelationNet(BaseNet):
    def net(self, input, hidden_size):
        conv = self.conv_bn_layer(
            input=input,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='rn_conv1')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        conv = self.conv_bn_layer(
            input=conv,
            num_filters=64,
            filter_size=3,
            padding=0,
            act='relu',
            name='rn_conv2')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=2,
            pool_stride=2,
            pool_type='max')
        fc = fluid.layers.fc(conv,size=hidden_size,act='relu',
                             param_attr=ParamAttr(name='fc1_weights',
                                                  initializer=fluid.initializer.Normal(0,0.01)),
                             bias_attr=ParamAttr(name='fc1_bias',
                                                 initializer=fluid.initializer.Constant(1)),
                             )
        fc = fluid.layers.fc(fc, size=1,act='sigmoid',
                             param_attr=ParamAttr(name='fc2_weights',
                                                  initializer=fluid.initializer.Normal(0,0.01)),
                             bias_attr=ParamAttr(name='fc2_bias',
                                                 initializer=fluid.initializer.Constant(1)),
                             )
        return fc

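The RelationNet class likewise inherits from BaseNet and its conv_bn_layer method. In the net method, the first layers use [Conv+BN+ReLU] blocks for feature extraction, as in the embedding model; two fully connected layers at the end then map the features to a scalar relation score representing the similarity of the two images.

During training, sample set and query set images each yield a feature map of shape [-1, 64, 19, 19] after the embedding model, and these must be concatenated before being fed into the relation model. This code is a little involved, so I explain it piece by piece.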

sample_image = fluid.layers.data('sample_image', shape=[3, 84, 84], dtype='float32')
query_image = fluid.layers.data('query_image', shape=[3, 84, 84], dtype='float32')

sample_query_image = fluid.layers.concat([sample_image, query_image], axis=0)
sample_query_feature = embed_model.net(sample_query_image)

This code concatenates the sample image and query image tensors along the batch_size dimension into sample_query_image, which is passed through the embedding model in one go to obtain sample_query_feature.

sample_batch_size = fluid.layers.shape(sample_image)[0]
query_batch_size = fluid.layers.shape(query_image)[0]

This code takes dimension 0 of each image tensor as its batch_size.

sample_feature = fluid.layers.slice(
    sample_query_feature,
    axes=[0],
    starts=[0],
    ends=[sample_batch_size])
if k_shot > 1:
    # few_shot
    sample_feature = fluid.layers.reshape(sample_feature, shape=[c_way, k_shot, 64, 19, 19])
    sample_feature = fluid.layers.reduce_sum(sample_feature, dim=1)
query_feature = fluid.layers.slice(
    sample_query_feature,
    axes=[0],
    starts=[sample_batch_size],
    ends=[sample_batch_size + query_batch_size])

Because the images were concatenated earlier, the extracted features must be sliced along dimension 0 of sample_query_feature (the batch_size dimension) to recover sample_feature and query_feature. When K-shot is greater than 1, sample_feature is reshaped and the K tensors are summed along dimension 1 (the K-shot dimension), which also removes that dimension, so sample_feature ends up with shape [C-way, 64, 19, 19]. The effective sample batch size at this point is C-way.

sample_feature_ext = fluid.layers.unsqueeze(sample_feature, axes=0)
query_shape = fluid.layers.concat(
    [query_batch_size, fluid.layers.assign(np.array([1, 1, 1, 1]).astype('int32'))])
sample_feature_ext = fluid.layers.expand(sample_feature_ext, query_shape)

Every query image feature must be paired with the features of all C classes, so unsqueeze is used to add a new leading dimension. Following the parameter requirements of the expand interface, a query_shape tensor is built so that sample_feature is replicated query_batch_size times, giving a tensor of shape [query_batch_size, sample_batch_size, 64, 19, 19].

query_feature_ext = fluid.layers.unsqueeze(query_feature, axes=0)
if k_shot > 1:
    sample_batch_size = sample_batch_size / float(k_shot)
sample_shape = fluid.layers.concat(
    [sample_batch_size, fluid.layers.assign(np.array([1, 1, 1, 1]).astype('int32'))])
query_feature_ext = fluid.layers.expand(query_feature_ext, sample_shape)

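As above, the query set features also get a new dimension, and are replicated sample_batch_size times. Note that when k-shot is greater than 1, the earlier reduce_sum has already collapsed the K samples of each class, so sample_batch_size must be divided by k-shot to get the new sample_batch_size. The replication yields a tensor of shape [sample_batch_size, query_batch_size, 64, 19, 19].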
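The K-shot reduction referred to here can be checked with a small NumPy analogue. This is only a sketch: the tiny feature shape stands in for 64x19x19, and it assumes the sample features are ordered class by class (all K samples of class 0, then class 1, and so on), which is what the reshape to [c_way, k_shot, ...] relies on.

```python
import numpy as np

c_way, k_shot = 2, 3
channels, height, width = 4, 2, 2  # tiny stand-in for 64, 19, 19

# C*K feature maps, assumed to be ordered class by class.
sample_feature = np.arange(
    c_way * k_shot * channels * height * width, dtype=np.float32
).reshape(c_way * k_shot, channels, height, width)

# Group the K samples of each class, then sum them into one feature per class.
grouped = sample_feature.reshape(c_way, k_shot, channels, height, width)
class_feature = grouped.sum(axis=1)

print(sample_feature.shape[0], "->", class_feature.shape[0])
```

After the sum, the leading dimension is C rather than C x K, which is why the batch size used for the later replication has to be divided by k_shot.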

query_feature_ext = fluid.layers.transpose(query_feature_ext, [1, 0, 2, 3, 4])
relation_pairs = fluid.layers.concat([sample_feature_ext, query_feature_ext], axis=2)
relation_pairs = fluid.layers.reshape(relation_pairs, shape=[-1, 128, 19, 19])

Finally, transpose makes query_feature_ext match the layout of sample_feature_ext, and the two features are concatenated and reshaped into relation_pairs, a tensor of shape [query_batch_size x sample_batch_size, 128, 19, 19].
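The whole pairing step can be mirrored in NumPy to make the shapes concrete. This is an illustrative analogue rather than the article's Paddle code: np.tile stands in for expand, and the small sizes and variable names are chosen only for readability.

```python
import numpy as np

num_classes, num_query = 3, 2      # C-way and query batch size
channels, height, width = 4, 2, 2  # tiny stand-in for 64, 19, 19

rng = np.random.default_rng(0)
sample_feature = rng.random((num_classes, channels, height, width))
query_feature = rng.random((num_query, channels, height, width))

# [1, C, ch, h, w] replicated to [Q, C, ch, h, w]
sample_ext = np.tile(sample_feature[np.newaxis], (num_query, 1, 1, 1, 1))

# [1, Q, ch, h, w] replicated to [C, Q, ch, h, w], then transposed to [Q, C, ch, h, w]
query_ext = np.tile(query_feature[np.newaxis], (num_classes, 1, 1, 1, 1))
query_ext = query_ext.transpose(1, 0, 2, 3, 4)

# Concatenate along the channel axis and flatten the (Q, C) pair grid.
pairs = np.concatenate([sample_ext, query_ext], axis=2)
pairs = pairs.reshape(-1, 2 * channels, height, width)

print(pairs.shape)
```

Row q * C + c of pairs holds the features of class c in its first half of channels and the features of query q in the second half, so reshaping the relation scores to [Q, C] later puts one query per row.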

relation = RN_model.net(relation_pairs, hidden_size=8)
relation = fluid.layers.reshape(relation, shape=[-1, c_way])

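The concatenated features are then fed into the relation model, which first produces a vector of length query_batch_size x sample_batch_size. It is reshaped into a [query_batch_size, sample_batch_size] tensor (sample_batch_size here equals C-way), so that each row of length sample_batch_size expresses the class of one query image in one-hot-like form.

The loss function code is as follows: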

one_hot_label = fluid.layers.one_hot(query_label, depth=c_way)
loss = fluid.layers.square_error_cost(relation, one_hot_label)
loss = fluid.layers.reduce_mean(loss)

The label of each query image, query_label, is first converted to one-hot form; relation, obtained above, has the same form. The MSE between relation and one_hot_label is then the loss.
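A small NumPy sketch of the same computation, with made-up relation scores for two query images, shows how the loss and the predicted class fall out of the [query_batch_size, C-way] score matrix. The numbers are purely illustrative.

```python
import numpy as np

c_way = 5
# Hypothetical relation scores: one row per query image, one column per class.
relation = np.array([[0.9, 0.1, 0.2, 0.1, 0.0],
                     [0.2, 0.1, 0.7, 0.3, 0.1]])
query_label = np.array([0, 3])

one_hot_label = np.eye(c_way)[query_label]           # [Q, C] one-hot targets
loss = np.mean((relation - one_hot_label) ** 2)      # MSE over all Q*C entries

predicted = relation.argmax(axis=1)                  # highest score wins
accuracy = np.mean(predicted == query_label)

print(loss, predicted, accuracy)
```

Here the first query is classified correctly and the second is not, so the second row contributes most of the loss.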

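2. Training strategy

In FSL tasks, one could train on the support set alone and still run inference on the testing set, but because the support set contains so few samples, the resulting classifier usually performs poorly. The training set is therefore generally used for training, which gives the classifier better performance. An effective way to do this is called episode based training.

Episode based training works as follows:

Training iterates over N episodes. In each episode, C classes are randomly chosen from the training set and K samples are drawn from each, forming a sample set. C and K correspond to the C-way K-shot setting of the support set, giving C x K samples in total.
From the remaining samples of those C classes, a few samples are randomly chosen as the query set, and training is performed.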

For 5-way 1-shot learning, the sample set batch size is 5 and the query set batch size is 15. For 5-way 5-shot learning, the sample set batch size is 25 (5 images per class) and the query set batch size is 10.
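The episode construction described above can be sketched in plain Python. This is a minimal illustration, not the repository's data loader: the function name sample_episode, the images_by_class dictionary, and the queries_per_class parameter are all assumptions, and the choice of 3 queries per class is just one way to arrive at a query batch of 15 for 5-way 1-shot.

```python
import random


def sample_episode(images_by_class, c_way, k_shot, queries_per_class, rng):
    """Draw one episode: a C-way K-shot sample set plus a disjoint query set."""
    classes = rng.sample(sorted(images_by_class), c_way)
    sample_set, query_set = [], []
    for label, cls in enumerate(classes):
        # Draw without replacement so sample and query images never overlap.
        picked = rng.sample(images_by_class[cls], k_shot + queries_per_class)
        sample_set += [(img, label) for img in picked[:k_shot]]
        query_set += [(img, label) for img in picked[k_shot:]]
    rng.shuffle(query_set)  # the sample set stays ordered class by class
    return sample_set, query_set


# Toy "training set": 10 classes with 20 image paths each (hypothetical names).
toy_data = {
    f"class_{c}": [f"class_{c}/img_{i}.jpg" for i in range(20)]
    for c in range(10)
}
sample_set, query_set = sample_episode(
    toy_data, c_way=5, k_shot=1, queries_per_class=3, rng=random.Random(0))
print(len(sample_set), len(query_set))
```

Labels are re-indexed from 0 to C-1 inside each episode, which matches the one-hot targets of depth c_way used in the loss.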

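The optimizer is Adam with a learning rate of 0.001. For data augmentation, AutoAugment is applied to both sample set and query set images when the data is read, to increase data diversity.

3. Verifying the reproduction

Validation uses only miniImageNet, the dataset from the paper's experiments. It has 100 classes with 600 images each. The 100 classes are split into training/validation/testing sets of 64, 16, and 20 classes respectively.

The paper reports the following accuracy on the miniImageNet testing set:

About 50.44% for 5-way 1-shot and 65.32% for 5-way 5-shot. The PaddlePaddle implementation of Relation Net, evaluated on the same miniImageNet testing set, gives: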

5-way 1-shot accuracy:

5-way 5-shot accuracy:

The results are consistent with the accuracy reported in the paper, so the reproduction is complete.

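Code:

https://github.com/txyugood/paddle_RN_FSL

If you run into problems while using it, you can join the official PaddlePaddle QQ group to discuss them: 1108045677.

To learn more about PaddlePaddle, see the resources below.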

>> Visit the PaddlePaddle website to learn more.

Official website:

https://www.paddlepaddle.org.cn

PaddlePaddle open-source framework repositories:

GitHub:

https://github.com/PaddlePaddle/Paddle

Gitee:

https://gitee.com/paddlepaddle/Paddle
