Visual Question Answering with Memory-Augmented Networks

Date: 2019-11-29


2018-05-15 20:15:03

Motivation:

Although VQA has made great progress, existing methods still perform poorly on fully general, free-form VQA. The authors attribute this to the following two points:

  1. Deep models trained with gradient-based methods learn to respond to the majority of the training data rather than to specific scarce exemplars.

  That is, a deep model trained by gradient descent responds well to the bulk of the training data, but not to specific rare samples.

  2. Existing VQA systems learn about the properties of objects from question-answer pairs, sometimes independently of the image.

  Selectively attending to certain regions of the image is therefore an important strategy.

We draw inspiration from recent memory-augmented neural networks and the co-attention mechanism. In this paper, we use memory networks to remember rare events, and propose memory-augmented networks with attention to rare answers for VQA.

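The Proposed Algorithm:

The pipeline of the proposed algorithm (illustrated in the paper's overview figure) is as follows. First, embeddings are used to extract features from the question and the image. Next, co-attention is learned over both. The two attention-weighted features are then combined and fed into a memory network, which finally selects the answer.

Image Embedding: features are extracted with a pre-trained model.

Question Embedding: language features are learned with a bidirectional LSTM.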

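Sequential Co-attention:

The co-attention mechanism here considers the image and text features jointly, so that each influences the other, and produces a shared attention. We take the element-wise product of the averaged visual features and the averaged question features to obtain a base vector m0.

We use a two-layer neural network to compute the soft attention. For visual attention, this network yields the soft attention weights and the attention-weighted visual feature vector.

Here Wv, Wm and Wh denote hidden states (that is, the learned weights of the attention network). Similarly, we compute the attention-weighted question feature vector.

We combine the weighted v and q to represent the input image-question pair. Figure 4 illustrates the whole co-attention process.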
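
The co-attention steps described above can be sketched in NumPy. This is only an illustrative sketch, not the paper's implementation: the function name `attend`, the feature sizes, the random weights, the use of the same base vector m0 to guide both attentions, and the concatenation used to combine v and q are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, base, W_f, W_m, w_h):
    """Two-layer soft attention over the rows of `features`, guided by `base`.

    features: (N, d) region or word features
    base:     (d,)   guidance vector (here the base vector m0)
    W_f, W_m: (k, d) first-layer weights (hypothetical shapes)
    w_h:      (k,)   second-layer weights producing one score per row
    """
    # First layer: fuse every feature vector with the guidance vector.
    hidden = np.tanh(features @ W_f.T) * np.tanh(W_m @ base)  # (N, k)
    # Second layer + softmax: one attention weight per row.
    weights = softmax(hidden @ w_h)                           # (N,)
    # Attention-weighted feature vector.
    attended = weights @ features                             # (d,)
    return weights, attended

rng = np.random.default_rng(0)
N, T, d, k = 5, 7, 8, 4           # regions, words, feature dim, hidden dim
V = rng.normal(size=(N, d))       # visual features, one row per region
Q = rng.normal(size=(T, d))       # question features, one row per word

# Base vector: element-wise product of the averaged features.
m0 = V.mean(axis=0) * Q.mean(axis=0)

params_v = (rng.normal(size=(k, d)), rng.normal(size=(k, d)), rng.normal(size=k))
params_q = (rng.normal(size=(k, d)), rng.normal(size=(k, d)), rng.normal(size=k))

alpha_v, v_att = attend(V, m0, *params_v)   # visual attention
alpha_q, q_att = attend(Q, m0, *params_q)   # question attention

# Combine the two attended vectors (concatenation is an assumption).
x = np.concatenate([v_att, q_att])
```

The vector `x` plays the role of the combined image-question representation that is later fed to the memory-augmented network.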

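Memory Augmented Network:

RNNs lack an external memory that can maintain a long-term memory of scarce training data. This paper uses a memory-augmented neural network for VQA.

Specifically, we adopt a standard LSTM as the controller. Its role is to receive the input data and then interact with the external memory module. The external memory, Mt, consists of a set of row vectors that serve as memory slots.

xt denotes the combination of the visual and textual features; yt is the corresponding encoded answer (a one-hot encoded answer vector). xt is fed into the LSTM controller, which produces the hidden state ht.

To read from the external memory, we use the hidden state ht as the query for Mt. First, we compute the cosine distance between the query vector ht and each row of the memory.

Then, we apply a softmax over the cosine distances to compute a read weight vector wr.

With these read weights, a newly retrieved memory rt is obtained as the weighted sum of the memory rows.

Finally, we combine the new memory vector rt with the controller hidden state ht to produce the output vector ot used for learning the classifier.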
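
The read operation can be illustrated with a short NumPy sketch. It is a sketch under assumptions rather than the authors' code: the function name `read_memory`, the tiny example memory, the `eps` term guarding against division by zero, and the use of concatenation to combine ht and rt are all introduced here for illustration. The "cosine distance" of the text is computed as the usual cosine similarity.

```python
import numpy as np

def read_memory(h, M, eps=1e-8):
    """Read from memory M (one slot per row) using the query vector h."""
    # Cosine similarity between the query and every memory row.
    sims = (M @ h) / (np.linalg.norm(M, axis=1) * np.linalg.norm(h) + eps)
    # Softmax over the similarities gives the read weights w_r.
    e = np.exp(sims - sims.max())
    w_r = e / e.sum()
    # Retrieved memory r: read-weighted sum of the memory rows.
    r = w_r @ M
    return w_r, r

# Toy memory with three slots of dimension two.
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
h = np.array([2.0, 0.0])          # controller hidden state used as the query

w_r, r = read_memory(h, M)
o = np.concatenate([h, r])        # output vector (concatenation is an assumption)
```

In this toy example the query points in the same direction as the first slot, so that slot receives the largest read weight.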

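We use the usage weights wu to control writing to the memory. The usage weights are updated by decaying their previous state.

To compute the write weights, we introduce a truncation scheme that updates the least-used positions. Here, m(v, n) denotes the n-th smallest element of a vector v. We use a learnable sigmoid gate parameter to compute a convex combination of the previous read weights and the usage weights.

A larger n results in maintaining a longer-term memory of scarce training data. Compared with the internal memory cell of an LSTM, both of these parameters can be used to adjust the rate of writing to the external memory, which gives us more freedom in regulating how the model is updated. The hidden state ht output by Eq. (12) is written to the memory according to the write weights.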
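
A minimal sketch of the write step follows. Since the equations are not reproduced in this post, the exact forms below follow the least-recently-used access scheme commonly used in memory-augmented neural networks and should be read as assumptions: the decay factor `gamma`, the additive usage update, the additive memory write, and all function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def least_used(w_u, n):
    """Indicator of the least-used slots: 1 where w_u <= m(w_u, n), else 0."""
    threshold = np.sort(w_u)[n - 1]       # m(v, n): the n-th smallest element
    return (w_u <= threshold).astype(float)

def write_step(M, h, w_u_prev, w_r_prev, w_r, alpha, gamma=0.95, n=1):
    """One write to memory M using controller state h."""
    gate = sigmoid(alpha)                 # learnable sigmoid gate
    # Convex combination of previous read weights and least-used positions.
    w_w = gate * w_r_prev + (1.0 - gate) * least_used(w_u_prev, n)
    # Usage weights: decay the previous state, then add current reads and writes.
    w_u = gamma * w_u_prev + w_r + w_w
    # Write h into every slot in proportion to its write weight.
    M_new = M + np.outer(w_w, h)
    return M_new, w_u, w_w

M = np.zeros((3, 2))                      # three empty slots of dimension two
h = np.array([1.0, 2.0])
w_u_prev = np.array([0.5, 0.1, 0.9])      # slot 1 is the least used
w_r_prev = np.array([1.0, 0.0, 0.0])
w_r = np.array([0.0, 0.0, 1.0])

M_new, w_u, w_w = write_step(M, h, w_u_prev, w_r_prev, w_r, alpha=0.0)
```

With `alpha=0.0` the gate is 0.5, so half of the write goes to the previously read slot and half to the least-used slot, while the third slot is left untouched. Increasing `n` marks more slots as least used, which spreads each write over more positions.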

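Answer Reasoning:

Given the hidden state ht and the reading memory rt retrieved from the external memory, we combine the two as the representation of the current question and image, feed it into a classification network, and obtain a distribution over answers.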
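
As a rough illustration of this last step, the sketch below concatenates ht and rt and applies a single linear layer followed by a softmax. The single-layer classifier, the sizes, and the random weights are assumptions for the example; the post does not specify the classifier's architecture.

```python
import numpy as np

def answer_distribution(h, r, W, b):
    """Map the combined representation [h, r] to a distribution over answers."""
    o = np.concatenate([h, r])            # combined question-image representation
    logits = W @ o + b                    # one linear layer (an assumption)
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_dim, memory_dim, num_answers = 4, 4, 10   # hypothetical sizes
h = rng.normal(size=hidden_dim)                  # controller hidden state
r = rng.normal(size=memory_dim)                  # retrieved memory
W = rng.normal(size=(num_answers, hidden_dim + memory_dim))
b = np.zeros(num_answers)

p = answer_distribution(h, r, W, b)
predicted = int(np.argmax(p))             # index of the most probable answer
```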

--- Done!