Learning Conditioned Graph Structures for Interpretable Visual Question Answering

时间 2019-12-05

标签 learning conditioned graph structures interpretable visual question answering 繁體版

原文原文链接

Learning Conditioned Graph Structures for Interpretable Visual Question Answeringhtml

2019-05-29 00:29:43node

Paper：http://papers.nips.cc/paper/8054-learning-conditioned-graph-structures-for-interpretable-visual-question-answering.pdf git

Code：https://github.com/aimbrain/vqa-project github

1. Background and Motivation: 算法

最近的计算机视觉工做开始探索图像的高层表达（higher level representation of images），特别是利用 object detector 以及 graph-based structures 以进行更好的语义和空间图像理解。将图像表达为 graphs，能够显示的进行模型交互，经过 graph iterms（objects in the image）无缝进行信息的迁移。这种基于 graph 的技术已经在最近的 VQA 任务上应用上。这种方法的一个缺点是：the input graph structures are heavily engineered, image specific rather than question specific, and not easily transferable from abstract scenes to real images. 此外，very few approaches provide means to interpret the model's behavior, an essential aspect that is often lacking in deep learning models. 网络

本文贡献：现有的 VQA 方法并无尝试建模不一样物体之间的 semantic connections, 因此，本文尝试经过引入一个先验，来解决该问题。app

2. The Proposed Method : dom

本文所提出的方法如图 2 所示。咱们开发了一种 DNN 能够将 spatial，image and textual features 以一种新颖的方式进行组合，进行问题的回答。咱们的模型首先利用 Word embedding 和 RNN 模型来计算获得问题的表达，以及一系列的物体描述（object descriptor），包括：bounding box corrdinates, image features vectors。咱们的 graph learning module 而后基于给定的问题，去学习一个 adjacency matrix。这个邻接矩阵确保下一层 -the spatial graph convolutions- 可使其不但聚焦于 the objects，并且能够聚焦在于问题相关的物体关系上。咱们卷积图特征，随后通过 max-pooling，与 question embedding 组合在一块儿，利用一个简单的元素级相乘，从而能够预测一个答案。ide

2.1 Computing model inputs : 函数

第一个阶段是：计算输入图像和问题的 embedding。咱们将给定的图像，用物体检测算法转换为 K 个 visual features。物体检测模型是咱们方法的必要步骤，由于每个 bounding box 都将在 question specific graph representations 上做为一个节点。对于每个提出的 bounding box 都将会有一个 embedding，即：对应区域的卷积特征。利用这种 object features 在其余工做中已经被证实能够提高 VQA 的效果，由于这使得模型能够聚焦于 object-level feature，而不只仅是 CNN feature。对于每个 question，咱们利用一个 pre-trained Word embeddings 将问题转换为向量。而后用动态 GRU 模型，来进一步将句子编码为向量 q。

2.2 Graph Learner :

本小结是全文最重要的贡献部分了：图学习模块（the graph learner module）。该模块产生了一个基于 question 的图像的 graphical representation。该模块是比较 general 的，而且容易实现，学习到的复杂 feature 之间的关系，是可解释的且问题相关的。学习到的 graph structure 经过定义的 node neighborhoods 来驱动空间图卷积，容许 unary 和 pairwise attention 能够被很天然的的学习到，由于 adjacency matrix 包含 self loops。

咱们想构建一个无向图 g = {V, E, A}，E 是边的集合，A 是对应的邻接矩阵。每个顶点 v 对应的是检测到的物体（包围盒的坐标以及特征向量）。咱们想学习一个邻接矩阵 A，使得每个 edge $(i, j, A_{ij})$ 是基于问题特征 q 。直观的来说，咱们须要建模 feature vectors 之间的类似性，及其余们与 given question 之间的相关性。因此，咱们将 question embedding q 与每个 visual features $v_n$ 进行组合，获得 $[v_n][q]$。咱们而后计算该联合映射：

其中，F 是非线性函数。经过将该 joint embedding $e_n$ 进行组合，获得矩阵 E，这样就能够定义一个邻接矩阵，带有 self loops，$A = EE^T$，其中，$A_{i,j} = e^T_i e_j$。

这种定义不会在图的稀疏性上有额外的约束，因此，能够产生一个全链接的邻接矩阵。这种学习到的 graph structure 将会被当作是接下来图卷积层的 backbone，目标是：learn a representation of object features that is conditioned on the most relevant, question-specific neighbours。这须要该 sparse graph structure 聚焦于图像中最相关的部分。为了学习一个每个节点的稀疏邻接系统，咱们采用 ranking 的方法：

其中，topm 返回的是：输入向量中前 m 个最大值的索引，$a_i$ 表示的是邻接矩阵的第 i 行。换句话说，给定节点的邻接系统将会对应与该节点有最强连接的 nodes。

2.3 Spatial Graph Convolutions:

给定一个特定问题的 graph structure，咱们探索一种图卷积的方法去学习一种新的目标表达（that are informed by a neighborhood system tailored to answer the given question）。图的顶点 V （BBox 的位置及其 feature）能够经过他们的在图像中的位置进行刻画，能够在空间上对他们的联系进行建模。此外，许多 VQA 问题须要该模型有关于：the model has an awareness of the orientation and relative position of features in an image, 可是许多以前的方法都忽略了这一点。

咱们使用一种 graph CNN 方法，直接在 graph domain 上依赖于空间关系直接进行操做。关键的是，他们的方法经过 pairwise pseudo-coordinate function u(i, j) 来捕获 spatial information，对于每个 vertex i，以 i 为坐标系统，u(i, j) 是那个系统中的 vertex j 的坐标。咱们的 pseudo-coordinate function u(i, j) 返回了一个极坐标向量 $\ro, \theta$，表述了顶点 i 和 j 的 BBox 的空间相对位置。咱们将 Cartesian 和 Polar 坐标做为 Gaussian kernels 的输入，而且观察看：polar coordinates 效果更好。咱们认为这是由于极坐标分为：方向和距离，提供了两个角度来表示空间关系。

一个特别的步骤和挑战是：the definition of a patch operator describing the influence of each neighbouring node that is robust to irregular neighbourhood structures. Monti et al. 提出使用 K个可学习均值和协方差的 Gaussian kernels，均值能够理解为 pseudo coordinate 的方向和距离。对每个 k，咱们获得一个 kernel weight $w_k(u)$，the patch operator is defined at kernel k for node i as:

其中，$f_n(i)$ 和 $N(i)$ 表明如公式 2 所示的顶点 i 的近邻，咱们将每个 patch operator 的输出看作是：近邻特征的加权求和（a weighted sum of the neighbouring features）, 高斯核的集合描述了每个近邻在卷积操做输出时的影响。

咱们调整 patch operator 使其包含一个额外的基于产生的 graph edges 的加权因子：

$\alpha_{ij} = s(a_i)_j$，$s(.)_j$ 是一个 scaling function 的第 j 个元素 (此处定义为 a softmax of the selected adjacency matrix elements)。这种更加 general 的形式意味着不一样顶点之间的信息传递的强度能够经过除了空间位置以外的信息进行加权。在本文的模型中，能够理解为：根据回答问题的状况，咱们应该对两个 nodes 之间的关系关注多少。因此，该网络学习基于问题以 pairwise 的方式去关注 visual features。

最终，咱们定义在顶点 i 的卷积操做的输出为，K 个 kernels 的组合：

其中，每个 $G_k$ 是一个可学习权重矩阵（即：卷积核）。这能够输出一个卷积后的图表示 H。

2.4 Prediction Layers :

经过上述步骤，咱们能够获得考虑结构化建模的特征输出，而后就能够进行答案的预测了。本文的作法是经过 max-pooling layer 获得 graph $h_{max}$ 的 global vector representation。做者将问题特征 q 和 image $h_{max}$ 经过一个元素级相乘进行融合。而后用两层 MLP 来计算分类得分。

2.5 Loss Function：

损失函数通常是采用交叉熵损失函数进行。做者采用和 [Tips and tricks for visual question answering: Learnings from the 2017 challenge] 相似的方法来处理。

3. Experiments: