[Paper Reading] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Date: 2019-12-09



Paper link: https://arxiv.org/pdf/1502.03044.pdf

Code links: https://github.com/kelvinxu/arctic-captions & https://github.com/yunjey/show-attend-and-tell & https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow

Main contributions

In this paper, the authors introduce the attention mechanism into neural image captioning and propose two variants: 'Soft' Deterministic Attention and 'Hard' Stochastic Attention. The figure below shows the overall framework of the "Show, Attend and Tell" model.

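The key question for the attention mechanism is how to compute the context vector z_t from the image feature vectors a_i. For each location i, the attention mechanism produces a score e_ti, which a softmax over all locations normalizes into a weight α_ti. In Hard Attention, α_ti is the probability that region vector a_i is selected at time t as the decoder's input; exactly one region is selected, so an indicator variable s_t,i is introduced that equals 1 when region i is selected and 0 otherwise. In Soft Attention, α_ti is the proportion that region vector a_i contributes to the information fed into the decoder at time t. (See also: "Attention Mechanism Paper Reading: Soft and Hard Attention" and "Multimodal: Paper Notes on the Image Captioning Task (2), Introducing the Attention Mechanism".)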
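The difference between the two mechanisms can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the scores `e` are made-up numbers (in the paper they come from a learned attention model), and the helper names `softmax`, `soft_context`, and `hard_context` are hypothetical.

```python
import math
import random

def softmax(scores):
    """Normalize raw scores e_ti into weights alpha_ti that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def soft_context(annotations, alphas):
    """Soft attention: z_t is the alpha-weighted sum of all region vectors a_i."""
    dim = len(annotations[0])
    return [sum(w * a[d] for w, a in zip(alphas, annotations)) for d in range(dim)]

def hard_context(annotations, alphas, rng):
    """Hard attention: sample one region i with probability alpha_ti, so z_t = a_i."""
    n = len(annotations)
    i = rng.choices(range(n), weights=alphas, k=1)[0]
    s = [1 if j == i else 0 for j in range(n)]  # one-hot indicator s_t,i
    return annotations[i], s

# Toy example: L = 3 regions, D = 2 feature dimensions (all values are made up).
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e = [0.0, 0.0, math.log(2.0)]      # hypothetical raw scores e_ti
alpha = softmax(e)                 # approximately [0.25, 0.25, 0.5]
z_soft = soft_context(a, alpha)    # approximately [0.75, 0.75]
z_hard, s_t = hard_context(a, alpha, random.Random(0))
```

Soft attention blends every region into z_t, while hard attention returns exactly one region vector together with its one-hot indicator, so `z_hard` depends on the random draw.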

Experimental details

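The authors extract image features with a VGGNet pretrained on ImageNet and not fine-tuned. The feature map produced by block5_conv4 (Conv2D), of shape 14×14×512, is reshaped to 196×512 (L×D with L=196 and D=512, i.e. 196 image regions, each with a 512-dimensional feature vector) to give the image region vectors a_i.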

To create the annotations a_i used by our decoder, we used the Oxford VGGnet pretrained on ImageNet without finetuning.

In our experiments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196×512 (i.e L × D) encoding.
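The flattening step can be sketched in plain Python as follows. The feature map here is a placeholder filled with index values rather than real VGG activations, `flatten_feature_map` is a hypothetical helper, and the row-major ordering of regions is an assumption, since the quoted text does not state an order.

```python
def flatten_feature_map(feature_map):
    """Flatten an H x W x D feature map into L x D, with L = H * W (row-major)."""
    return [vector for row in feature_map for vector in row]

H, W, D = 14, 14, 512

# Placeholder feature map: the vector at (r, c) is filled with its row-major
# index r * W + c so the ordering after flattening is easy to check.
feature_map = [[[float(r * W + c)] * D for c in range(W)] for r in range(H)]

annotations = flatten_feature_map(feature_map)  # the 196 region vectors a_i
print(len(annotations), len(annotations[0]))    # 196 512
```

Under this ordering, region i corresponds to spatial position (i // 14, i % 14) in the original feature map.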

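The authors state that the decoder LSTM's initial cell state (init_c) and initial hidden state (init_h) are predicted from the mean of the image annotation vectors by two separate multi-layer perceptrons (Multi-Layer Perceptron, MLP).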

The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (init,c and init,h).
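A rough sketch of this initialization is given below. The quoted text does not specify the depth or activation of the MLPs, so a single randomly initialized tanh layer stands in for each one here. The helpers `mean_annotation` and `make_layer` and the toy dimensions are all assumptions for illustration.

```python
import math
import random

def mean_annotation(annotations):
    """Average the L region vectors a_i into one D-dimensional vector."""
    count = len(annotations)
    dim = len(annotations[0])
    return [sum(a[d] for a in annotations) / count for d in range(dim)]

def make_layer(in_dim, out_dim, rng):
    """Hypothetical stand-in for a trained MLP: one tanh layer with random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def forward(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, bias)]

    return forward

rng = random.Random(0)
D, n = 4, 3  # toy sizes; the post uses D = 512, and n is the LSTM state size

# Two separate networks, one per initial state, as in the quoted description.
f_init_c = make_layer(D, n, rng)
f_init_h = make_layer(D, n, rng)

annotations = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # L = 2 made-up regions
a_mean = mean_annotation(annotations)  # [2.0, 3.0, 4.0, 5.0]
init_c = f_init_c(a_mean)              # initial memory (cell) state
init_h = f_init_h(a_mean)              # initial hidden state
```

Because the two networks have independent weights, the same averaged vector yields different values for `init_c` and `init_h`.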