Paper: https://arxiv.org/pdf/1411.4555.pdf
Code: https://github.com/karpathy/neuraltalk & https://github.com/karpathy/neuraltalk2 & https://github.com/zsdonghao/Image-Captioning
Main Contributions
In this paper, the authors borrow the encoder-decoder model from neural machine translation and bring it into neural image captioning, proposing an end-to-end model for the image captioning problem. Two figures from the paper are shown below: the first gives an overview of the NIC model, and the second shows the details of the network. NIC uses a convolutional neural network (CNN) as the encoder and a long short-term memory (LSTM) network as the decoder.
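The overall structure can be summarized with a minimal sketch, assuming PyTorch as the framework; the class name NICModel and the dimension arguments are illustrative placeholders, not code from the paper or the linked repositories.

```python
# Minimal sketch of an NIC-style encoder-decoder (illustrative, not the official code).
import torch
import torch.nn as nn

class NICModel(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        # Project CNN image features into the word-embedding space so the image
        # can be fed to the LSTM like a "first word".
        self.img_proj = nn.Linear(feat_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (B, feat_dim) from a pre-trained CNN encoder
        # captions:  (B, T) token ids of the target sentence
        img_token = self.img_proj(img_feats).unsqueeze(1)    # (B, 1, E)
        word_tokens = self.embed(captions)                   # (B, T, E)
        inputs = torch.cat([img_token, word_tokens], dim=1)  # image first, then words
        hidden, _ = self.lstm(inputs)                        # (B, T+1, H)
        return self.out(hidden)                              # logits over the vocabulary
```

Note that, as described in the paper, the image embedding is fed to the LSTM only once, at the first time step, rather than at every step.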
Experimental Details
Hence, it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences.
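A rough illustration of this step, assuming torchvision is available; ResNet-50 stands in here for the GoogLeNet-style network used in the paper, and the variable names are hypothetical.

```python
# Sketch: turn a classification-pretrained CNN into an image "encoder" by
# keeping everything up to (and including) the last hidden layer.
import torch
import torchvision.models as models

cnn = models.resnet50(pretrained=True)                     # pre-trained for ImageNet classification
encoder = torch.nn.Sequential(*list(cnn.children())[:-1])  # drop the final classification layer
encoder.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)                   # dummy batch of preprocessed images
    img_feats = encoder(images).flatten(1)                 # (4, 2048) feature vectors for the decoder
```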
An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target sentence.
It is a neural net which is fully trainable using stochastic gradient descent.
The model is trained to maximize the likelihood of the target description sentence given the training image.
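Putting the last two quotes together: training maximizes sum_t log p(S_t | I, S_0, ..., S_{t-1}) with stochastic gradient descent, which is equivalent to minimizing token-level cross-entropy. Below is a minimal sketch of one training step, reusing the illustrative NICModel and feature shapes from the earlier sketch; all hyperparameters here are placeholders.

```python
# Sketch of one SGD step: maximize the caption likelihood by minimizing
# token-level cross-entropy (NICModel, shapes, and hyperparameters are the
# illustrative assumptions introduced above, not code from the official repos).
import torch
import torch.nn as nn

vocab_size = 10000
model = NICModel(feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

img_feats = torch.randn(4, 2048)                  # CNN features for a batch of 4 images
captions = torch.randint(0, vocab_size, (4, 12))  # token ids: <start> ... <end>

logits = model(img_feats, captions[:, :-1])       # teacher forcing: feed image, then words
# Shifted alignment: the image step predicts the first token, each word predicts the next one.
loss = criterion(logits.reshape(-1, vocab_size), captions.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```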
Copyright notice: this is the blogger's original article. Reposting is welcome, but please credit the author and link to the original.