Unpaired/Partially/Unsupervised Image Captioning

This post covers the following three papers:

Unpaired Image Captioning by Language Pivoting (ECCV 2018)

Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)

Unsupervised Image Captioning (CVPR 2019)

 

1. Unpaired Image Captioning by Language Pivoting (ECCV 2018)

Abstract

The authors propose a language-pivoting approach to the unpaired image captioning problem, i.e., image captioning without paired image-caption data.

Our method can effectively capture the characteristics of an image captioner from the pivot language (Chinese) and align it to the target language (English) using another pivot-target (Chinese-English) sentence parallel corpus.

Introduction

Because encoder-decoder architectures need large numbers of image-caption pairs for training, and such large-scale labeled data is usually hard to obtain, researchers have started to explore unpaired data, or semi-supervised methods that exploit paired labeled data from other domains, to reach an unsupervised setting. In this paper, the authors use the source language, Chinese, as a pivot language to bridge the gap between the input image and the target-language (English) caption. This requires two paired datasets, image-Chinese captions and Chinese-English sentences, so that English captions can be generated without any paired image-English caption data.

The authors note that this idea comes from machine translation, where pivot-based methods usually work in two steps: first translate the source language into the pivot language, then translate the pivot language into the target language. However, image captioning differs from machine translation in several ways: 1) the sentence styles and word distributions of the image-Chinese captions and the Chinese-English corpus differ considerably; 2) errors in the source-to-pivot step propagate to the pivot-to-target step.

The authors use AIC-ICC and AIC-MT as the training datasets, and two datasets (MSCOCO and Flickr30K) as the validation datasets.

Notation: i: source image; x: pivot-language sentence; y: target-language sentence; y_hat: ground-truth captions in the target language (here y_hat denotes captions randomly sampled from the MSCOCO training set, used to train the autoencoder).
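With this notation, the pivoting idea can be written as a two-stage factorization. The equations below are a generic sketch of pivot-based generation, not the paper's exact joint objective (which additionally couples the two models to handle the style and vocabulary mismatch):

```latex
% Two-stage pivot decomposition: image i -> Chinese sentence x -> English sentence y.
% \theta_{i \to x}: image-to-pivot captioner; \theta_{x \to y}: pivot-to-target NMT model.
p(y \mid i) = \sum_{x} p(y \mid x; \theta_{x \to y}) \, p(x \mid i; \theta_{i \to x})
            \approx p(y \mid \hat{x}; \theta_{x \to y}),
\qquad \hat{x} = \arg\max_{x} p(x \mid i; \theta_{i \to x})
```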

 

The idea of this paper is easy to follow; the hard part is connecting Image-to-Pivot and Pivot-to-Target, i.e., overcoming the mismatch in language style and word distribution between the two datasets.

2. Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)

The authors point out that existing captioning models tend to copy sentences or phrases from the training set; the generated descriptions are often generic and templated, lacking the ability to produce discriminative captions.

GAN-based captioning models can improve sentence diversity, but tend to perform worse on the standard evaluation metrics.

The authors propose coupling the Captioning Module with a Self-retrieval Module in order to generate discriminative captions.
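A minimal sketch of the self-retrieval reward, assuming the generated caption and the mini-batch images have already been embedded into a shared space; the function and variable names (self_retrieval_reward, caption_emb, image_embs) are illustrative placeholders, not the paper's code:

```python
import torch
import torch.nn.functional as F

def self_retrieval_reward(caption_emb, image_embs, img_idx):
    """Text-to-image self-retrieval reward (illustrative sketch).

    caption_emb: (D,) embedding of a caption sampled from the captioning module
    image_embs:  (N, D) embeddings of the N images in the mini-batch
    img_idx:     index of the image the caption was generated for
    """
    # Cosine similarity between the caption and every image in the batch.
    sims = F.cosine_similarity(caption_emb.unsqueeze(0), image_embs, dim=1)  # (N,)
    # Reward: probability that the caption retrieves its own image.
    return F.softmax(sims, dim=0)[img_idx]

# The sampled words are discrete, so the reward is fed back with a
# policy-gradient (REINFORCE) update on the captioning module, roughly:
#   loss = -(reward - baseline) * sum_t log p(w_t | w_<t, image)
```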

 

3. Unsupervised Image Captioning (CVPR 2019)

This is a genuinely unsupervised approach to image captioning: it does not rely on any labeled image-sentence pairs.

Compared with unsupervised machine translation, unsupervised image captioning is more challenging because images and text are two different modalities with a large gap between them.

The model consists of an image encoder, a sentence generator, and a sentence discriminator.

Encoder:

Any standard image encoder will do; the authors use Inception-V4.

Generator:

An LSTM-based decoder.

Discriminator:

Also implemented with an LSTM; it is used to distinguish whether a partial sentence is a real sentence from the corpus or was generated by the model.
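A minimal PyTorch sketch of the generator and discriminator described above; the class names, hidden sizes, and single-layer LSTMs are assumptions for illustration rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SentenceGenerator(nn.Module):
    """LSTM decoder: image feature -> word sequence (teacher-forced sketch)."""
    def __init__(self, feat_dim, vocab_size, emb_dim=512, hid_dim=512):
        super().__init__()
        self.init_proj = nn.Linear(feat_dim, hid_dim)  # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feat, captions):
        h0 = torch.tanh(self.init_proj(img_feat)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                               # (B, T, E)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                                  # (B, T, V) word logits

class SentenceDiscriminator(nn.Module):
    """LSTM that scores each partial sentence as real (from the corpus) or generated."""
    def __init__(self, vocab_size, emb_dim=512, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, captions):
        hidden, _ = self.lstm(self.embed(captions))              # (B, T, H)
        return torch.sigmoid(self.score(hidden)).squeeze(-1)     # (B, T) per-step real/fake scores

# The image encoder is a pretrained CNN (Inception-V4 in the paper); only its
# pooled feature vector img_feat is consumed by the generator above.
```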

 

Training:

Since there is no paired image-sentence data, the model cannot be trained in a supervised way; the authors therefore design three objectives to achieve unsupervised image captioning:

Adversarial Caption Generation: the discriminator provides an adversarial reward that pushes the generator to produce sentences indistinguishable from real sentences in the corpus.

Visual Concept Distillation: the generator is rewarded when a generated word matches one of the visual concepts detected in the image, which grounds the sentence in the image content.

Bi-directional Image-Sentence Reconstruction:

Image Reconstruction: reconstruct the image features instead of the full image

Sentence Reconstruction: the discriminator can encode one sentence and project it into the common latent space, which can be viewed as one image representation related to the given sentence. The generator can reconstruct the sentence based on the obtained representation.
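The two reconstruction terms can be sketched as follows, under the simplifying assumption that a fully connected layer projects the discriminator's last hidden state h^D_T into the common latent space (the notation is illustrative, not copied from the paper):

```latex
% Image reconstruction: the sentence representation, projected back to the
% image-feature space, should match the original image feature x_{im}.
\mathcal{L}_{im} = \left\| x_{im} - \mathrm{FC}\!\left(h^{D}_{T}\right) \right\|_2^2

% Sentence reconstruction: a corpus sentence s is encoded by the discriminator,
% projected into the latent space, and decoded back by the generator.
\mathcal{L}_{sen} = -\sum_{t=1}^{T} \log p\!\left(s_t \mid s_{<t}, \mathrm{FC}\!\left(h^{D}_{T}\right)\right)
```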

Integration:

Generator: updated with policy gradients whose reward combines the adversarial, concept, and image-reconstruction terms, together with the sentence reconstruction (cross-entropy) loss.

Discriminator: trained with the standard adversarial loss on real versus generated sentences, plus the reconstruction terms in which it serves as the sentence encoder.

 

Initialization

It is challenging to adequately train the image captioning model from scratch with the given unpaired data, so an initialization pipeline is needed to pre-train the generator and the discriminator.

For the generator:

First, build a concept dictionary consisting of the object classes in the OpenImages dataset.

Second, train a concept-to-sentence (con2sen) model using the sentence corpus only.

Third, detect the visual concepts in each image using an existing visual concept detector, then use the detected concepts and the concept-to-sentence model to generate a pseudo caption for each image.

Fourth, train the generator on the pseudo image-caption pairs (a sketch of this pipeline follows the list).
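A minimal sketch of steps three and four, assuming the concept detector and the con2sen model are available as callables; detect_concepts, con2sen, and train_step are illustrative placeholders, not the paper's code:

```python
def build_pseudo_pairs(images, detect_concepts, con2sen):
    """Generate pseudo image-caption pairs for pre-training the generator.

    detect_concepts(image) -> list of concept words found in the image
    con2sen(concepts)      -> a sentence generated from those concepts
    """
    pseudo_pairs = []
    for image in images:
        concepts = detect_concepts(image)      # step 3: visual concept detection
        if not concepts:
            continue                           # skip images with no detected concepts
        pseudo_caption = con2sen(concepts)     # step 3: concepts -> pseudo caption
        pseudo_pairs.append((image, pseudo_caption))
    return pseudo_pairs

def pretrain_generator(generator, pseudo_pairs, train_step):
    """Step 4: standard supervised training on the pseudo pairs."""
    for image, caption in pseudo_pairs:
        train_step(generator, image, caption)  # e.g., teacher-forced cross-entropy update
```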

 

The discriminator is initialized by training an adversarial sentence generation model on the sentence corpus.
