[Paper Notes] ELMo: Deep contextualized word representations

Abstract

This paper introduces a new type of deep contextualized word representation that models:

  1. complex characteristics of word use (e.g., syntax and semantics), and
  2. how these uses vary across linguistic contexts (i.e., polysemy)

Other key points:

  • The word vectors are learned functions of the internal states of a deep bidirectional language model (biLM).
  • Exposing the deep internals of the pre-trained network is crucial, as it allows downstream models to mix different types of semi-supervision signals.

Introduction

ELMo (Embeddings from Language Models) representations are a function of all of the internal layers of the biLM.
ELMo representations are deep. More specifically, for each end task we learn a linear combination of the vectors stacked above each input word, which markedly improves performance over using only the top LSTM layer.

The higher-level LSTM states capture context-dependent aspects of word meaning (useful, e.g., for word sense disambiguation),
while the lower-level states model aspects of syntax (e.g., part-of-speech tagging).
Simultaneously exposing all of these signals (the higher- and lower-level states, presumably) is highly beneficial.

Related Work

Previous approaches for learning word vectors only allow a single, context-independent representation for each word.
Through the use of character convolutions, our approach also benefits from subword units, and we seamlessly incorporate multi-sense information into downstream tasks without explicitly training to predict predefined sense classes.

Different layers of deep biRNNs encode different types of information: introducing multi-task syntactic supervision at the lower levels of a deep LSTM can improve the overall performance of higher-level tasks such as dependency parsing or CCG supertagging.
ELMo's modified language-model objective induces similar signals, and it is very beneficial for downstream tasks to learn models that mix these different types of semi-supervision (for example, the POS-like information captured by lower layers and the word-sense information captured by the top LSTM layer).

Approach in this paper: after pretraining the biLM on unlabeled data, we fix its weights and add additional task-specific model capacity, allowing us to leverage large, rich, universal biLM representations for downstream supervised models whose training sets are small.

3 ELMo: Embeddings from Language Models

ELMo word representations are functions of the entire input sentence. They are

  1. computed on top of two-layer biLMs with character convolutions, and
  2. a linear function of the internal network states.

3.1 Bidirectional language models

A bidirectional LM combines a forward LM (predicting each token from its left context) and a backward LM (predicting each token from its right context), trained jointly.
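
As a reminder (reconstructed from the paper, so treat the notation as approximate), the biLM jointly maximizes the log likelihood of the forward and backward directions, sharing the token representation Θ_x and softmax Θ_s parameters while keeping separate LSTM parameters per direction:

```latex
\sum_{k=1}^{N} \Big(
    \log p\!\left(t_k \mid t_1, \ldots, t_{k-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\right)
  + \log p\!\left(t_k \mid t_{k+1}, \ldots, t_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\right)
\Big)
```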

3.2 ELMo

ELMo is a task-specific combination of the intermediate layer representations in the biLM: for each token, all L biLM layers (plus the token embedding) are collapsed into a single vector through learned, softmax-normalized layer weights and a task-specific scale γ.
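
For reference, here is my reconstruction of the combination the paper calls Equation 1 (transcribed from the paper, so treat small details as approximate):

```latex
\mathrm{ELMo}_k^{task}
  = E\!\left(R_k;\, \Theta^{task}\right)
  = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```

where h_{k,0}^{LM} is the context-insensitive token embedding, h_{k,j}^{LM} (j ≥ 1) concatenates the forward and backward LSTM states at layer j, s^task are softmax-normalized weights, and γ^task is a task-specific scale.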

3.3 Using biLMs for supervised NLP tasks

Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model.
With ELMo, this amounts to:

  • simply running the biLM and recording all of the layer representations for each word, then
  • letting the end task model learn a linear combination of these representations.

In more detail:

  1. Many supervised NLP models share a common architecture at the lowest layers, which allows us to add ELMo in a consistent, unified way.

  2. To add ELMo to the supervised model:
    (1) freeze the weights of the biLM
    (2) concatenate ELMo_k^task with x_k to form the enhanced representation [x_k; ELMo_k^task] and feed it into the task RNN (a sketch of this step follows the list)

    For some tasks (e.g., SNLI, SQuAD), further improvements are observed by introducing another set of output-specific linear weights and replacing h_k with [h_k; ELMo_k^task].

    The rest of the supervised model is left unchanged, so these additions can happen within the context of more sophisticated neural models,

    for example biLSTMs followed by a bi-attention layer, or a clustering model layered on top of the biLSTMs.

  3. Adding a moderate amount of dropout to ELMo is beneficial.

    In some cases it also helps to regularize the ELMo weights by adding λ||w||^2 to the loss;
    this imposes an inductive bias on the ELMo weights to stay close to an average of all biLM layers.
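
A minimal PyTorch-style sketch of the layer combination and input concatenation described above; the names (ScalarMix, layer_reps, and so on) are mine, not from the paper or any released implementation:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-weighted combination of biLM layers with a task-specific scale (cf. Eq. 1)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # pre-softmax s^task
        self.gamma = nn.Parameter(torch.ones(1))               # gamma^task

    def forward(self, layer_reps):
        # layer_reps: list of (batch, seq_len, dim) tensors, one per biLM layer
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_reps))
        return self.gamma * mixed

# Usage sketch: run the frozen biLM, mix its layers, then concatenate with the
# context-independent token representation x before the task RNN.
#   elmo = scalar_mix(bilm_layer_outputs)        # (batch, seq, d_elmo)
#   augmented = torch.cat([x, elmo], dim=-1)     # [x_k; ELMo_k^task]
```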

3.4 Pre-trained bidirectional language model architecture

The pre-trained biLMs in this paper support joint training in both directions and add a residual connection between the LSTM layers.

The final model is a CNN-BIG-LSTM with the embedding and hidden dimensions halved:

  • 2 biLSTM layers with 4096 units and 512-dimensional projections
  • a residual connection from the first layer to the second
  • a context-insensitive type representation built from 2048 character n-gram convolutional filters, two highway layers, and a linear projection down to a 512-dimensional representation
  • training for 10 epochs (with the backward perplexity slightly lower than the forward); see the config sketch below
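
The same specification, summarized as a rough, hypothetical Python config (the field names are mine, chosen only to mirror the description above):

```python
# Hypothetical summary of the pre-trained biLM hyperparameters described in this section.
bilm_config = {
    "char_cnn_filters": 2048,     # character n-gram convolution filters
    "highway_layers": 2,
    "projection_dim": 512,        # linear projection down to 512 dimensions
    "lstm": {
        "num_layers": 2,
        "units": 4096,
        "projection": 512,
        "residual": True,         # residual connection from layer 1 to layer 2
    },
    "train_epochs": 10,
}
```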

As a result, the biLM provides three layers of representations for each input token, including tokens outside the training vocabulary, thanks to the purely character-based input. (In contrast, traditional word embedding methods provide only a single layer of representation for tokens in a fixed vocabulary.)

4 Evaluation

ELMo's performance is evaluated on a diverse set of six benchmark NLP tasks;
in every case, simply adding ELMo establishes a new state-of-the-art result.

For example, in question answering (SQuAD):
our baseline model is an improved version of the Bidirectional Attention Flow model (BiDAF, 2017); it adds a self-attention layer after the bidirectional attention component, simplifies some of the pooling operations, and substitutes GRUs for the LSTMs. Adding ELMo to the baseline model significantly improves F1.

Improvements are also reported for textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis.

5 Analysis

An ablation analysis shows that syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

5.1 Alternate layer weighting schemes

There are many alternatives to Equation 1 (Section 3.2) for combining the biLM layers.
The choice of the regularization parameter λ is also important: a large value such as λ = 1 effectively reduces the weighting function to a simple average over the layers, while smaller values (e.g., λ = 0.001) allow the layer weights to vary.

Including representations from all layers improves overall performance more than using only the last layer, and including the contextual representations from the last layer improves performance over the baseline. A small λ is preferred in most cases with ELMo (see the sketch below).
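
A tiny, self-contained illustration of how the λ||w||^2 term could enter the training loss; this is my own reading (penalizing the pre-softmax layer scalars toward zero, which drives the softmax toward a uniform average over layers), not code from the paper:

```python
import torch

def elmo_weight_penalty(layer_scalars: torch.Tensor, lambda_reg: float) -> torch.Tensor:
    """lambda * ||w||^2 penalty on the ELMo layer-weight parameters.

    A large lambda_reg (e.g., 1.0) keeps the scalars near zero, so the softmax over
    them is close to uniform and ELMo reduces to a simple average of the layers;
    a small value (e.g., 0.001) lets the layer weights vary freely.
    """
    return lambda_reg * torch.sum(layer_scalars ** 2)

# Usage sketch (scalar_mix.scalars refers to the hypothetical ScalarMix from Section 3.3):
# total_loss = task_loss + elmo_weight_penalty(scalar_mix.scalars, lambda_reg=0.001)
```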

5.2 Where to include ELMo?

All of the task architectures in this paper feed the word embeddings only into the lowest layer of the biRNN. However, we find that including ELMo at the output of the biRNN in task-specific architectures improves the overall results for some tasks.
One possible explanation is that both the SNLI and SQuAD architectures use attention layers after the biRNN, so introducing ELMo at this layer allows the model to attend directly to the biLM's internal representations.

5.3 What information is captured by the biLM’s representations?

The biLM is able to disambiguate both the part of speech and the word sense of a word in its source sentence.
See the paper for the detailed examples.

5.4 Sample efficiency

Adding ELMo to a model increases the sample efficiency considerably, both in terms of number of parameter updates to reach state-of-the-art performance and the overall training set size.

ELMo-enhanced models use smaller training sets more efficiently than models without ELMo.

6. Conclusion

We have also confirmed that the biLM layers efficiently encode different types of syntactic and semantic information about words-in-context, and that using all layers improves overall task performance.