Understanding the Attention Mechanism in NLP in One Article (with a Detailed Code Walkthrough)

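Published by Machine Learning Algorithms and Natural Language Processing
Original column author of the official account: Don.hub
Affiliation | Algorithm engineer at JD.com
School | Imperial College London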


Outline
Intuition
Analysis
  Pros
  Cons
  From Seq2Seq To Attention Model
    seq2seq is important, but its flaws are obvious
    attention was born
    Write the encoder and decoder model
Taxonomy of attention
  number of sequence
    distinctive
    co-attention
    self
  number of abstraction
    single-level
    multi-level
  number of positions
    soft/global
    hard
    local
  number of representations
    multi-representational
    multi-dimensional
summary
Networks with Attention
  encoder-decoder
    CNN/RNN + RNN
    Pointer Networks
    Transformer
  Memory Networks
Applications
  NLG
  Classification
  Recommendation Systems
ref
1. Outline

2. Intuition

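The word "eye-catching" captures what attention means. When we look at a picture, we are easily drawn to whatever is more important or more salient, so we devote more attention to local regions. In computer vision (CV), this can be viewed as local regions of the image carrying more weight. In image captioning, for example, each word of the caption focuses mainly on one local region.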


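In NLP, think of reading comprehension: we usually read a passage with a question in mind and look for the answer, so different parts of the passage deserve different amounts of attention. Likewise, in sentiment analysis of reviews, certain sentiment words such as "amazing" need special attention, because they are strong sentiment words and often determine the reviewer's sentiment. The HAN model (Yang et al., from 何老师's team) illustrates this.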

Put plainly, attention is just a vector of weights.
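To make the "vector of weights" idea concrete, here is a minimal sketch in plain Python. The scores and the 2-d word vectors are made-up numbers, and the helper names `softmax` and `weighted_sum` are illustrative only; a real model would learn the scores.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Combine vectors using one weight per vector."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Hypothetical relevance scores for the words of a short review.
words = ["the", "movie", "was", "amazing"]
scores = [0.1, 1.0, 0.1, 3.0]
weights = softmax(scores)  # the attention vector: one weight per word

# Toy 2-d word vectors (made up for illustration).
vectors = [[0.0, 0.1], [0.5, 0.2], [0.0, 0.0], [1.0, 0.9]]
summary = weighted_sum(weights, vectors)  # dominated by "amazing"
```

The sentiment word receives the largest weight, so the summary vector stays close to its representation.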

3. Analysis

3.1 Pros

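The main benefits of attention are good interpretability and a large improvement in model quality. It has become a standard module in many SOTA models. In particular, the arrival of the Transformer (which uses self / global / multi-level / multi-head attention) greatly changed the NLP landscape.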

3.2 Cons

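Attention cannot capture position information by itself, so position information has to be added. Of course, different attention mechanisms have different drawbacks. For the Transformer, the biggest drawback is memory consumption: the attention scores form an N*N matrix that has to be stored, so the sequence length N cannot be too large. As a result, long text has to be split into segments, and there is no connection between one segment and the next. (See XLNet and the solution it adopts.)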
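A rough back-of-the-envelope sketch of the N*N cost follows. The function name, the choice of 12 heads, and 4 bytes per score (float32) are assumptions for illustration; real implementations also store activations, gradients, and a batch dimension.

```python
def attention_score_bytes(seq_len, num_heads=1, bytes_per_value=4):
    """Memory for one N x N score matrix per head, for a single example."""
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (512, 1024, 2048):
    mib = attention_score_bytes(n, num_heads=12) / 2**20
    print(f"N={n}: {mib:.0f} MiB")
# N=512: 12 MiB
# N=1024: 48 MiB
# N=2048: 192 MiB
```

Doubling the sequence length quadruples the memory needed for the scores, which is why N has to stay bounded.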

3.3 From Seq2Seq To Attention Model

Why does attention exist? Attention was in fact born for translation tasks (though it ended up not being limited to them). Let us look at how it evolved.

3.3.1 seq2seq is important, but its flaws are obvious

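A Seq2Seq model consists of an encoder and a decoder, and its main purpose is to translate input text into target text. Both the encoder and the decoder are RNNs (a vanilla RNN, LSTM, GRU, or bidirectional RNN). The model encodes the source text into a fixed-length context encoding, and the decoder then uses this encoding to produce the target output. This kind of transformation applies to sequence-to-sequence tasks such as translation, speech conversion, and dialogue generation.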

But the drawbacks of this model are also obvious:
- First, all input is encoded into one fixed-length context vector. How long should it be? There is no definite answer. A fixed-length vector cannot encode all of the context, so much of the long-range dependency information is lost.
- While generating output, the decoder has no mechanism for matching against the encoder input, that is, for attending to different inputs with different weights.
- Second, it is unable to model alignment between input and output sequences, which is an essential aspect of structured output tasks such as translation or summarization [Young et al., 2018]. Intuitively, in sequence-to-sequence tasks, each output token is expected to be more influenced by some specific parts of the input sequence. However, decoder lacks any mechanism to selectively focus on relevant input tokens while generating each output token.

3.3.2 attention was born

NMT [paper] [code] was the first to add an attention block between the encoder and the decoder, mainly to solve the alignment problem between the two.

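Here $s_0$ is the decoder's initial hidden state, which is randomly initialized (in contrast to plain seq2seq, which uses the context vector to initialize the decoder hidden state), and $s_i$ denotes the decoder hidden states.
$h_j$ is the output hidden state at encoder position j.
$\alpha_{ij}$ is the weight that decoder position i assigns to encoder position j.
$y_i$ is the output at decoder position i, that is, the hidden state output passed through a fully connected layer.
$c_i$ is the context vector for decoder position i, which is the weighted sum of the encoder's hidden outputs.
The decoder input is the concatenation of its own hidden state and $c_i$.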
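The quantities described above can be written out as follows. This is a sketch in the usual additive (Bahdanau-style) form, not a transcription of the original figure. The symbol names are assumptions: $h_j$ is the encoder hidden state at position $j$, $s_{i-1}$ the previous decoder hidden state, $e_{ij}$ the unnormalized score, $\alpha_{ij}$ the attention weight, $c_i$ the context vector, $T_x$ the input length, and $W_1$, $W_2$, $v$ trainable parameters.

```latex
\begin{aligned}
e_{ij} &= v^{\top} \tanh\left(W_1 h_j + W_2 s_{i-1}\right) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j
\end{aligned}
```

The softmax runs over encoder positions, so for each decoder step $i$ the weights $\alpha_{ij}$ sum to 1 across $j$.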

3.3.3 Write the encoder and decoder model

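For a detailed implementation, see the Neural Machine Translation (seq2seq) tutorial in the TensorFlow repo, which uses TF 1.x. The code here is the newer TF 2.x version (code).

The hidden states produced by the encoder have shape (batch_size, max_length, hidden_size), and the decoder hidden state has shape (batch_size, hidden_size).

The equations being implemented are as follows: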

This tutorial uses Bahdanau attention for the encoder. Let's decide on notation before writing the simplified form:

FC = Fully connected (dense) layer
EO = Encoder output
H = hidden state
X = input to the decoder

And the pseudo-code:

score = FC(tanh(FC(EO) + FC(H)))
attention weights = softmax(score, axis = 1). Softmax is applied on the last axis by default, but here we apply it on axis 1, because the shape of score is (batch_size, max_length, 1) and max_length is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied along that axis.
context vector = sum(attention weights * EO, axis = 1). Same reason as above for choosing axis as 1.
embedding output = The input to the decoder X is passed through an embedding layer.
merged vector = concat(embedding output, context vector)
This merged vector is then given to the GRU

import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # hidden shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # we are doing this to perform addition to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights


class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))


class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

4. Taxonomy of attention

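Depending on the classification criterion, attention can be divided into many categories, but concretely they are all interactions among q (query), k (key), and v (value). A score is computed from q and k (the score function varies, as listed in the table below) and then normalized with softmax. Finally, the normalized scores are multiplied with v and summed (or the argmax is taken, see pointer network).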

Below is a summary table of several popular attention mechanisms and corresponding alignment score functions:

(*) Referred to as “concat” in Luong, et al., 2015 and as “additive attention” in Vaswani, et al., 2017. (^) It adds a scaling factor $1/\sqrt{n}$, motivated by the concern when the input is large, the softmax function may have an extremely small gradient, hard for efficient learning.
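As a rough illustration of how score functions can differ, the sketch below implements four commonly used forms in plain Python: dot-product, scaled dot-product, general (bilinear), and additive. It is not a reconstruction of the table; the function names, the identity weight matrices, and the example vectors are assumptions chosen so the results are easy to check by hand.

```python
import math

def dot_score(q, k):
    """Dot-product score: q . k"""
    return sum(qi * ki for qi, ki in zip(q, k))

def scaled_dot_score(q, k):
    """Dot product scaled by 1/sqrt(n), where n is the vector dimension."""
    return dot_score(q, k) / math.sqrt(len(k))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def general_score(q, k, W):
    """General (bilinear) score: q^T W k, with W trainable in a real model."""
    return dot_score(q, matvec(W, k))

def additive_score(q, k, W_q, W_k, v):
    """Additive score: v^T tanh(W_q q + W_k k)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_q, q), matvec(W_k, k))]
    return dot_score(v, hidden)

q = [1.0, 0.0, 1.0, 0.0]
k = [1.0, 1.0, 1.0, 1.0]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [1.0, 1.0, 1.0, 1.0]

print(dot_score(q, k))                # 2.0
print(scaled_dot_score(q, k))         # 1.0, because sqrt(4) == 2
print(general_score(q, k, identity))  # 2.0, identity W reduces to the dot product
print(additive_score(q, k, identity, identity, ones))
```

Whichever score function is used, the remaining steps are the same: softmax over the scores, then a weighted sum of the values.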

The categories below are not mutually exclusive. The HAN model, for example, is a multi-level, soft attention model (AM).

4.1 number of sequence

This classification is based on which sequences the query and the value come from.

4.1.1 distinctive

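In distinctive attention, the query and the value come from two different sequences, an input sequence and an output sequence. In the NMT model discussed above, for example, the query comes from the decoder hidden state and the value comes from the encoder hidden states.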

4.1.2 co-attention

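A co-attention model jointly learns weights over multiple input sequences and captures the interactions among these inputs. In visual question answering, for example, the authors argue that attention over the image is important but attention over the question text is equally important, so they learn both jointly, using attention so that the model captures the important parts of the question and the corresponding image information at the same time.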

4.1.3 self

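In tasks such as text classification or recommendation, the input is a sequence but the output is not. In this setting, each word in the text attends to the words of its own sequence to measure how relevant each of them is to it.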

We can look at the docstring of BERT's attention implementation: if from_tensor == to_tensor, then it is self attention.

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.
  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the **same**, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-width vector.
  """
  ...  # function body omitted in this excerpt

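The from/to idea in the docstring above can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's implementation: it uses a single head, scaled dot-product scores, and no learned query/key/value projections, and the name `attend` is hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(from_seq, to_seq):
    """Each vector in from_seq attends over all vectors in to_seq."""
    dim = len(to_seq[0])
    outputs = []
    for q in from_seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in to_seq]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, to_seq)) for d in range(dim)]
        )
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self attention: the same sequence is passed as both arguments.
self_attended = attend(tokens, tokens)

# Attention across sequences: a single query vector attends over the tokens.
cross_attended = attend([[1.0, 0.0]], tokens)
```

Self attention returns one output vector per input token, so the sequence length is preserved.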
4.2 number of abstraction

This classification is based on the levels at which attention weights are computed.

4.2.1 single-level

In the most common case, attention is computed only over the input sequence. This is ordinary single-level attention.

4.2.2 multi-level

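Many models, however, apply attention at several levels. HAN is one example: the model is hierarchical, and its attention also operates on the hierarchy. HAN addresses document classification. It observes that a document is made of sentences and a sentence is made of words, so it builds two levels of encoders (bidirectional GRUs): the lower encoder encodes words and the upper encoder encodes sentences. Between the two encoders sits an attention layer, which is word-level attention. At the final output, a sentence-level attention is applied as well, and its result is fed to a Dense layer for classification. Note that the two queries here, $u_w$ and $u_s$, are randomly initialized and then trained together with the model. The score function also uses a Dense layer, but unlike NMT this is self attention.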
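The two-level pooling described above can be sketched as follows. This is a simplified illustration under stated assumptions: the GRU encoders and the Dense projection are omitted (the word states are made-up numbers standing in for encoder outputs), and `u_word` and `u_sentence` stand in for the trainable queries, here fixed by hand.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, context):
    """Score each state against a context (query) vector, then take the weighted sum."""
    scores = [sum(math.tanh(h) * u for h, u in zip(state, context)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]
    return pooled, weights

# A document is a list of sentences; a sentence is a list of 2-d word states.
document = [
    [[0.1, 0.0], [0.9, 0.8]],
    [[0.2, 0.1], [0.0, 0.3], [0.7, 0.9]],
]
u_word = [1.0, 1.0]        # word-level query (learned in a real model)
u_sentence = [0.5, -0.5]   # sentence-level query (learned in a real model)

# Level 1: pool the words of each sentence into one sentence vector.
sentence_vectors = [attention_pool(sentence, u_word)[0] for sentence in document]

# Level 2: pool the sentence vectors into one document vector.
document_vector, sentence_weights = attention_pool(sentence_vectors, u_sentence)
```

The document vector would then go to a classifier; the sentence weights show which sentences contributed most.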

4.3 number of positions

Depending on which positions the attention layer attends to, attention can be divided into three types: global/soft (these two are nearly the same), local, and hard attention. "Effective Approaches to Attention-based Neural Machine Translation" proposed local and global attention, and "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" proposed hard and soft attention.

4.3.1 soft/global

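Global/soft attention attends to all positions of the input sequence. Its advantage is that it is smooth and differentiable; its drawback is the high computational cost.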

4.3.2 hard

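In hard attention, the context vector is computed from hidden states sampled from the input sequence, which amounts to randomly selecting hidden states and then computing attention on them. This reduces computation, but the computation is no longer differentiable, so training requires reinforcement learning or other techniques such as variational learning methods.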

4.3.3 local

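Local attention is a compromise between hard and soft:
- First, find a point or position in the input sequence that needs attention.
- Then choose a window size and create a local soft attention around that position.
The advantage is that the computation is differentiable and the computational cost is reduced.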
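The three variants can be contrasted in a short sketch. It is a simplification under stated assumptions: scores are given rather than learned, hard attention samples exactly one position, and local attention uses a plain window without the Gaussian weighting used in some formulations. All function names are illustrative.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(scores, values):
    """Weighted sum over every position (differentiable)."""
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def hard_attention(scores, values, rng):
    """Sample one position from the attention distribution (not differentiable)."""
    index = rng.choices(range(len(values)), weights=softmax(scores), k=1)[0]
    return values[index]

def local_attention(scores, values, center, half_width):
    """Soft attention restricted to [center - half_width, center + half_width]."""
    lo = max(0, center - half_width)
    hi = min(len(values), center + half_width + 1)
    return soft_attention(scores[lo:hi], values[lo:hi])

values = [[float(i)] for i in range(6)]
scores = [0.0] * 6  # equal scores, so the weights are uniform

soft = soft_attention(scores, values)                            # mean of 0..5
hard = hard_attention(scores, values, random.Random(0))          # one sampled value
local = local_attention(scores, values, center=4, half_width=1)  # mean of 3, 4, 5
```

With uniform scores, soft attention averages all six positions while local attention averages only the three inside the window.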

4.4 number of representations

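Single-representation is the most common case, which means each input has only one feature representation. In other scenarios an input may have several representations, so here we classify by how the input is represented.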

4.4.1 multi-representational

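In some scenarios, a single feature representation is not enough to capture all the information in the input, and the input can be given several feature representations. For example, Show, attend and tell: Neural image caption generation with visual attention. builds several word embedding representations of the text input and then combines them with an attention-weighted sum. As another example, a text input can be given embeddings along lexical, syntactic, visual, and categorical dimensions, which are again combined with an attention-weighted sum.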
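A minimal sketch of attention over several representations of the same input follows. The representation names, the vectors, and the query are made up, and scoring with a dot product against a query is an assumption for illustration rather than the method of any particular paper.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_representations(representations, query):
    """Weight each representation of one input, then sum them."""
    names = list(representations)
    scores = [sum(q * x for q, x in zip(query, representations[n])) for n in names]
    weights = softmax(scores)
    dim = len(query)
    combined = [
        sum(w * representations[n][d] for w, n in zip(weights, names))
        for d in range(dim)
    ]
    return combined, dict(zip(names, weights))

# Three hypothetical embeddings of the same word.
representations = {
    "word": [0.2, 0.8, 0.1],
    "syntax": [0.5, 0.1, 0.4],
    "category": [0.9, 0.0, 0.3],
}
query = [1.0, 0.0, 0.0]  # would be learned in a real model

combined, weights = combine_representations(representations, query)
```

The output has the same dimension as each individual embedding, with one weight per representation rather than per token.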

4.4.2 multi-dimensional

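As the name suggests, this kind of attention is about dimensions. Its weights determine the relevance of the individual dimensions of the input embedding vector. The dimensions of an embedding can be viewed as implicit features (less intuitive than an explicit representation such as one_hot, and lacking interpretability, but still an implicit feature representation), so computing the relevance of each dimension identifies the feature dimensions that matter most. This is especially effective for resolving polysemy, so the method is useful for sentence-level embedding representations and in NLU.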
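One way to picture multi-dimensional attention is to compute a separate softmax over tokens for every embedding dimension, instead of one scalar weight per token. The sketch below assumes the per-dimension scores are already given (a real model would compute them with trainable layers); the function name and all numbers are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_dim_attention(token_vectors, score_vectors):
    """One softmax per embedding dimension, taken across tokens."""
    dim = len(token_vectors[0])
    pooled, weights_per_dim = [], []
    for d in range(dim):
        weights = softmax([s[d] for s in score_vectors])
        weights_per_dim.append(weights)
        pooled.append(sum(w * t[d] for w, t in zip(weights, token_vectors)))
    return pooled, weights_per_dim

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
# Hypothetical scores: dimension 0 favours token 0, dimension 1 favours token 2.
scores = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]

pooled, weights_per_dim = multi_dim_attention(tokens, scores)
```

Different dimensions can therefore draw on different tokens, which is what allows the mechanism to separate senses of a word.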

5. summary

6. Networks with Attention

Having introduced so many kinds of attention, which networks is attention usually applied to? We summarize two kinds here: encoder-decoder based networks and memory networks.

6.1 encoder-decoder

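Encoder-decoder networks with attention are the most common kind of attention network, and NMT was the first network to propose the idea of attention. The encoder and decoder can be changed flexibly; they do not have to be RNNs.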

6.1.1 CNN/RNN + RNN

For image-to-text tasks, the encoder can be replaced with a CNN; text-to-text tasks can use RNN + RNN.

6.1.2 Pointer Networks

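Not every sequence-in, sequence-out problem can be solved with an encoder-decoder model (e.g. sorting or travelling salesman problem). Take the following problem: we want to find a set of points that encloses all the points in the figure. Given all the points as input, we expect the output to be the points that form this enclosing boundary.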

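Suppose we train a plain encoder-decoder directly. The coordinates of 4 data points are fed in and encoded into a vector, the vector is passed to the decoder to obtain a distribution, and a token is selected from it (for example by argmax, deciding to output token 1, and so on). In the end this does not work. For example, if training uses 50 points numbered 1-50 but testing uses 100 points, the model can only select points numbered 1-50 and can never select the rest.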

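Improvement: with attention, the network can decide dynamically how large the output set is.

(x0, y0) stands for a special token such as END. Every input gets an attention weight, and these weights are used directly as the output distribution.

The model stops when the END point has the highest probability.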
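A single pointer-style decoding step might look like the sketch below. It only illustrates why the output size follows the input size: the query vector stands in for a decoder state and is made up, the scores are plain dot products, and index 0 is reserved for the END token by assumption.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_step(query, inputs):
    """Use the attention weights over the inputs directly as the output distribution."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in inputs]
    distribution = softmax(scores)
    pointed = max(range(len(inputs)), key=distribution.__getitem__)
    return pointed, distribution

END = [0.0, 0.0]  # index 0 plays the role of the END token (x0, y0)
points_small = [END, [1.0, 0.0], [0.0, 1.0]]
points_large = [END, [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
query = [1.0, 1.0]  # stands in for the decoder state

index_small, dist_small = pointer_step(query, points_small)
index_large, dist_large = pointer_step(query, points_large)
print(len(dist_small), len(dist_large))  # 3 5
print(index_small, index_large)          # 1 3
```

The distribution always has exactly one entry per input element, so the same model can point at inputs it never saw at training-time sizes. Decoding would stop once index 0 is selected.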

6.1.3 Transformer

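The Transformer uses an encoder + decoder architecture. It mainly addresses the slow computation of RNNs: the parallel self attention mechanism improves computational efficiency. At the same time it brings heavy computation and excessive memory consumption, so the sequence length cannot be too long; see Transformer-XL for a solution. (A separate article on the Transformer will follow.)
- The role of multi-head: somewhat like the kernels of a CNN, the heads mainly capture different kinds of feature information.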

6.2 Memory Networks

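Applications such as question answering or chatbots need both a query and a knowledge base as input. End-to-end memory networks. stores the knowledge base in an array of memory blocks and then uses attention to match the query to the answer. A memory network consists of four parts: the vector of the query (input), a series of trainable mapping matrices, the attention-weighted sum, and multi-hop reasoning. This makes it possible to reason with facts in the KB, key information in the history, and key information in the query, which is crucial in QA and dialogue. (This part needs to be expanded.)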
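The attention-plus-multi-hop loop can be sketched as below. This is a simplified illustration: the trainable mapping matrices are replaced by fixed, made-up input and output memory vectors, and the function names `memory_hop` and `answer_state` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_hop(query, input_memory, output_memory):
    """Match the query against the memories, read a weighted sum, update the query."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in input_memory]
    weights = softmax(scores)
    dim = len(query)
    read = [sum(w * c[d] for w, c in zip(weights, output_memory)) for d in range(dim)]
    return [q + r for q, r in zip(query, read)], weights

def answer_state(query, input_memory, output_memory, hops=2):
    """Apply several hops so later reads can depend on earlier ones."""
    state = query
    for _ in range(hops):
        state, _ = memory_hop(state, input_memory, output_memory)
    return state

# Made-up embeddings of three stored facts (used for matching and for reading).
input_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_memory = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]
query = [1.0, 0.0]

next_state, weights = memory_hop(query, input_memory, output_memory)
final_state = answer_state(query, input_memory, output_memory, hops=2)
```

In a full model the final state would be projected to an answer distribution; that step is left out here.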

7. Applications

7.1 NLG

MT: machine translation
QA: question answering problems have made use of attention to (i) better understand questions by focusing on relevant parts of the question [Hermann et al., 2015], (ii) store large amount of information using memory networks to help find answers [Sukhbaatar et al., 2015], and (iii) improve performance in visual QA task by modeling multi-modality in input using co-attention [Lu et al., 2016].
Multimedia Description (MD): the task of generating a natural language text description of a multimedia input sequence which can be speech, image and video [Cho et al., 2015]. Similar to QA, here attention performs the function of finding relevant acoustic signals in speech input [Chorowski et al., 2015] or relevant parts of the input image [Xu et al., 2015] to predict the next word in caption. Further, Li et al. [2017] exploit the temporal and spatial structures of videos using multi-level attention for video captioning task. The lower abstraction level extracts specific regions within a frame and higher abstraction level focuses on small subset of frames selectively.

7.2 Classification

Document classification：HAN
Sentiment Analysis：
Similarly, in the sentiment analysis task, self attention helps to focus on the words that are important for determining the sentiment of input. A couple of approaches for aspect based sentiment classification by Wang et al. [2016] and Ma et al. [2018] incorporate additional knowledge of aspect related concepts into the model and use attention to appropriately weigh the concepts apart from the content itself. Sentiment analysis application has also seen multiple architectures being used with attention such as memory networks [Tang et al., 2016] and Transformer [Ambartsoumian and Popowich, 2018; Song et al., 2019].

7.3 Recommendation Systems

Multiple papers use self attention mechanism for finding the most relevant items in user’s history to improve item recommendations either with collaborative filtering framework [He et al., 2018; Shuai Yu, 2019], or within an encoder-decoder architecture for sequential recommendations [Kang and McAuley, 2018; Zhou et al., 2018].

Recently attention has been used in novel ways which has opened new avenues for research. Some interesting directions include smoother incorporation of external knowledge bases, pre-training embeddings and multi-task learning, unsupervised representational learning, sparsity learning and prototypical learning i.e. sample selection.

8. ref

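Well written; its final section on models could be added to this article
A very good survey: An Attentive Survey of Attention Models
wildml.com/2016/01/atte
Illustrated guide to NMT (the decoder part has some mistakes: the decoder's initial embedding is probably defined differently, the encoder's hidden outputs are used as the keys for the attention score, and the actual input is the concatenation of the context and the embedding)
NMT code
pointer network
pointer slides
All Attention You Need (not finished reading yet)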