Paper: Effective Approaches to Attention-based Neural Machine Translation

Original paper: PDF
Citations: 4675 (as of 2020/11/08)
Year: 2015
Authors: Minh-Thang Luong et al.



Abstract

An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-based NMT. This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches on the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout. Our ensemble model using different attention architectures yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.



1 Introduction

Neural Machine Translation (NMT) achieved state-of-the-art performances in large-scale translation tasks such as from English to French (Luong et al., 2015) and English to German (Jean et al., 2015). NMT is appealing since it requires minimal domain knowledge and is conceptually simple. The model by Luong et al. (2015) reads through all the source words until the end-of-sentence symbol is reached. It then starts emitting one target word at a time, as illustrated in Figure 1. NMT is often a large neural network that is trained in an end-to-end fashion and has the ability to generalize well to very long word sequences. This means the model does not have to explicitly store gigantic phrase tables and language models as in the case of standard MT; hence, NMT has a small memory footprint. Lastly, implementing NMT decoders is easy unlike the highly intricate decoders in standard MT (Koehn et al., 2003).

(Figure 1 omitted)

In parallel, the concept of “attention” has gained popularity recently in training neural networks, allowing models to learn alignments between different modalities, e.g., between image objects and agent actions in the dynamic control problem (Mnih et al., 2014), between speech frames and text in the speech recognition task (?), or between visual features of a picture and its text description in the image caption generation task (Xu et al., 2015). In the context of NMT, Bahdanau et al. (2015) has successfully applied such attentional mechanism to jointly translate and align words. To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.


In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended and a local one whereby only a subset of source words are considered at a time. The former approach resembles the model of (Bahdanau et al., 2015) but is simpler architecturally. The latter can be viewed as an interesting blend between the hard and soft attention models proposed in (Xu et al., 2015): it is computationally less expensive than the global model or the soft attention; at the same time, unlike the hard attention, the local attention is differentiable almost everywhere, making it easier to implement and train. Besides, we also examine various alignment functions for our attention-based models.


Experimentally, we demonstrate that both of our approaches are effective in the WMT translation tasks between English and German in both directions. Our attentional models yield a boost of up to 5.0 BLEU over non-attentional systems which already incorporate known techniques such as dropout. For English to German translation, we achieve new state-of-the-art (SOTA) results for both WMT’14 and WMT’15, outperforming previous SOTA systems, backed by NMT models and n-gram LM rerankers, by more than 1.0 BLEU. We conduct extensive analysis to evaluate our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, alignment quality, and translation outputs.



2 Neural Machine Translation

A neural machine translation system is a neural network that directly models the conditional probability $p(y|x)$ of translating a source sentence, $x_1, \ldots, x_n$, to a target sentence, $y_1, \ldots, y_m$. A basic form of NMT consists of two components: (a) an encoder which computes a representation $\boldsymbol{s}$ for each source sentence and (b) a decoder which generates one target word at a time and hence decomposes the conditional probability as:


$$\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, \boldsymbol{s}) \tag{1}$$

A natural choice to model such a decomposition in the decoder is to use a recurrent neural network (RNN) architecture, which most of the recent NMT work such as (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015; Jean et al., 2015) have in common. They, however, differ in terms of which RNN architectures are used for the decoder and how the encoder computes the source sentence representation $\boldsymbol{s}$.


Kalchbrenner and Blunsom (2013) used an RNN with the standard hidden unit for the decoder and a convolutional neural network for encoding the source sentence representation. On the other hand, both Sutskever et al. (2014) and Luong et al. (2015) stacked multiple layers of an RNN with a Long Short-Term Memory (LSTM) hidden unit for both the encoder and the decoder. Cho et al. (2014), Bahdanau et al. (2015), and Jean et al. (2015) all adopted a different version of the RNN with an LSTM-inspired hidden unit, the gated recurrent unit (GRU), for both components.


In more detail, one can parameterize the probability of decoding each word $y_j$ as:

$$p(y_j \mid y_{<j}, \boldsymbol{s}) = \operatorname{softmax}\big(g(\boldsymbol{h}_j)\big) \tag{2}$$

with $g$ being the transformation function that outputs a vocabulary-sized vector. Here, $\boldsymbol{h}_j$ is the RNN hidden unit, abstractly computed as:


$$\boldsymbol{h}_j = f(\boldsymbol{h}_{j-1}, \boldsymbol{s}) \tag{3}$$

where $f$ computes the current hidden state given the previous hidden state and can be either a vanilla RNN unit, a GRU, or an LSTM unit. In (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015), the source representation $\boldsymbol{s}$ is only used once to initialize the decoder hidden state. On the other hand, in (Bahdanau et al., 2015; Jean et al., 2015) and this work, $\boldsymbol{s}$, in fact, implies a set of source hidden states which are consulted throughout the entire course of the translation process. Such an approach is referred to as an attention mechanism, which we will discuss next.

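To make the abstract recurrence above concrete, here is a minimal NumPy sketch of one decoding step in the spirit of Eqs. (2)-(3). It uses a vanilla RNN cell for $f$ and a single linear map for $g$; the paper's systems use stacked LSTMs, and all dimensions and weight names here are illustrative assumptions.

```python
# Minimal sketch of one decoder step: h_j = f(h_{j-1}, s), p(y_j | y_<j, s) = softmax(g(h_j)).
# A vanilla tanh RNN cell stands in for f; the paper uses stacked LSTMs.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, vocab = 4, 10
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden, hidden))   # recurrent weights of f (illustrative)
W_s = rng.normal(size=(hidden, hidden))   # maps the source summary s into f (illustrative)
W_g = rng.normal(size=(vocab, hidden))    # g: hidden state -> vocabulary scores

s = rng.normal(size=hidden)               # source representation s
h_prev = np.zeros(hidden)                 # previous decoder hidden state

h_j = np.tanh(W_h @ h_prev + W_s @ s)     # Eq. (3)
p_yj = softmax(W_g @ h_j)                 # Eq. (2)
print(p_yj.shape, p_yj.sum())             # (10,) 1.0
```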

In this work, following (Sutskever et al., 2014; Luong et al., 2015), we use the stacking LSTM architecture for our NMT systems, as illustrated in Figure 1. We use the LSTM unit defined in (Zaremba et al., 2015). Our training objective is formulated as follows:


$$J_t = \sum_{(x,y) \in \mathbb{D}} -\log p(y|x) \tag{4}$$

with $\mathbb{D}$ being our parallel training corpus.
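
The following toy sketch spells out Eq. (4): sum the negative log-probabilities $-\log p(y|x)$ over a (here hand-made) parallel corpus, with the per-word probabilities standing in for what the decoder's softmax would produce.

```python
# Toy sketch of the training objective in Eq. (4): J_t = sum over (x, y) in D of -log p(y|x),
# where -log p(y|x) decomposes into per-target-word terms as in Eq. (1).
# `word_probs` stands in for the decoder's softmax outputs and is purely illustrative.
import numpy as np

corpus = [
    # (source, target, p(y_j | y_<j, x) for each target word)
    (["ich", "bin", "hier"], ["i", "am", "here"], [0.4, 0.5, 0.3]),
    (["hallo", "welt"],      ["hello", "world"], [0.6, 0.2]),
]

J = 0.0
for x, y, word_probs in corpus:
    J += -np.sum(np.log(word_probs))   # -log p(y|x) = -sum_j log p(y_j | y_<j, x)
print(J)
```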


3 Attention-based Models

Our various attention-based models are classified into two broad categories, global and local. These classes differ in terms of whether the “attention” is placed on all source positions or on only a few source positions. We illustrate these two model types in Figures 2 and 3 respectively.

(Figure 2 omitted)
Common to these two types of models is the fact that at each time step $t$ in the decoding phase, both approaches first take as input the hidden state $\boldsymbol{h}_t$ at the top layer of a stacking LSTM. The goal is then to derive a context vector $\boldsymbol{c}_t$ that captures relevant source-side information to help predict the current target word $y_t$. While these models differ in how the context vector $\boldsymbol{c}_t$ is derived, they share the same subsequent steps.


Specifically, given the target hidden state $\boldsymbol{h}_t$ and the source-side context vector $\boldsymbol{c}_t$, we employ a simple concatenation layer to combine the information from both vectors to produce an attentional hidden state as follows:


$$\tilde{\boldsymbol{h}}_t = \tanh\big(\mathbf{W}_c[\boldsymbol{c}_t; \boldsymbol{h}_t]\big) \tag{5}$$

The attentional vector $\tilde{\boldsymbol{h}}_t$ is then fed through the softmax layer to produce the predictive distribution formulated as:


$$p(y_t \mid y_{<t}, x) = \operatorname{softmax}(\mathbf{W}_s \tilde{\boldsymbol{h}}_t) \tag{6}$$
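
As a quick illustration of Eqs. (5)-(6), the sketch below combines a decoder state $\boldsymbol{h}_t$ and a context vector $\boldsymbol{c}_t$ into the attentional hidden state and projects it to a toy vocabulary; shapes and weights are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of Eqs. (5)-(6): attentional hidden state, then word prediction.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, vocab = 4, 10
rng = np.random.default_rng(0)
W_c = rng.normal(size=(hidden, 2 * hidden))  # Eq. (5): acts on [c_t; h_t]
W_s = rng.normal(size=(vocab, hidden))       # Eq. (6): projects to the vocabulary

h_t = rng.normal(size=hidden)                # top-layer decoder state
c_t = rng.normal(size=hidden)                # source-side context vector

h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # Eq. (5)
p_yt = softmax(W_s @ h_tilde)                          # Eq. (6)
print(p_yt.argmax(), p_yt.sum())
```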

We now detail how each model type computes the source-side context vector $\boldsymbol{c}_t$.

3.1 Global Attention

The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector $\boldsymbol{c}_t$. In this model type, a variable-length alignment vector $\boldsymbol{a}_t$, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state $\boldsymbol{h}_t$ with each source hidden state $\bar{\boldsymbol{h}}_s$:


$$\boldsymbol{a}_t(s) = \operatorname{align}(\boldsymbol{h}_t, \bar{\boldsymbol{h}}_s) = \frac{\exp\big(\operatorname{score}(\boldsymbol{h}_t, \bar{\boldsymbol{h}}_s)\big)}{\sum_{s'} \exp\big(\operatorname{score}(\boldsymbol{h}_t, \bar{\boldsymbol{h}}_{s'})\big)} \tag{7}$$

Here, score is referred to as a content-based function for which we consider three different alternatives:

$$\operatorname{score}(\boldsymbol{h}_t, \bar{\boldsymbol{h}}_s) =
\begin{cases}
\boldsymbol{h}_t^\top \bar{\boldsymbol{h}}_s & \textit{dot} \\
\boldsymbol{h}_t^\top \mathbf{W}_a \bar{\boldsymbol{h}}_s & \textit{general} \\
\boldsymbol{v}_a^\top \tanh\big(\mathbf{W}_a [\boldsymbol{h}_t; \bar{\boldsymbol{h}}_s]\big) & \textit{concat}
\end{cases}$$
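
A small sketch of the three content-based score functions follows; $\mathbf{W}_a$ and $\boldsymbol{v}_a$ are learned parameters in the paper, while here they are random placeholders used only to show the shapes involved.

```python
# Sketch of the dot, general, and concat score functions for one (h_t, h_s) pair.
import numpy as np

hidden = 4
rng = np.random.default_rng(0)
h_t = rng.normal(size=hidden)                 # current target hidden state
h_s = rng.normal(size=hidden)                 # one source hidden state (h_s bar)
W_a_gen = rng.normal(size=(hidden, hidden))   # placeholder for W_a (general)
W_a_cat = rng.normal(size=(hidden, 2 * hidden))  # placeholder for W_a (concat)
v_a = rng.normal(size=hidden)                 # placeholder for v_a

def score(h_t, h_s, kind):
    if kind == "dot":
        return h_t @ h_s
    if kind == "general":
        return h_t @ W_a_gen @ h_s
    if kind == "concat":
        return v_a @ np.tanh(W_a_cat @ np.concatenate([h_t, h_s]))
    raise ValueError(kind)

for kind in ("dot", "general", "concat"):
    print(kind, score(h_t, h_s, kind))
```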

Besides, in our early attempts to build attention-based models, we use a location-based function in which the alignment scores are computed from solely the target hidden state $\boldsymbol{h}_t$ as follows:


$$\boldsymbol{a}_t = \operatorname{softmax}(\mathbf{W}_a \boldsymbol{h}_t) \qquad \textit{location} \tag{8}$$

Given the alignment vector as weights, the context vector $\boldsymbol{c}_t$ is computed as the weighted average over all the source hidden states.

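Putting Eq. (7) and this weighted average together, a minimal sketch of global attention (using the dot score for simplicity) looks as follows; the source states and dimensions are illustrative.

```python
# Sketch of global attention: score every source state, softmax into a_t (Eq. 7),
# then take the weighted average of the source states as the context vector c_t.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_src, hidden = 6, 4
rng = np.random.default_rng(0)
H_src = rng.normal(size=(n_src, hidden))   # source hidden states (h_s bar), one per source word
h_t = rng.normal(size=hidden)              # current target hidden state

scores = H_src @ h_t                       # dot score for each source position
a_t = softmax(scores)                      # Eq. (7): alignment weights over source words
c_t = a_t @ H_src                          # context vector: weighted average of source states
print(a_t.round(3), c_t.shape)
```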

Comparison to (Bahdanau et al., 2015) – While our global attention approach is similar in spirit to the model proposed by Bahdanau et al. (2015), there are several key differences which reflect how we have both simplified and generalized from the original model. First, we simply use hidden states at the top LSTM layers in both the encoder and decoder as illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in the bi-directional encoder and target hidden states in their non-stacking unidirectional decoder. Second, our computation path is simpler; we go from $\boldsymbol{h}_t \rightarrow \boldsymbol{a}_t \rightarrow \boldsymbol{c}_t \rightarrow \tilde{\boldsymbol{h}}_t$ and then make a prediction as detailed in Eq. (5), Eq. (6), and Figure 2. On the other hand, at any time $t$, Bahdanau et al. (2015) build from the previous hidden state $\boldsymbol{h}_{t-1} \rightarrow \boldsymbol{a}_t \rightarrow \boldsymbol{c}_t \rightarrow \boldsymbol{h}_t$, which, in turn, goes through a deep-output and a maxout layer before making predictions. Lastly, Bahdanau et al. (2015) only experimented with one alignment function, the concat product; whereas we show later that the other alternatives are better.


3.2 Local Attention

The global attention has a drawback that it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical to translate longer sequences, e.g., paragraphs or documents. To address this deficiency, we propose a local attentional mechanism that chooses to focus only on a small subset of the source positions per target word.

(Figure 3 omitted)

This model takes inspiration from the tradeoff between the soft and hard attentional models proposed by Xu et al. (2015) to tackle the image caption generation task. In their work, soft attention refers to the global attention approach in which weights are placed “softly” over all patches in the source image. The hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.


Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has the advantage of avoiding the expensive computation incurred in the soft attention and, at the same time, is easier to train than the hard attention approach. In concrete details, the model first generates an aligned position $p_t$ for each target word at time $t$. The context vector $\boldsymbol{c}_t$ is then derived as a weighted average over the set of source hidden states within the window $[p_t - D, p_t + D]$; $D$ is empirically selected. Unlike the global approach, the local alignment vector $\boldsymbol{a}_t$ is now fixed-dimensional, i.e., $\in \mathbb{R}^{2D+1}$. We consider two variants of the model as below.


Monotonic alignment (local-m) – we simply set $p_t = t$, assuming that source and target sequences are roughly monotonically aligned. The alignment vector $\boldsymbol{a}_t$ is defined according to Eq. (7).


Predictive alignment (local-p) – instead of assuming monotonic alignments, our model predicts an aligned position as follows:


$$p_t = S \cdot \operatorname{sigmoid}\big(\boldsymbol{v}_p^\top \tanh(\mathbf{W}_p \boldsymbol{h}_t)\big) \tag{9}$$

$\mathbf{W}_p$ and $\boldsymbol{v}_p$ are the model parameters which will be learned to predict positions. $S$ is the source sentence length. As a result of the sigmoid, $p_t \in [0, S]$. To favor alignment points near $p_t$, we place a Gaussian distribution centered around $p_t$. Specifically, our alignment weights are now defined as:


$$\boldsymbol{a}_t(s) = \operatorname{align}(\boldsymbol{h}_t, \bar{\boldsymbol{h}}_s)\, \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right) \tag{10}$$

We use the same align function as in Eq. (7) and the standard deviation is empirically set as $\sigma = \frac{D}{2}$. Note that $p_t$ is a real number, whereas $s$ is an integer within the window centered at $p_t$.

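A minimal sketch of local-p is given below: it predicts $p_t$ as in Eq. (9), restricts attention to the window $[p_t - D, p_t + D]$, and applies the Gaussian weighting of Eq. (10) with $\sigma = D/2$. The window size $D$ and all parameters are illustrative assumptions.

```python
# Sketch of local attention with predictive alignment (local-p).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

S, hidden, D = 12, 4, 3                     # source length, state size, half window (illustrative)
rng = np.random.default_rng(0)
H_src = rng.normal(size=(S, hidden))        # source hidden states (h_s bar)
h_t = rng.normal(size=hidden)
W_p = rng.normal(size=(hidden, hidden))     # placeholder for W_p
v_p = rng.normal(size=hidden)               # placeholder for v_p

p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))                # Eq. (9): p_t in [0, S], a real number
window = np.arange(max(0, int(p_t) - D), min(S, int(p_t) + D + 1))  # integer positions near p_t
sigma = D / 2.0

align = softmax(H_src[window] @ h_t)                       # content-based part, as in Eq. (7)
a_t = align * np.exp(-((window - p_t) ** 2) / (2 * sigma ** 2))     # Eq. (10)
c_t = a_t @ H_src[window]                                  # context from the 2D+1 window
print(round(p_t, 2), a_t.round(3))
```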

Comparison to (Gregor et al., 2015) – Gregor et al. (2015) have proposed a selective attention mechanism, very similar to our local attention, for the image generation task. Their approach allows the model to select an image patch of varying location and zoom. We, instead, use the same “zoom” for all target positions, which greatly simplifies the formulation and still achieves good performance.


3.3 Input-feeding Approach

In our proposed global and local approaches, the attentional decisions are made independently, which is suboptimal. Whereas, in standard MT, a coverage set is often maintained during the translation process to keep track of which source words have been translated. Likewise, in attentional NMTs, alignment decisions should be made jointly taking into account past alignment information. To address that, we propose an input-feeding approach in which attentional vectors $\tilde{\boldsymbol{h}}_t$ are concatenated with inputs at the next time steps as illustrated in Figure 4. The effects of having such connections are two-fold: (a) we hope to make the model fully aware of previous alignment choices and (b) we create a very deep network spanning both horizontally and vertically.

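The sketch below illustrates the input-feeding idea only: the attentional vector from the previous step is concatenated with the current target word embedding before entering the recurrent unit. A single tanh cell stands in for the paper's stacked LSTM, and all names are illustrative.

```python
# Sketch of input-feeding: concatenate the previous attentional vector h_tilde
# with the target word embedding before it enters the decoder cell.
import numpy as np

emb, hidden = 3, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(hidden, emb + hidden))   # consumes [embedding; h_tilde_{t-1}]
W_rec = rng.normal(size=(hidden, hidden))

h_prev = np.zeros(hidden)
h_tilde_prev = np.zeros(hidden)                  # attentional vector from step t-1
x_t = rng.normal(size=emb)                       # embedding of the previous target word

step_input = np.concatenate([x_t, h_tilde_prev]) # input-feeding: past alignment info flows in
h_t = np.tanh(W_in @ step_input + W_rec @ h_prev)
print(h_t.shape)
```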
(Figure 4 omitted)
Comparison to other work – Bahdanau et al. (2015) use context vectors, similar to our $\boldsymbol{c}_t$, in building subsequent hidden states, which can also achieve the “coverage” effect. However, there has not been any analysis of whether such connections are useful as done in this work. Also, our approach is more general; as illustrated in Figure 4, it can be applied to general stacking recurrent architectures, including non-attentional models.


Xu et al. (2015) propose a doubly attentional approach with an additional constraint added to the training objective to make sure the model pays equal attention to all parts of the image during the caption generation process. Such a constraint can also be useful to capture the coverage set effect in NMT that we mentioned earlier. However, we chose to use the input-feeding approach since it provides flexibility for the model to decide on any attentional constraints it deems suitable.



4 Experiments

We evaluate the effectiveness of our models on the WMT translation tasks between English and German in both directions. newstest2013 (3000 sentences) is used as a development set to select our hyperparameters. Translation performances are reported in case-sensitive BLEU (Papineni et al., 2002) on newstest2014 (2737 sentences) and newstest2015 (2169 sentences). Following (Luong et al., 2015), we report translation quality using two types of BLEU: (a) tokenized BLEU to be comparable with existing NMT work and (b) NIST BLEU to be comparable with WMT results.


4.1 Training Details

(figure omitted)

4.2 English-German Results

(figure omitted)

4.3 German-English Results

(figure omitted)


5 Analysis

We conduct extensive analysis to better understand our models in terms of learning, the ability to handle long sentences, choices of attentional architectures, and alignment quality. All results reported here are on English-German newstest2014.


5.1 Learning curves

We compare models built on top of one another as listed in Table 1. It is pleasant to observe in Figure 5 a clear separation between non-attentional and attentional models. The input-feeding approach and the local attention model also demonstrate their abilities in driving the test costs lower. The non-attentional model with dropout (the blue + curve) learns slower than other non-dropout models, but as time goes by, it becomes more robust in terms of minimizing test errors.

(Figure 5 omitted)

5.2 Effects of Translating Long Sentences

We follow (Bahdanau et al., 2015) to group sentences of similar lengths together and compute a BLEU score per group. Figure 6 shows that our attentional models are more effective than the non-attentional one in handling long sentences: the quality does not degrade as sentences become longer. Our best model (the blue + curve) outperforms all other systems in all length buckets.

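The bucket analysis itself is easy to reproduce in outline: group sentences by source length and report a score per bucket. The sketch below uses a trivial unigram-precision stand-in for BLEU purely to stay self-contained; the data and the scoring function are illustrative, not the paper's setup.

```python
# Sketch of the length-bucket analysis: group test sentences by source length
# and report a score per bucket. `unigram_precision` is only a stand-in for BLEU.
from collections import defaultdict

def unigram_precision(hyp, ref):
    ref_words = set(ref.split())
    hyp_words = hyp.split()
    return sum(w in ref_words for w in hyp_words) / max(len(hyp_words), 1)

data = [  # (source, hypothesis, reference) toy examples
    ("ein kurzer satz", "a short sentence", "a short sentence"),
    ("das ist ein etwas laengerer deutscher satz", "this is a somewhat longer sentence",
     "this is a somewhat longer german sentence"),
]

buckets = defaultdict(list)
for src, hyp, ref in data:
    bucket = (len(src.split()) // 10) * 10          # lengths 0-9, 10-19, ...
    buckets[bucket].append(unigram_precision(hyp, ref))

for bucket in sorted(buckets):
    scores = buckets[bucket]
    print(f"source length {bucket}-{bucket + 9}: {sum(scores) / len(scores):.2f}")
```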
(Figure 6 omitted)

5.3 Choices of Attentional Architectures

We examine different attention models (global, local-m, local-p) and different alignment functions (location, dot, general, concat) as described in Section 3. Due to limited resources, we cannot run all the possible combinations. However, results in Table 4 do give us some idea about different choices. The location-based function does not learn good alignments: the global (location) model can only obtain a small gain when performing unknown word replacement compared to using other alignment functions. For content-based functions, our implementation of concat does not yield good performances and more analysis should be done to understand the reason. It is interesting to observe that dot works well for the global attention and general is better for the local attention. Among the different models, the local attention model with predictive alignments (local-p) is best, both in terms of perplexities and BLEU.

(Table 4 omitted)

5.4 Alignment Quality

A by-product of attentional models are word alignments. While (Bahdanau et al., 2015) visualized alignments for some sample sentences and observed gains in translation quality as an indication of a working attention model, no work has assessed the alignments learned as a whole. In contrast, we set out to evaluate the alignment quality using the alignment error rate (AER) metric.


Given the gold alignment data provided by RWTH for 508 English-German Europarl sentences, we “force” decode our attentional models to produce translations that match the references. We extract only one-to-one alignments by selecting the source word with the highest alignment weight per target word. Nevertheless, as shown in Table 6, we were able to achieve AER scores comparable to the one-to-many alignments obtained by the Berkeley aligner (Liang et al., 2006).

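Extracting the one-to-one alignments described here amounts to an argmax over the attention weights per target word, as in the following sketch with a toy weight matrix (in the paper this is done while force-decoding the references).

```python
# Sketch of extracting one-to-one word alignments from attention weights:
# for each target position, pick the source word with the highest weight.
import numpy as np

# attn[t, s] = weight placed on source word s when producing target word t (toy values)
attn = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])
alignments = [(t, int(attn[t].argmax())) for t in range(attn.shape[0])]
print(alignments)   # [(0, 0), (1, 1), (2, 2)]
```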

We also found that the alignments produced by local attention models achieve lower AERs than those of the global one. The AER obtained by the ensemble, while good, is not better than the local-m AER, suggesting the well-known observation that AER and translation scores are not well correlated (Fraser and Marcu, 2007). We show some alignment visualizations in Appendix A.


5.5 Sample Translations

(figure omitted)


6 Conclusion

In this paper, we propose two simple and effective attentional mechanisms for neural machine translation: the global approach which always looks at all source positions and the local one that only attends to a subset of source positions at a time. We test the effectiveness of our models in the WMT translation tasks between English and German in both directions. Our local attention yields large gains of up to 5.0 BLEU over non-attentional models which already incorporate known techniques such as dropout. For the English to German translation direction, our ensemble model has established new state-of-the-art results for both WMT’14 and WMT’15, outperforming existing best systems, backed by NMT models and n-gram LM rerankers, by more than 1.0 BLEU.


We have compared various alignment functions and shed light on which functions are best for which attentional models. Our analysis shows that attention-based NMT models are superior to nonattentional ones in many cases, for example in translating names and handling long sentences.
