Reading "Learning to Ask: Neural Question Generation for Reading Comprehension"



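To tackle automatic question generation, the authors propose an attention-based sequence learning model and study the effect of encoding sentence-level versus paragraph-level information. Unlike previous work, their model does not rely on hand-crafted rules or a sophisticated NLP pipeline (I do not fully understand this term; the original wording is "sophisticated NLP pipeline"). In human evaluation, the generated questions were rated as more natural and also as harder to answer: they differ from the source text in syntax and wording, and answering them requires reasoning.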


What question generation is used for

In addition to the above applications, question generation systems can aid in the development of annotated datasets for natural language processing (NLP) research in reading comprehension and question answering. Indeed, the creation of such datasets has spurred research in these areas.

Example: natural questions and their answers (Figure 1 in the paper)


Natural question features

Vanderwende points out that learning to ask questions is an important problem in NLP research, and that a question is more than a syntactic transformation of a declarative sentence.
- In particular, a natural-sounding question often compresses the sentence on which it is based (e.g., question 3 in Figure 1),
- uses synonyms for terms in the passage (e.g., “form” for “produce” in question 2 and “get” for “produce” in question 3),
- refers to entities from preceding sentences or clauses (e.g., the use of “photosynthesis” in question 2).
- Other times, world knowledge is employed to produce a good question (e.g., identifying “photosynthesis” as a “life process” in question 1).


Task Definition

Goal: given an input sentence $x$, generate a natural question $y$ related to information in the sentence.

$y$ can be a sequence of arbitrary length: $[y_1, \dots, y_{|y|}]$. Suppose the length of the input sentence is $M$; $x$ can then be represented as a sequence of tokens $[x_1, \dots, x_M]$. The QG task is defined as finding $\bar{y}$ such that:

$$\bar{y} = \arg\max_{y} P(y \mid x) \tag{1}$$

The conditional probability factorizes into a product of word-level predictions:

$$P(y \mid x) = \prod_{t=1}^{|y|} P(y_t \mid x, y_{<t})$$

where the probability of each $y_t$ is predicted based on all the words generated previously (i.e., $y_{<t}$) and the input sentence $x$.

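Reading the formula: the probability of the final question $y$ is the product of the probabilities of its individual words, which is easy to understand.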

$$P(y_t \mid x, y_{<t}) = \mathrm{softmax}\left(W_s \tanh\left(W_t [h_t; c_t]\right)\right) \tag{2}$$

with $h_t$ being the recurrent neural network's state variable at time step $t$, and $c_t$ being the attention-based encoding of $x$ at decoding time step $t$ (Section 4.2).

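That is, $h_t$ is the state of the recurrent neural network at time step $t$, $c_t$ is the attention-based encoding of the input $x$ at decoding time step $t$, and $W_s$ and $W_t$ are parameters to be learned.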
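To make the shapes in Equation (2) concrete, here is a minimal pure-Python sketch of the output layer. The dimensions, the weight values, and the helper names (`matvec`, `softmax`, `output_distribution`) are made up for illustration and are not taken from the paper.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(W_s, W_t, h_t, c_t):
    """softmax(W_s tanh(W_t [h_t; c_t])) as in Equation (2)."""
    concat = h_t + c_t  # [h_t; c_t], list concatenation
    hidden = [math.tanh(v) for v in matvec(W_t, concat)]
    return softmax(matvec(W_s, hidden))

# Toy sizes (assumptions): decoder state 2, context 3, hidden 4, vocabulary 3.
h_t = [0.1, -0.2]
c_t = [0.3, 0.0, 0.5]
W_t = [[0.1] * 5, [-0.2] * 5, [0.3] * 5, [0.0] * 5]  # 4 x 5
W_s = [[0.5, -0.5, 0.2, 0.1],
       [0.0, 0.3, -0.1, 0.4],
       [0.2, 0.2, 0.2, 0.2]]                          # 3 x 4
probs = output_distribution(W_s, W_t, h_t, c_t)
```

`probs` has one entry per vocabulary word and sums to 1, so it can be read directly as $P(y_t \mid x, y_{<t})$.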

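$$h_t = \mathrm{LSTM}_1(y_{t-1}, h_{t-1}) \tag{3}$$

It generates the new state $h_t$, given the representation of the previously generated word $y_{t-1}$ (obtained from a word look-up table) and the previous state $h_{t-1}$.
In other words, the output predicted at the previous step, $y_{t-1}$, and the previous recurrent state, $h_{t-1}$, produce the current state $h_t$, from which the current word $y_t$ is then predicted.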


The initialization of the decoder’s hidden state differentiates our basic model and the model that incorporates paragraph-level information.

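  • The basic model initializes the decoder with the sentence encoding.
  • The paragraph model initializes the decoder with the sentence encoding and the paragraph encoding together.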



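$$\overrightarrow{b_t} = \overrightarrow{\mathrm{LSTM}_2}(x_t, \overrightarrow{b_{t-1}})$$

$$\overleftarrow{b_t} = \overleftarrow{\mathrm{LSTM}_2}(x_t, \overleftarrow{b_{t+1}})$$

A bidirectional LSTM encodes the input at each time step $t$.
To obtain the attention-based encoding $c_t$ at decoding time, we need the context-dependent token representations $b_t = [\overrightarrow{b_t}; \overleftarrow{b_t}]$, and then take a weighted average over $b_t$ ($t = 1, \dots, |x|$):

$$c_t = \sum_{i=1,\dots,|x|} a_{i,t}\, b_i \tag{4}$$

My understanding is that, at every decoding time step $t$, $c_t$ is a weighted average of the $b_i$.
The weights $a_{i,t}$ are computed by a bilinear scoring function followed by softmax normalization:

$$a_{i,t} = \frac{\exp\left(h_t^{T} W_b b_i\right)}{\sum_j \exp\left(h_t^{T} W_b b_j\right)} \tag{5}$$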
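Equations (4) and (5) can be sketched in a few lines of pure Python. This is an illustration under toy assumptions: the encoder states `B`, the decoder state `h_t`, and the matrix `W_b` are invented values, and `attention` is a hypothetical helper name.

```python
import math

def attention(h_t, W_b, B):
    """Return (weights a_{i,t}, context c_t) for one decoding step.

    B is the list of encoder states b_i; the score of b_i is h_t^T W_b b_i.
    """
    scores = []
    for b in B:
        Wb_b = [sum(w * x for w, x in zip(row, b)) for row in W_b]
        scores.append(sum(h * v for h, v in zip(h_t, Wb_b)))
    # Softmax normalization over source positions (Equation 5).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the encoder states (Equation 4).
    dim = len(B[0])
    context = [sum(weights[i] * B[i][k] for i in range(len(B)))
               for k in range(dim)]
    return weights, context

# Toy example: three source tokens with 3-dimensional encoder states.
B = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h_t = [1.0, 0.0]
W_b = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # maps b_i (dim 3) to h_t's space (dim 2)
weights, context = attention(h_t, W_b, B)
```

With these values only the first source token gets a non-zero score, so it receives the largest weight and dominates the context vector; the other two tokens share the remaining weight equally.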

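To initialize the decoder's hidden state, the final Bi-LSTM hidden states of the two directions are concatenated: $s = [\overrightarrow{b_{|x|}}; \overleftarrow{b_1}]$.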

Paragraph encoder

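The paragraph-level encoder works the same way as the sentence-level one. Because the paragraph can be very long, a length threshold $L$ is set and everything beyond that length is truncated. Denoting the (truncated) paragraph by $z$, another bidirectional LSTM encodes $z$:

$$\overrightarrow{d_t} = \overrightarrow{\mathrm{LSTM}_3}(z_t, \overrightarrow{d_{t-1}})$$

$$\overleftarrow{d_t} = \overleftarrow{\mathrm{LSTM}_3}(z_t, \overleftarrow{d_{t+1}})$$

Likewise, the output of the paragraph encoder, $s'$, is:

$$s' = [\overrightarrow{d_{|z|}}; \overleftarrow{d_1}]$$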
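The two initialization choices for the decoder can be sketched as follows. This assumes that using the sentence and paragraph states "together" means concatenating them; the state values and the helper names `final_bilstm_state` and `decoder_init` are hypothetical, and the paragraph encoder's output is called `s_prime` here to distinguish it from the sentence output `s`.

```python
def final_bilstm_state(forward_states, backward_states):
    """[forward state at the last position ; backward state at the first position]."""
    return forward_states[-1] + backward_states[0]

def decoder_init(s, s_prime=None):
    """Basic model: s only. Paragraph model: s concatenated with s_prime (assumption)."""
    if s_prime is None:
        return list(s)
    return list(s) + list(s_prime)

# Toy sentence encoder states: 3 tokens, hidden size 2 per direction.
fwd_b = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
bwd_b = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]
s = final_bilstm_state(fwd_b, bwd_b)

# Toy paragraph encoder states: 2 tokens, hidden size 2 per direction.
fwd_d = [[1.0, 1.1], [1.2, 1.3]]
bwd_d = [[2.0, 2.1], [2.2, 2.3]]
s_prime = final_bilstm_state(fwd_d, bwd_d)

basic_init = decoder_init(s)               # sentence-only model
paragraph_init = decoder_init(s, s_prime)  # sentence + paragraph model
```

The backward direction is read at position 0 because that is where the right-to-left LSTM has seen the whole sequence.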

Training and Inference

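Given a training corpus of sentence-question pairs $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{S}$, training minimizes the negative log-likelihood with respect to the parameters $\theta$:

$$\mathcal{L} = -\sum_{i=1}^{S} \log P\left(y^{(i)} \mid x^{(i)}; \theta\right) = -\sum_{i=1}^{S} \sum_{j=1}^{|y^{(i)}|} \log P\left(y_j^{(i)} \mid x^{(i)}, y_{<j}^{(i)}; \theta\right)$$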
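As a small numeric illustration of this objective, assuming it is the negative log-likelihood summed over all pairs and all question tokens, the sketch below evaluates it on invented per-token probabilities. In a real model these probabilities would come from the decoder; `corpus_nll` is a hypothetical helper name.

```python
import math

def corpus_nll(corpus_token_probs):
    """Negative log-likelihood: sum over pairs i and tokens j of -log P(y_j | x, y_<j)."""
    return -sum(math.log(p)
                for token_probs in corpus_token_probs
                for p in token_probs)

# Two hypothetical training pairs: a 2-token question and a 1-token question.
corpus_token_probs = [[0.5, 0.5], [0.25]]
loss = corpus_nll(corpus_token_probs)
```

Here the product of all token probabilities is 1/16, so the loss equals log 16.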

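However, the decoder outputs many UNK tokens, so they need to be handled explicitly at decoding time. The paper uses simple replacement: an UNK generated at time step $t$ is replaced with the input-sentence token that has the highest attention score at that step.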
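The replacement step might look like the sketch below. The source sentence, the decoder output, the attention matrix, and the `<unk>` token string are all invented for illustration; `replace_unk` is a hypothetical helper, not code from the paper.

```python
def replace_unk(output_tokens, source_tokens, attention, unk="<unk>"):
    """Replace each UNK with the source token that has the highest attention weight.

    attention[t][i] is the weight a_{i,t} on source token i at decoding step t.
    """
    result = []
    for t, token in enumerate(output_tokens):
        if token == unk:
            weights = attention[t]
            best = max(range(len(source_tokens)), key=lambda i: weights[i])
            result.append(source_tokens[best])
        else:
            result.append(token)
    return result

# Made-up example: the decoder emitted an UNK at step 2.
source = ["teletext", "was", "introduced", "by", "engineers"]
output = ["who", "introduced", "<unk>", "?"]
attention = [
    [0.10, 0.10, 0.20, 0.30, 0.30],
    [0.10, 0.10, 0.70, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]
fixed = replace_unk(output, source, attention)
```

Only the UNK position is touched; every other output token is left as generated.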


We see that the key words in the output (“introduced”, “teletext”, etc.) align well with those in the input sentence.





