A Neural Probabilistic Language Model (2003): Key Points

Date: 2019-11-09

Paper link: http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Addresses the combinatorial explosion of n-gram language models (e.g. beyond trigrams) by introducing a distributed representation for words.

Generalization comes from making the vectors of words that appear in similar contexts and similar sentences close to one another.

Compared with n-grams, the model takes into account longer contexts and the similarity between words.

Uses a shallow network (e.g. one hidden layer) to train on a large corpus.

The feature vector dimension is typically at most 100, whereas the vocabulary size is typically above 17,000.

C is a globally shared array of word vectors: the same table is used for every position in the context window.

Maximize the regularized log-likelihood:
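For reference, the training objective as stated in the paper, reproduced here because no formula survives at this point in the notes. Here T is the number of training words, f is the model's estimate of P(w_t | w_{t-1}, ..., w_{t-n+1}), θ collects all parameters, and R(θ) is a regularization term such as weight decay:

```latex
L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta)
```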

Unnormalized log-probabilities:
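For reference, the paper's expressions for the output scores and their normalization, written with the parameter names listed in these notes (b, W, U, d, H, C). The vector x is the concatenation of the feature vectors of the n-1 context words, and y has one entry per vocabulary word:

```latex
x = \bigl(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\bigr)

y = b + Wx + U \tanh(d + Hx)

\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}
```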

number of hidden units = h

word feature vector dimension = m

context window width = n

output biases b: |V|

hidden layer biases d: h

hidden to output weights U: |V|*h

word feature vector to output weights W: |V|*(n-1)*m

hidden layer weights H: h*(n-1)*m

word feature vector matrix C: |V|*m
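The shapes above can be checked with a small pure-Python sketch. It is an illustration under assumptions, not the authors' implementation: the function names (`count_parameters`, `init_params`, `forward`), the random initialization, and the tiny sizes in the usage note are all hypothetical. The forward pass follows y = b + Wx + U tanh(d + Hx) followed by a softmax.

```python
import math
import random

def count_parameters(V, m, n, h):
    """Sum the sizes of b, d, U, W, H and C as listed above."""
    sizes = {
        "b": V,
        "d": h,
        "U": V * h,
        "W": V * (n - 1) * m,
        "H": h * (n - 1) * m,
        "C": V * m,
    }
    return sum(sizes.values())

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(V, m, n, h, seed=0):
    # Small random weights and zero biases: an arbitrary choice for this sketch.
    rng = random.Random(seed)
    return {
        "C": rand_matrix(V, m, rng),
        "H": rand_matrix(h, (n - 1) * m, rng),
        "d": [0.0] * h,
        "U": rand_matrix(V, h, rng),
        "W": rand_matrix(V, (n - 1) * m, rng),
        "b": [0.0] * V,
    }

def forward(params, context):
    """Return P(w | context) for every word w; context holds n-1 word indices."""
    # Look up each context word in the shared table C and concatenate.
    x = [v for w in context for v in params["C"][w]]
    # Hidden layer: a = tanh(d + Hx).
    a = [math.tanh(di + hi) for di, hi in zip(params["d"], matvec(params["H"], x))]
    # Unnormalized scores: y = b + Wx + Ua.
    Wx = matvec(params["W"], x)
    Ua = matvec(params["U"], a)
    y = [bi + wi + ui for bi, wi, ui in zip(params["b"], Wx, Ua)]
    # Softmax, shifted by max(y) for numerical stability.
    top = max(y)
    exps = [math.exp(yi - top) for yi in y]
    Z = sum(exps)
    return [e / Z for e in exps]
```

With |V| = 50, m = 4, n = 3, h = 8, `count_parameters(50, 4, 3, 8)` gives 1122, which matches the closed form |V|(1 + nm + h) + h(1 + (n-1)m). Calling `forward(init_params(50, 4, 3, 8), [3, 7])` returns 50 probabilities that sum to 1.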

Note that in theory, if there is a weight decay on the weights W and H but not on C, then W and H could converge towards zero while C would blow up. In practice we did not observe such behavior when training with stochastic gradient ascent.

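At each training step, most parameters do not need to be updated: the feature vectors in C of words that do not occur in the input window are left untouched.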
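A minimal sketch of that sparsity, assuming the gradient of the objective with respect to the concatenated input x has already been computed by backpropagation. The function name `update_word_features` and its arguments are hypothetical; only the rows of C belonging to the n-1 context words are modified.

```python
def update_word_features(C, context, grad_x, lr):
    """Gradient-ascent step on C given dL/dx; touches only context-word rows.

    C       : list of |V| rows, each a list of m floats (modified in place)
    context : the n-1 word indices that were looked up to build x
    grad_x  : gradient w.r.t. the concatenated input, length (n-1)*m
    lr      : learning rate
    """
    m = len(C[0])
    for k, w in enumerate(context):
        # Slice of grad_x that corresponds to the k-th context position.
        segment = grad_x[k * m:(k + 1) * m]
        for j in range(m):
            # += so that a word repeated in the context accumulates both terms.
            C[w][j] += lr * segment[j]
    return set(context)  # indices of the rows that were changed
```

For a 5-word vocabulary with m = 2 and context `[1, 3]`, only rows 1 and 3 of C change; the other three rows are never read or written.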

Training algorithm:
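For reference, the stochastic gradient ascent update given in the paper, applied after presenting the t-th word of the training corpus. Here ε is the learning rate and θ stands for all parameters (b, d, W, U, H, C):

```latex
\theta \leftarrow \theta + \varepsilon \,
  \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1})}{\partial \theta}
```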

Possible improvements:

1. Split the network into sub-networks and train them in parallel

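2. Replace the flat output vocabulary |V| with a tree structure and predict a conditional probability at each level: computation |V| -> log|V|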

3. Focus the gradient on particular samples, e.g. samples containing ambiguous words

4. Introduce prior knowledge (part of speech, etc.)

5. Interpretability

6. Polysemy (one word with multiple word vectors)
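Item 2 can be illustrated with a rough sketch. It is not taken from the paper; it assumes a complete binary tree whose leaves are the words, with each word's path given by the bits of its index and one hypothetical weight vector per internal node (`node_vectors`). Each word's probability is then a product of log2|V| binary decisions instead of a softmax over |V| scores.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def word_probability(word, depth, node_vectors, hidden):
    """P(word | context) as a product of `depth` left/right decisions.

    The tree is complete and heap-indexed: the root is node 1 and the
    children of node k are 2k (left) and 2k+1 (right). Leaves are words.
    """
    node = 1
    prob = 1.0
    for level in reversed(range(depth)):
        bit = (word >> level) & 1  # 0 = go left, 1 = go right
        score = sum(a * b for a, b in zip(node_vectors[node], hidden))
        p_right = sigmoid(score)
        prob *= p_right if bit else 1.0 - p_right
        node = 2 * node + bit
    return prob

# Toy setup: |V| = 16 words, so each word needs only 4 decisions.
rng = random.Random(0)
depth, dim = 4, 3
vocab_size = 2 ** depth
# Index 0 is unused; indices 1 .. vocab_size - 1 are the internal nodes.
node_vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
hidden = [rng.uniform(-1, 1) for _ in range(dim)]
probs = [word_probability(w, depth, node_vectors, hidden) for w in range(vocab_size)]
```

The probabilities in `probs` sum to 1 by construction, because the two branches at every internal node split that node's probability mass between them.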