Author: chen_h
WeChat & QQ: 862251340
WeChat public account: coderpai
Personal blog: click here
I believe the best way to learn an algorithm is to try to implement it, so in this tutorial we will learn how to implement word embeddings with TensorFlow.
There are many techniques for building word embeddings; here we discuss one particularly well-known technique. Contrary to what you might expect, word2vec is not a deep network; it is just a shallow, three-layer network.
Note: word2vec involves a lot of technical detail, but we will skip most of it to keep things easy to understand.
The word2vec network is laid out as follows: a one-hot input layer, a hidden (embedding) layer, and a softmax output layer that predicts the neighbouring words.
It's that simple: this shallow three-layer network is enough to learn reasonably good word vectors.
Now let's implement the model. The complete code is available on Github, but I suggest you don't jump straight to it; work through it step by step first.
First, let's define the raw text we are going to work with:
import numpy as np
import tensorflow as tf
corpus_raw = 'He is the king . The king is royal . She is the royal queen '
# convert to lower case
corpus_raw = corpus_raw.lower()
Now we need to turn the raw text into (input, output) pairs, so that given an input word we can predict the words near it. To do this, we pick a center word and a window size; the words that fall within the window on either side of the center word are its neighbours. For example, with the center word king in the sentence he is the king and a window size of 2, the neighbouring words are is and the.
Before doing that, we need to build a dictionary that maps each word to an index (and back), as follows:
words = []
for word in corpus_raw.split():
    if word != '.':  # because we don't want to treat . as a word
        words.append(word)
words = set(words)  # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words)  # gives the total number of unique words
for i, word in enumerate(words):
    word2int[word] = i
    int2word[i] = word
These dictionaries work as follows:
print(word2int['queen'])
-> 42 (say)
print(int2word[42])
-> 'queen'
Next, we split the corpus into sentences, and each sentence into a list of words:
# raw sentences is a list of sentences.
raw_sentences = corpus_raw.split('.')
sentences = []
for sentence in raw_sentences:
    sentences.append(sentence.split())
The code above gives us a list of sentences, where each element is the list of words in that sentence:
print(sentences)
-> [['he', 'is', 'the', 'king'], ['the', 'king', 'is', 'royal'], ['she', 'is', 'the', 'royal', 'queen']]
Next, we generate our training data:
data = []
WINDOW_SIZE = 2
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1]:
            if nb_word != word:
                data.append([word, nb_word])
This produces our (input word, output word) pairs; we set the window size to 2.
print(data)
[['he', 'is'],
['he', 'the'],
['is', 'he'],
['is', 'the'],
['is', 'king'],
['the', 'he'],
['the', 'is'],
.
.
.
]
We now have our training data, but it needs to be converted into a representation the computer can understand, i.e. numbers; that is exactly what the word2int dictionary we built earlier is for.
Going one step further, we turn these numbers into one-hot vectors.
i.e.,
say we have a vocabulary of 3 words : pen, pineapple, apple
where
word2int['pen'] -> 0 -> [1 0 0]
word2int['pineapple'] -> 1 -> [0 1 0]
word2int['apple'] -> 2 -> [0 0 1]
So why represent the words as one-hot vectors? We will come back to that later.
# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp

x_train = []  # input word
y_train = []  # output word
for data_word in data:
    x_train.append(to_one_hot(word2int[data_word[0]], vocab_size))
    y_train.append(to_one_hot(word2int[data_word[1]], vocab_size))

# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
Now we have our x_train and y_train data:
print(x_train)
->
[[ 0. 0. 0. 0. 0. 0. 1.]
[ 0. 0. 0. 0. 0. 0. 1.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 1. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0.]
[ 1. 0. 0. 0. 0. 0. 0.]
[ 1. 0. 0. 0. 0. 0. 0.]]
Their shapes are as follows:
print(x_train.shape, y_train.shape)
->
(34, 7) (34, 7)
# meaning 34 training points, where each point has 7 dimensions
# making placeholders for x_train and y_train
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))
Next, we convert the one-hot input into another, lower-dimensional vector representation, the embedding:
EMBEDDING_DIM = 5 # you can choose your own number
W1 = tf.Variable(tf.random_normal([vocab_size, EMBEDDING_DIM]))
b1 = tf.Variable(tf.random_normal([EMBEDDING_DIM])) #bias
hidden_representation = tf.add(tf.matmul(x,W1), b1)
Next, we take the hidden representation and use it to predict the nearby word; for the prediction we use softmax:
W2 = tf.Variable(tf.random_normal([EMBEDDING_DIM, vocab_size]))
b2 = tf.Variable(tf.random_normal([vocab_size]))
prediction = tf.nn.softmax(tf.add( tf.matmul(hidden_representation, W2), b2))
So the complete model is:
input_one_hot ---> embedded repr. ---> predicted_neighbour_prob
predicted_prob will be compared against a one hot vector to correct it.
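For reference, with vocab_size = 7 and EMBEDDING_DIM = 5, the shapes flowing through the graph built above are as follows (a quick check, not in the original code):

# shape check after building the graph
print(x.shape)                      # (?, 7)  one-hot input
print(hidden_representation.shape)  # (?, 5)  embedded representation
print(prediction.shape)             # (?, 7)  predicted neighbour probabilities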
Now we can train the model:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)  # make sure you do this!

# define the loss function:
cross_entropy_loss = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(prediction), reduction_indices=[1]))

# define the training step:
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy_loss)

n_iters = 10000
# train for n_iters iterations
for _ in range(n_iters):
    sess.run(train_step, feed_dict={x: x_train, y_label: y_train})
    print('loss is : ', sess.run(cross_entropy_loss, feed_dict={x: x_train, y_label: y_train}))
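As a side note, computing log(softmax(...)) in two separate steps like this can be numerically unstable. A minimal alternative sketch, assuming the same TF 1.x API used throughout this post (logits and stable_loss are names introduced here, not part of the original code):

# optional, numerically more stable variant: keep the raw logits and let
# TensorFlow fuse the softmax and the cross-entropy into one op
logits = tf.add(tf.matmul(hidden_representation, W2), b2)
stable_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_label, logits=logits))
# train_step could then be defined on stable_loss instead of cross_entropy_loss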
During training, you will see output like the following in the console:
loss is : 2.73213
loss is : 2.30519
loss is : 2.11106
loss is : 1.9916
loss is : 1.90923
loss is : 1.84837
loss is : 1.80133
loss is : 1.76381
loss is : 1.73312
loss is : 1.70745
loss is : 1.68556
loss is : 1.66654
loss is : 1.64975
loss is : 1.63472
loss is : 1.62112
loss is : 1.6087
loss is : 1.59725
loss is : 1.58664
loss is : 1.57676
loss is : 1.56751
loss is : 1.55882
loss is : 1.55064
loss is : 1.54291
loss is : 1.53559
loss is : 1.52865
loss is : 1.52206
loss is : 1.51578
loss is : 1.50979
loss is : 1.50408
loss is : 1.49861
.
.
.
The loss keeps decreasing and eventually settles around a stable value. Even though the model never becomes very accurate, we don't mind, because what we are really interested in are W1 and b1, i.e. the hidden-layer weights.
Let's take a look at these weights:
print(sess.run(W1))
print('----------')
print(sess.run(b1))
print('----------')
->
[[-0.85421133 1.70487809 0.481848 -0.40843448 -0.02236851]
[-0.47163373 0.34260952 -2.06743765 -1.43854153 -0.14699034]
[-1.06858993 -1.10739779 0.52600187 0.24079895 -0.46390489]
[ 0.84426647 0.16476244 -0.72731972 -0.31994426 -0.33553854]
[ 0.21508843 -1.21030915 -0.13006891 -0.24056002 -0.30445012]
[ 0.17842589 2.08979321 -0.34172744 -1.8842833 -1.14538431]
[ 1.61166084 -1.17404735 -0.26805425 0.74437028 -0.81183684]]
----------
[ 0.57727528 -0.83760375 0.19156453 -0.42394346 1.45631313]
----------
When we multiply a one-hot vector by W1, the result is simply the row of W1 that corresponds to the position of the 1. In other words, W1 acts as a lookup table of word vectors.
In our model we also added a bias b1, so we need to add it in as well.
vectors = sess.run(W1 + b1)
# if you work it out, you will see that it has the same effect as running the node hidden representation
print(vectors)
->
[[-0.74829113 -0.48964909 0.54267412 2.34831429 -2.03110814]
[-0.92472583 -1.50792813 -1.61014366 -0.88273793 -2.12359881]
[-0.69424796 -1.67628145 3.07313657 -1.14802659 -1.2207377 ]
[-1.7077738 -0.60641652 2.25586247 1.34536338 -0.83848488]
[-0.10080346 -0.90931684 2.8825531 -0.58769202 -1.19922316]
[ 1.49428082 -2.55578995 2.01545811 0.31536022 1.52662396]
[-1.02735448 0.72176981 -0.03772151 -0.60208392 1.53156447]]
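As a quick sanity check (this snippet is not in the original post), we can verify the comment above: feeding the one-hot vector for queen through the hidden_representation node returns exactly the row of W1 + b1 that we just stored in vectors.

# sanity check: hidden representation of 'queen' == vectors[word2int['queen']]
queen_one_hot = to_one_hot(word2int['queen'], vocab_size).reshape(1, -1)
hidden = sess.run(hidden_representation, feed_dict={x: queen_one_hot})
print(np.allclose(hidden[0], vectors[word2int['queen']]))  # -> True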
If we want the vector for queen, we can look it up like this:
print(vectors[ word2int['queen'] ])
# say here word2int['queen'] is 2
->
[-0.69424796 -1.67628145 3.07313657 -1.14802659 -1.2207377 ]
Let's write a function that finds the vector closest to a given word's vector; admittedly, this is a very quick and crude way of doing it.
def euclidean_dist(vec1, vec2):
    return np.sqrt(np.sum((vec1 - vec2) ** 2))

def find_closest(word_index, vectors):
    min_dist = 10000  # to act like positive infinity
    min_index = -1
    query_vector = vectors[word_index]
    for index, vector in enumerate(vectors):
        if euclidean_dist(vector, query_vector) < min_dist and not np.array_equal(vector, query_vector):
            min_dist = euclidean_dist(vector, query_vector)
            min_index = index
    return min_index
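As noted, this is crude: Euclidean distance on the raw embeddings also reacts to vector length. A common alternative, sketched here but not part of the original post (cosine_closest is a name introduced for illustration), is cosine similarity:

# hypothetical alternative: find the closest word by cosine similarity
def cosine_closest(word_index, vectors):
    query_vector = vectors[word_index]
    best_index, best_sim = -1, -1.0
    for index, vector in enumerate(vectors):
        if index == word_index:
            continue
        sim = np.dot(vector, query_vector) / (np.linalg.norm(vector) * np.linalg.norm(query_vector))
        if sim > best_sim:
            best_sim, best_index = sim, index
    return best_index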
Next, let's test the words king, queen and royal:
print(int2word[find_closest(word2int['king'], vectors)])
print(int2word[find_closest(word2int['queen'], vectors)])
print(int2word[find_closest(word2int['royal'], vectors)])
->
queen
king
he
We get the following interesting results:
king is closest to queen
queen is closest to king
royal is closest to he
The third result is what you get from training on such a tiny corpus (and it still looks reasonable); with a larger corpus the results would be better. (Note: because the weights are randomly initialised, you may get different results; run it a few times if necessary.)
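If you want reproducible runs, one option (a small sketch assuming the TF 1.x API used here) is to fix the random seeds before building the graph and creating the session:

# fix the seeds before the variables above are created, for reproducible runs
np.random.seed(0)
tf.set_random_seed(0)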
Let's plot these vectors on a chart.
First, we reduce the dimensionality from 5 down to 2 using a dimensionality-reduction technique: tSNE (teesnee!)
from sklearn.manifold import TSNE
model = TSNE(n_components=2, random_state=0)
np.set_printoptions(suppress=True)
vectors = model.fit_transform(vectors)
Then, we normalise the result so that we can view it more easily in matplotlib:
from sklearn import preprocessing

normalizer = preprocessing.Normalizer(norm='l2')  # the norm is passed to the constructor
vectors = normalizer.fit_transform(vectors)
Finally, we plot the chart:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# annotations alone do not autoscale the axes, so set limits that
# cover the normalised vectors (which lie in [-1, 1])
ax.set_xlim(-1.1, 1.1)
ax.set_ylim(-1.1, 1.1)
for word in words:
    print(word, vectors[word2int[word]][1])
    ax.annotate(word, (vectors[word2int[word]][0], vectors[word2int[word]][1]))
plt.show()
From the plot, we can see that she is very close to queen, and that the distance between king and royal is about the same as the distance between king and queen. With a larger corpus we would get a much more complex relationship graph.
The task we gave the neural network was to predict a word's neighbouring words, but we never looked closely at how it does that: to solve this task, the network learns a vector representation for each word that helps it predict the neighbours. Predicting neighbouring words is not an interesting task in itself; what we care about is the hidden-layer vector representation.
To build these representations, the network uses context. In our corpus, king and royal appear as neighbours, and queen and royal appear as neighbours too.
Other tasks can be used to learn word vectors as well; for example, predicting n-grams can also yield very good word vectors. There is a blog post that explains this in detail.
So why did we use neighbouring-word prediction as the task? Because of a well-known model called the skip-gram model: take the center word and ask the network to predict its neighbouring words, which is exactly what we did above. The reverse setup, feeding the neighbouring words in as input and asking the network to predict the center word, is called the continuous bag-of-words (CBOW) model.
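To make the distinction concrete, here is a minimal sketch (not from the original post; cbow_data is a new name) of how the pair-generation loop above could be flipped for a CBOW-style setup, reusing sentences and WINDOW_SIZE:

# sketch: (context words, center word) pairs for a CBOW-style setup
cbow_data = []
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        context = [nb_word
                   for nb_word in sentence[max(word_index - WINDOW_SIZE, 0):
                                           min(word_index + WINDOW_SIZE, len(sentence)) + 1]
                   if nb_word != word]
        cbow_data.append([context, word])  # context -> predict the center word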
I hope this simple tutorial helps someone gain a deeper understanding of what word vectors are.
CoderPai is a platform focused on hands-on algorithms, covering everything from basic algorithms to artificial-intelligence algorithms. If you are interested in putting algorithms into practice, follow us: join the AI practice WeChat group, the AI practice QQ group, the ACM algorithm WeChat group, or the ACM algorithm QQ group. For details, follow the WeChat public account "CoderPai" (coderpai).