The concept of unsupervised learning was introduced earlier. In practice, unsupervised learning is applied quite often.
In terms of typical applications, supervised learning is mostly used for "classification": given some data, make a decision by choosing one option out of a limited set of given possibilities. All kinds of recognition tasks, autonomous driving, and so on fall into this category.
Unsupervised learning, on the other hand, is about "clustering": the algorithm finds patterns in the input data on its own and groups the data according to those patterns, so that samples with similar features end up in the same cluster. Natural language understanding, recommendation algorithms, data profiling and the like belong to this category (in practice the actual implementations lean heavily on semi-supervised learning, but the concepts were originally introduced under unsupervised learning).
Unsupervised learning indeed uses no manual labels, but all input data must retain the inherent patterns that naturally exist in it. Preserving those patterns, or selecting representative ones, often still requires some human effort.
In between the two there is also semi-supervised learning, where for example half of the data is labeled and half is not. The labeled data is classified first, and the unlabeled data is then "clustered" into the known classes. In terms of implementation it either combines the two kinds of algorithms or in practice leans toward supervised learning, so it is not discussed separately here.
We have already walked through quite a few supervised learning examples, but have not yet shown an unsupervised one. Let's dissect one today.
Word vectorization is a fairly typical example of unsupervised learning. The idea is this: in natural language processing (NLP), understanding the meaning of words is an important part of the work. As we have said, the essence of machine learning is mathematical computation, solving equations. Moreover, words vary in length, so following the principle of normalization, the first thing to do is to digitize words into a uniform dimension and scale, i.e. replace each word with a number. Telegraph codes from decades ago were essentially the same idea: frequently used words get shorter codes, so the digitized text is shorter, and because common words come first they can also be looked up faster.
But this also brings a big problem: words originally carry hidden internal meaning. Take man/woman: these are clearly related words, yet if one is digitized to, say, 56 and the other to 34, that internal relationship is completely lost. The same goes for words such as cat/dog/animal, and the lost information is actually an important part of what NLP needs.
The solution offered by word vectorization is therefore to embed all words into a continuous vector space. Words with similar meanings or latent associations end up close to each other in that space, and this distance can serve as a measure of how similar two words are. Hence word vectorization is also known as "word embedding".
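To make the "distance as similarity" idea concrete, here is a minimal sketch (not part of the official example) using made-up 3-dimensional vectors; real embeddings typically have 100 or more dimensions, but the cosine-similarity measure works the same way:

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-dimensional embeddings, purely for illustration.
vec = {
    'cat':    np.array([0.9, 0.1, 0.2]),
    'dog':    np.array([0.8, 0.2, 0.1]),
    'banana': np.array([0.1, 0.9, 0.7]),
}

print(cosine_similarity(vec['cat'], vec['dog']))     # high: related words
print(cosine_similarity(vec['cat'], vec['banana']))  # low: unrelated words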
Because word vectorization is such an important piece of work, TensorFlow officially provides a whole set of examples and tools, from low level to high level.
Almost all word-vectorization algorithms rely on the distributional hypothesis, whose core idea is that words appearing in the same contexts have similar meanings. The concept may sound vague, so here is an example: one sentence is "I ate an apple", another is "I ate a banana". As unsupervised learning, neither sentence carries any label, but a trained model should understand that "apple" and "banana" are highly similar; in other words, the two words should be close to each other in the vector space.
To turn this idea into an algorithm, there are usually two approaches: count-based methods and predictive methods.
Count-based methods: count a word and its neighboring words in a large corpus, recording statistics such as co-occurrence frequency, and then map all words into the vector space based on these quantities.
Predictive methods: assume a vector space already exists, and use the data already in that space to predict a word from its neighboring words, adjusting the word's position in the vector space whenever the prediction is wrong.
In practice the two approaches are often combined; a toy sketch of the counting step is shown below.
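As an illustration of the counting step only (this is not what word2vec does, and the notion of "same context" here is deliberately crude), co-occurrence counts over two tiny "sentences" might be gathered like this:

from collections import Counter
from itertools import combinations

# A toy corpus; in reality this would be a large collection of text.
sentences = [
    "i ate an apple".split(),
    "i ate a banana".split(),
]

# Count how often two words appear in the same sentence. A real count-based
# method would use a sliding window and weight counts by distance, then map
# the resulting statistics into a vector space.
cooccur = Counter()
for sent in sentences:
    for w1, w2 in combinations(sent, 2):
        cooccur[(w1, w2)] += 1
        cooccur[(w2, w1)] += 1

print(cooccur[('i', 'ate')])      # 2: the pair co-occurs in both sentences
print(cooccur[('apple', 'ate')])  # 1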
The word2vec example uses a probabilistic language model based on maximum likelihood to predict the association between consecutive words. For material on maximum likelihood estimation see the reference links at the bottom; a small set of formulas implements this algorithm.
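Roughly, following the derivation in the official TensorFlow word2vec tutorial, the model assigns the target word w_t a probability given its context h via a softmax over a compatibility score, and training maximizes the log-likelihood. The NCE loss used in the code below is a cheaper approximation of this objective that avoids summing over the whole vocabulary:

% Probability of the target word w_t given its context h, as a softmax over
% the vocabulary V, and the corresponding log-likelihood training objective.
P(w_t \mid h) = \mathrm{softmax}\big(\mathrm{score}(w_t, h)\big)
             = \frac{\exp\{\mathrm{score}(w_t, h)\}}{\sum_{w' \in V} \exp\{\mathrm{score}(w', h)\}}

J_{\mathrm{ML}} = \log P(w_t \mid h)
               = \mathrm{score}(w_t, h) - \log \sum_{w' \in V} \exp\{\mathrm{score}(w', h)\}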
Then, for any given word, we look at its context. How the context is defined is configurable in the program; here we use one word to the left and one word to the right of the current word as its context. Take the sentence:
the quick brown fox jumped over the lazy dog
Grouping each word together with its context gives:
([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
We can then use "the" and "brown" to predict "quick", and "quick" and "fox" to predict "brown". This kind of prediction is called the continuous bag-of-words model (CBOW). The other direction is the opposite: with the same example, we use "quick" to predict "the" and "brown"; this is called the Skip-Gram model.
In terms of training cost, CBOW suits smaller data sets and tends to be more accurate there (it predicts one word from several words), while Skip-Gram suits larger data sets (it predicts several words from one word).
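Before diving into the official code, here is a minimal, framework-free sketch (my own illustration, not taken from the example) of how such context/target pairs are produced for the sentence above, with one context word on each side:

# How context/target pairs are formed with a window of one word on each
# side (skip_window = 1), independent of TensorFlow.
sentence = "the quick brown fox jumped over the lazy dog".split()
skip_window = 1

cbow_pairs = []       # ([left, right], target)  -> CBOW direction
skipgram_pairs = []   # (target, context_word)   -> Skip-Gram direction

for i in range(skip_window, len(sentence) - skip_window):
    context = [sentence[i - 1], sentence[i + 1]]
    target = sentence[i]
    cbow_pairs.append((context, target))
    for c in context:
        skipgram_pairs.append((target, c))

print(cbow_pairs[:2])      # [(['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown')]
print(skipgram_pairs[:4])  # [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]

The complete official basic example, word2vec_basic.py, follows.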
#!/usr/bin/env python
# -*- coding=UTF-8 -*-
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Basic word2vec example."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math
import os
import random
from tempfile import gettempdir
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

# Step 1: Download the data.
url = 'http://mattmahoney.net/dc/'


# Downloads the corpus from the URL above.
# To save time the file was downloaded by hand and the call to this function
# is commented out below, so it is not re-downloaded on every run.
# pylint: disable=redefined-outer-name
def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  local_filename = os.path.join(gettempdir(), filename)
  if not os.path.exists(local_filename):
    local_filename, _ = urllib.request.urlretrieve(url + filename,
                                                   local_filename)
  statinfo = os.stat(local_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception('Failed to verify ' + local_filename +
                    '. Can you get to it with a browser?')
  return local_filename


# Actual download URL of the corpus: http://mattmahoney.net/dc/text8.zip
# It was downloaded by hand into the current directory, so filename is set
# directly and maybe_download is no longer called.
#filename = maybe_download('text8.zip', 31344016)
filename = "./text8.zip"


# Read all the data from the first (and only) file inside the zip package.
# The data consists of words separated by single spaces, with no punctuation.
# The word order of the original articles is preserved, i.e. within each
# sentence the words still appear in their original order; only the
# punctuation has been stripped.
# To get a feel for it, unzip text8.zip and look at the text file; it is
# huge, so use `more` to view only part of it.
# Read the data into a list of strings.
def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data


# Read all words into a list of strings.
vocabulary = read_data(filename)
print('Data size', len(vocabulary))

# The full corpus in this package contains 17005207 words; the 50000 below
# limits the effective vocabulary for the sake of demo speed.
# Step 2: Build the dictionary and replace rare words with UNK token.
vocabulary_size = 50000


def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  count = [['UNK', -1]]
  # Count identical words and sort them so that frequent words come first.
  # The very first entry is of course UNK, followed by the/of/and/one;
  # the top 5 words are printed further below...
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    # Assign each word a unique numeric code as the dictionary grows.
    # UNK is entry 0 with code 0, because when UNK is inserted the dictionary
    # is still empty and len() is 0.
    # When printed, entries are listed by word, so the order looks jumbled;
    # the reverse dictionary (number first) makes the ordering obvious.
    dictionary[word] = len(dictionary)
  data = list()
  # data ends up being the digitized words, i.e. the digitized original text:
  # each element, in original order, is the numeric code of that word.
  # The code is looked up in dictionary, i.e. the result of the word
  # digitization done just above in this function.
  unk_count = 0
  for word in words:
    index = dictionary.get(word, 0)
    if index == 0:  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  # The reversed dictionary maps numbers back to words, for two-way lookup.
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary


# Filling 4 global variables:
# data - list of codes (integers from 0 to vocabulary_size-1).
#   This is the original text but words are replaced by their codes
# count - map of words(strings) to count of occurrences
# dictionary - map of words(strings) to their codes(integers)
# reverse_dictionary - maps codes(integers) to words(strings)
# build_dataset fills the 4 global variables described in the comments above.
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocabulary_size)
# After the digitization above the original text is no longer needed;
# delete it to save memory.
del vocabulary  # Hint to reduce memory.
# count holds the occurrence statistics; print the 5 most common words.
print('Most common words (+UNK)', count[:5])
# The first 10 digitized words, plus the original words looked up from the
# reverse dictionary. The list-comprehension lookup is a Python idiom that
# is rarely seen in other languages.
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

data_index = 0


# Step 3: Function to generate a training batch for the skip-gram model.
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  if data_index + span > len(data):
    data_index = 0
  buffer.extend(data[data_index:data_index + span])
  data_index += span
  for i in range(batch_size // num_skips):
    context_words = [w for w in range(span) if w != skip_window]
    words_to_use = random.sample(context_words, num_skips)
    for j, context_word in enumerate(words_to_use):
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[context_word]
    if data_index == len(data):
      buffer.extend(data[0:span])
      data_index = span
    else:
      buffer.append(data[data_index])
      data_index += 1
  # Backtrack a little bit to avoid skipping words in the end of a batch
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels


# Generate a first, very small batch of training data and print it below,
# so we can check by eye that the generated samples look reasonable.
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

# Step 4: Build and train a skip-gram model.

# The constants defined here are the batch and data-set sizes actually used
# for training.
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.
num_sampled = 64      # Number of negative examples to sample.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. These 3 variables are used only for
# displaying model accuracy, they don't affect calculation.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)


graph = tf.Graph()

with graph.as_default():

  # Input data.
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  # Explanation of the meaning of NCE loss:
  #   http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)

  # Add variable initializer.
  init = tf.global_variables_initializer()

# Step 5: Begin training.
num_steps = 100001

with tf.Session(graph=graph) as session:
  # We must initialize all variables before we use them.
  init.run()
  print('Initialized')

  average_loss = 0
  for step in xrange(num_steps):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    # Data that is fed in batch by batch during the TensorFlow run must be
    # defined with tf.placeholder; here everything to be fed is packed into
    # a dict first.
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    # Run the graph, feeding in the data batch by batch.
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
      if step > 0:
        average_loss /= 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      # What is printed here is the loss value averaged over every 2000 batches.
      print('Average loss at step ', step, ': ', average_loss)
      average_loss = 0

    # Note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in xrange(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
        log_str = 'Nearest to %s:' % valid_word
        for k in xrange(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log_str = '%s %s,' % (log_str, close_word)
        print(log_str)
  final_embeddings = normalized_embeddings.eval()

# Step 6: Visualize the embeddings.


# pylint: disable=missing-docstring
# Function to draw visualization of distance between embeddings.
def plot_with_labels(low_dim_embs, labels, filename):
  assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
  plt.figure(figsize=(18, 18))  # in inches
  for i, label in enumerate(labels):
    x, y = low_dim_embs[i, :]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

  plt.savefig(filename)


try:
  # pylint: disable=g-import-not-at-top
  from sklearn.manifold import TSNE
  import matplotlib.pyplot as plt

  tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000,
              method='exact')
  plot_only = 500
  low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
  labels = [reverse_dictionary[i] for i in xrange(plot_only)]
  plot_with_labels(low_dim_embs, labels, './tsne.png')

except ImportError as ex:
  print('Please install sklearn, matplotlib, and scipy to show embeddings.')
  print(ex)
The source code does not use the style of starting from a main() function; it is written in a mostly procedural way: a function is defined for each block of functionality, initialization and calls to that function follow immediately at the Python module level, then come the next function and its calls.
Apart from the parts we have seen before, the source code carries fairly extensive comments. Below, a few key parts are explained in more detail.
Before the explanation, to make things easier to follow, here is the beginning of the corpus:
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institutions anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist society might work vary considerably especially with respect to economics there is also disagreement about how a free society might be brought about origins and predecessors kropotkin and others argue that before recorded history human society was organized on anarchist principles most anthropologists follow kropotkin and engels in believing that hunter gatherer bands were egalitarian and lacked division of labour accumulated wealth or decreed law and had equal access to resources william godwin anarchists including the the anarchy organisation and rothbard find anarchist attitudes in taoism from ancient china kropotkin found similar ideas in stoic zeno of citium according to kropotkin zeno repudiated the omnipotence of the state its intervention and regimentation and proclaimed the sovereignty of the moral law of the individual the anabaptists of one six th century europe are sometimes considered to be religious forerunners of modern anarchism bertrand russell in his history of western philosophy writes that the anabaptists repudiated all law since they held that the good man will be guided at every moment by the holy spirit from
The corpus is one continuous text file in which words are separated by single spaces; there is no punctuation and no control characters such as line breaks (so the excerpt above, which wraps over many lines in a terminal, appears here as a single line).
Following the official walkthrough, we also divide the program into 6 parts:
If the corpus is not present locally, it is downloaded from the web at http://mattmahoney.net/dc/text8.zip. As in earlier examples, because the compressed package is over 30 MB, I downloaded the corpus by hand and modified the program slightly so that it opens the text8.zip file from the current directory, which saves time.
One thing this example does better than the advanced example later on is that it uses the zipfile package to read the corpus directly from the archive, so there is no need to unpack it; unpacked, it is a text file of more than 100 MB.
The words are read into the vocabulary list, one word per element, and the order of the list is the order in which the words appear in the corpus.
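As a quick sanity check (not part of the original script), printing the head of the list right after read_data returns should match the corpus excerpt quoted above, assuming the text8 package:

# Quick check that vocabulary really is the corpus in original order.
print(vocabulary[:7])
# expected: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']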
Next comes the basic data preparation. For the sake of the demo, only the 50,000 most frequent words are kept as the effective vocabulary for training. Word occurrence frequencies are counted over the data and used to build the dictionary `dictionary`: more frequent words come first, and a word's rank in the dictionary becomes its numeric code. Words that occur very rarely are replaced by "UNK" (such rare words have too few reference contexts to be trained or predicted on, so they are simply replaced by UNK, which effectively discards them). UNK is the first entry in the dictionary, with code 0; the rest are ordered by frequency. When the program starts it prints the 5 most frequent words, which should look like this:
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
This also shows that "the" has code 1, "of" is 2, "and" is 3, and "one" is 4.
The dictionary is then used to digitize the entire corpus. The result is stored in data, in which rare words have been replaced by the UNK code. When finished, it looks something like this:
5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156
This group of numbers represents the first 10 words of the original text:
'anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against'
Finally, since digitized words will later need to be turned back into words, a reversed_dictionary is also generated, whose keys are the numbers and whose values are the words, for reverse lookup.
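Condensed to its essentials, build_dataset does something like the following toy sketch (simplified from the real function, using a tiny word list and vocabulary size):

import collections

# A stripped-down version of what build_dataset does, on a toy word list.
words = "the quick brown fox jumped over the lazy dog".split()
vocab_size = 5  # keep only the 4 most common words plus UNK

count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocab_size - 1))
dictionary = {word: i for i, (word, _) in enumerate(count)}
reversed_dictionary = {i: word for word, i in dictionary.items()}

# Digitize the text: any word outside the vocabulary becomes code 0 (UNK).
data = [dictionary.get(word, 0) for word in words]

print(dictionary)   # e.g. {'UNK': 0, 'the': 1, 'quick': 2, ...}
print(data)         # the corpus as integer codes
print([reversed_dictionary[i] for i in data])  # back to words, rare ones as 'UNK'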
A function is defined to generate the training data set; given how the training works, the data is generated in batches. After the definition, the function is tried out with a very small size (8 in the program). The key here is to understand the function's 3 parameters: batch_size is the number of samples generated per batch; num_skips is how many (input, label) pairs are drawn from each window, i.e. how many times the center word is reused before the window slides on; skip_window is how many words on each side of the current word count as its context. As explained earlier, the Skip-Gram algorithm uses the current word to predict this context with the trained model.
One more point worth explaining: in earlier examples we kept stressing the importance of normalization, yet in word2vec there is essentially no normalization step beyond the digitization itself. There are several reasons, the main one being that in the earlier examples we cared about quantities, where fitting to a reasonably close value already counts as a good result. For digitized words, however, each integer corresponds to one word; there can be no fractions, and even a difference of 1 means a completely different word. That is why this example has no traditional normalization such as conversion to floating-point values.
Here is a small example:
#!/usr/bin/env python
# coding=utf-8

import tensorflow as tf
import numpy as np

input_ids = tf.placeholder(dtype=tf.int32, shape=[None])

# Define a 5x5 identity matrix; see the first block of the output below.
embedding = tf.Variable(np.identity(5, dtype=np.int32))
# Use embedding_lookup to index into the matrix; the indices come from input_ids.
input_embedding = tf.nn.embedding_lookup(embedding, input_ids)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
print(embedding.eval())
print(sess.run(input_embedding, feed_dict={input_ids: [1, 2, 3, 0, 3, 2, 1]}))
The output:
embedding = [[1 0 0 0 0]
             [0 1 0 0 0]
             [0 0 1 0 0]
             [0 0 0 1 0]
             [0 0 0 0 1]]
input_embedding = [[0 1 0 0 0]
                   [0 0 1 0 0]
                   [0 0 0 1 0]
                   [1 0 0 0 0]
                   [0 0 0 1 0]
                   [0 0 1 0 0]
                   [0 1 0 0 0]]
What embedding_lookup does is take the ids in input_ids, look up the corresponding rows of embedding, and stack those rows into a new matrix that is returned. In the example above, rows 1, 2, 3, 0, 3, 2, 1 of embedding are recombined into a 7-row matrix and returned as input_embedding.
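For this simple, unpartitioned case the operation is equivalent to plain row indexing in NumPy, which may make the behavior easier to see:

import numpy as np

# Conceptually, tf.nn.embedding_lookup(embedding, input_ids) is the same as
# fancy row indexing in NumPy (ignoring TensorFlow's partitioning features).
embedding = np.identity(5, dtype=np.int32)
input_ids = [1, 2, 3, 0, 3, 2, 1]

print(embedding[input_ids])  # the same 7x5 matrix as input_embedding above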
The advanced version of the source code is an example that is basically ready for practical use; its usage is described on the project page, but running it on macOS has a few issues, explained here.
The first issue is compilation. The TensorFlow 1.4.1 version I use does not have the method tf.sysconfig.get_compile_flags(), so the correct compile flags cannot be obtained that way; in the end I wrote a script to do the compilation:
#!/bin/sh
TF_CFLAGS="-I/usr/local/lib/python2.7/site-packages/tensorflow/include"
TF_LFLAGS="-L/usr/local/lib/python2.7/site-packages/tensorflow"
g++ -std=c++11 -shared word2vec_ops.cc word2vec_kernels.cc -o word2vec_ops.so -fPIC ${TF_CFLAGS} ${TF_LFLAGS} -O2 -D_GLIBCXX_USE_CXX11_ABI=0 -undefined dynamic_lookup
The approach is to locate the INCLUDE and LIB paths by hand, set them as constants, and pass them to the compiler directly.
Note that when compiling on macOS you must use -undefined dynamic_lookup, otherwise linking will fail.
The .so file produced by the compilation is then loaded and called from the Python program like this:
word2vec = tf.load_op_library(os.path.join(os.path.dirname(os.path.realpath(__file__)), 'word2vec_ops.so'))
...
(words, counts, words_per_epoch, current_epoch, total_words_processed,
 examples, labels) = word2vec.skipgram_word2vec(filename=opts.train_data,
                                                batch_size=opts.batch_size,
                                                window_size=opts.window_size,
                                                min_count=opts.min_count,
                                                subsample=opts.subsample)
Preparing the data files with the officially given commands works without problems:
curl http://mattmahoney.net/dc/text8.zip > text8.zip
unzip text8.zip
curl https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip > source-archive.zip
unzip -p source-archive.zip word2vec/trunk/questions-words.txt > questions-words.txt
rm text8.zip source-archive.zip
The question set used to evaluate the training results is hosted on Google's servers, so a proxy may be needed to download it.
Finally, the output of running word2vec_optimized.py looks like this:
2018-01-16 13:17:14.277603: I word2vec_kernels.cc:200] Data file: /Users/andrew/dev/tensorFlow/word2vec/text8 contains 100000000 bytes, 17005207 words, 253854 unique words, 71290 unique frequent words.
Data file: /Users/andrew/dev/tensorFlow/word2vec/text8
Vocab size: 71290 + UNK
Words per epoch: 17005207
Eval analogy file: /Users/andrew/dev/tensorFlow/word2vec/questions-words.txt
Questions: 17827
Skipped: 1717
Epoch  1 Step   150943: lr = 0.024 words/sec = 31527
Eval 1469/17827 accuracy =  8.2%
Epoch  2 Step   301913: lr = 0.023 words/sec = 25120
Eval 2395/17827 accuracy = 13.4%
Epoch  3 Step   452887: lr = 0.021 words/sec =  8842
Eval 3014/17827 accuracy = 16.9%
Epoch  4 Step   603871: lr = 0.020 words/sec =  6615
Eval 3532/17827 accuracy = 19.8%
Epoch  5 Step   754815: lr = 0.019 words/sec =  3007
Eval 3994/17827 accuracy = 22.4%
Epoch  6 Step   905787: lr = 0.018 words/sec = 26590
Eval 4320/17827 accuracy = 24.2%
Epoch  7 Step  1056767: lr = 0.016 words/sec = 35439
Eval 4714/17827 accuracy = 26.4%
Epoch  8 Step  1207755: lr = 0.015 words/sec =   401
Eval 4965/17827 accuracy = 27.9%
Epoch  9 Step  1358735: lr = 0.014 words/sec = 36991
Eval 5276/17827 accuracy = 29.6%
Epoch 10 Step  1509744: lr = 0.013 words/sec = 25069
Eval 5415/17827 accuracy = 30.4%
Epoch 11 Step  1660729: lr = 0.011 words/sec = 28271
Eval 5649/17827 accuracy = 31.7%
Epoch 12 Step  1811667: lr = 0.010 words/sec = 29973
Eval 5880/17827 accuracy = 33.0%
Epoch 13 Step  1962606: lr = 0.009 words/sec = 10225
Eval 6015/17827 accuracy = 33.7%
Epoch 14 Step  2113546: lr = 0.008 words/sec = 21419
Eval 6270/17827 accuracy = 35.2%
Epoch 15 Step  2264489: lr = 0.006 words/sec = 27059
Eval 6434/17827 accuracy = 36.1%
The program looks a lot more complex. Its main purpose is to show how time-consuming operations, and algorithms that TensorFlow does not provide, can be written in C++ as a TensorFlow extension op in order to implement a complex machine learning model. So the source code is not discussed in detail here; interested readers can analyze it on their own.
Finally, let's look at the format of the question set used for evaluation:
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
Athens Greece Helsinki Finland
Athens Greece Islamabad Pakistan
Athens Greece Kabul Afghanistan
Athens Greece London England
Athens Greece Madrid Spain
Athens Greece Moscow Russia
Athens Greece Oslo Norway
Athens Greece Ottawa Canada
Athens Greece Paris France
Athens Greece Rome Italy
Athens Greece Stockholm Sweden
Athens Greece Tehran Iran
Athens Greece Tokyo Japan
Baghdad Iraq Bangkok Thailand
Baghdad Iraq Beijing China
Baghdad Iraq Berlin Germany
Baghdad Iraq Bern Switzerland
...
Lines beginning with a colon ":" are comment lines and are skipped by the program.
After that come analogy pairs of the form capital-country capital-country, four words per line. The evaluation uses the first three words to predict the fourth; every correct prediction adds to the accuracy count. Considering that the training corpus text8 and the evaluation question set questions-words.txt are completely different, unrelated data sets, reaching a prediction accuracy of 36.1% is no small feat (and this run did not complete the full training, so the accuracy could still be improved).
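Conceptually the evaluation is just vector arithmetic plus a nearest-neighbour search over the normalized embeddings; the official script implements it inside the custom op and the TensorFlow graph, but a rough NumPy sketch (assuming L2-normalized embeddings and the dictionary/reverse_dictionary mappings from earlier, with lower-cased words) looks like this:

import numpy as np

def predict_fourth(emb, dictionary, reverse_dictionary, a, b, c):
    """Analogy a:b :: c:? -- return the word whose vector is closest to
    vec(b) - vec(a) + vec(c), excluding the three question words."""
    target = emb[dictionary[b]] - emb[dictionary[a]] + emb[dictionary[c]]
    # Cosine similarity against every word (rows of emb assumed normalized).
    sims = np.dot(emb, target / np.linalg.norm(target))
    for idx in np.argsort(-sims):
        word = reverse_dictionary[idx]
        if word not in (a, b, c):
            return word

# e.g. predict_fourth(final_embeddings, dictionary, reverse_dictionary,
#                     'athens', 'greece', 'baghdad')  -> ideally 'iraq'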
Thanks to this property, word vectorization is also often used for intelligent retrieval over call-center knowledge bases, as part of some implementations of question-answering bots.
(To be continued...)
word2vec tutorial from the TensorFlow Chinese community
The Illustrated word2vec
Maximum likelihood estimation
Dependency-Based Word Embeddings