【一】Statement
This post is based on the official TensorFlow tutorial (https://tensorflow.google.cn/tutorials/sequences/text_generation), with some additional explanatory details added.
【二】Overview
1. tf.keras differs from keras in three notable ways (see the sketch after this list):
1) The optimizer must come from the tf.train module; a keras optimizer will not work.
2) tf.keras saves models in the checkpoint format by default, not h5.
3) For training and inference, tf.keras accepts a tf.data.Dataset directly as the input data.
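The three differences in one minimal sketch, assuming a TF 1.x-era tf.keras with eager execution enabled; the toy model and random data are made up purely for illustration:

# -*- coding:utf-8 -*-
import tensorflow as tf

tf.enable_eager_execution()

# A toy dataset of (feature, label) pairs
xs = tf.random_normal([128, 10])
ys = tf.random_uniform([128], maxval=2, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((xs, ys)).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(2),
])

# (1) the optimizer comes from tf.train, not keras.optimizers
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.losses.sparse_softmax_cross_entropy)

# (3) a tf.data.Dataset can be passed straight to fit()
model.fit(dataset, epochs=1, steps_per_epoch=4)

# (2) save_weights defaults to the TensorFlow checkpoint format;
#     pass save_format='h5' to get a keras h5 file instead
model.save_weights('./toy_ckpt')                  # checkpoint format
model.save_weights('./toy.h5', save_format='h5')  # h5 format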
2. The official TensorFlow example generates text character by character: given a sequence, the model predicts the next character. It therefore has no idea how words are spelled or how characters combine into words; being character-level, it only ever predicts the next character, so it may generate words that do not exist. A tiny illustration of the setup follows.
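A minimal illustration of the character-level training pairs (the word "hello" is just an example): at every position the model is trained to predict the character that comes next.

sentence = "hello"
inputs, targets = sentence[:-1], sentence[1:]  # "hell" / "ello"
for x, y in zip(inputs, targets):
    print("input: %s --> predict next char: %s" % (x, y))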
3. The model has only three layers (char embedding, GRU, FC), yet its parameter count is large and training is very slow (roughly half an hour per epoch on an i7 CPU). Note also that the char embedding is trained from scratch here, rather than being pre-trained with fastText or gensim and then fine-tuned; a sketch of the pre-trained alternative follows.
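For contrast, a hypothetical sketch of what seeding the embedding from pre-trained vectors would look like. The pretrained matrix below is random stand-in data, not actual fastText/gensim output; in practice each row would be looked up from the trained vectors.

import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 65, 256

# Stand-in for a matrix exported from gensim/fastText, one row per character
pretrained = np.random.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[pretrained],  # initialize from the pre-trained matrix
    trainable=True)        # keep training it, i.e. fine-tuning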
【三】The code is as follows:
# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np
import os
import time

tf.enable_eager_execution()

# 1. Download the data
path = tf.keras.utils.get_file('shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# 2. Preprocessing: text is a single string
with open(path) as f:
    text = f.read()

# 3. Extract every character that appears in the text; note vocab is a list
vocab = sorted(set(text))

# 4. Build the char --> int mapping
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text])

# 5. Use Dataset.batch to split the text into fixed-length sequences
seq_length = 100
examples_per_epoch = len(text) // seq_length
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
# seq_length+1 because of how inputs and labels are generated below:
# labels are offset from inputs by one character
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)

# 6. Split each sequence into inputs and labels.
#    E.g. for "hello": inputs = "hell", labels = "ello"
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

# 7. Group the sequences into batches
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch // BATCH_SIZE
BUFFER_SIZE = 10000
# drop_remainder should generally be True: when the last group of examples
# is not enough to fill a batch, it is discarded
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# 8. Build the model
vocab_size = len(vocab)  # length of the vocabulary in chars
embedding_dim = 256      # the embedding dimension
rnn_units = 1024         # number of RNN units

model = tf.keras.Sequential()
# This is a char-level embedding, so the weight matrix is vocab_size * embedding_dim
model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                    batch_input_shape=[BATCH_SIZE, None]))
model.add(tf.keras.layers.GRU(units=rnn_units, return_sequences=True,
                              recurrent_initializer='glorot_uniform', stateful=True))
model.add(tf.keras.layers.Dense(units=vocab_size))
model.summary()

# 9. Configure the model
# The optimizer must come from tf.train, not from keras
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.losses.sparse_softmax_cross_entropy)

# 10. Set up the callbacks
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix, save_weights_only=True)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs/')

# 11. Train the model. repeat() makes the dataset loop indefinitely;
#     otherwise there may not be enough data for 30 epochs
model.fit(dataset.repeat(), epochs=30, steps_per_epoch=steps_per_epoch,
          callbacks=[checkpoint_callback, tensorboard_callback])

# 12. Save the model
os.makedirs('./models', exist_ok=True)
# Save in keras (h5) format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn.h5', save_format='h5')
# Save in the TensorFlow checkpoint format
model.save_weights(filepath='./models/gen_text_with_char_on_rnn_check_point')

# Rebuild the model with batch size 1 before generating: the training model
# is stateful with batch_input_shape=[64, None], so it cannot take a single
# sequence. Rebuild it identically and load the trained weights back in.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                    batch_input_shape=[1, None]))
model.add(tf.keras.layers.GRU(units=rnn_units, return_sequences=True,
                              recurrent_initializer='glorot_uniform', stateful=True))
model.add(tf.keras.layers.Dense(units=vocab_size))
model.load_weights('./models/gen_text_with_char_on_rnn.h5')

# 13. Generate text with the model
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a multinomial distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))

print(generate_text(model, start_string="ROMEO: "))
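To make the temperature comment in step 13 concrete, a small standalone sketch (the logits are made-up numbers): dividing the logits by a temperature below 1 sharpens the sampling distribution, while a temperature above 1 flattens it.

import numpy as np

def sample_probs(logits, temperature):
    # softmax of the temperature-scaled logits
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, sample_probs(logits, t).round(3))
# lower temperature --> probability concentrates on the top logit
# (more predictable text); higher temperature --> flatter (more surprising)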
【四】Summary
1. For more on tf.keras, see the official guide (https://tensorflow.google.cn/guide/keras).
2. For more on tf.data, see the official guide (https://tensorflow.google.cn/guide/datasets) and another blog post (https://my.oschina.net/u/3800567/blog/1637798).
3. You can use tf.keras exclusively and drop standalone keras: the two offer the same functionality and interfaces, while tf.keras provides better integration with TensorFlow.