FLORIZEL:
Should she kneel be?
In shall not weep received; unleased me
And unrespective greeting than dwell in, thee,
look’d on me, son in heavenly properly.
Who wrote this, Shakespeare or a machine learning model?
The answer is the latter! The passage above was produced by a recurrent neural network trained with TensorFlow for 30 epochs, given the seed string "FLORIZEL:". In this article, I'll explain and share the code for training a neural network to write Shakespearean plays, or anything else you'd like it to write!
First, import some basic libraries:
import tensorflow as tf
import numpy as np
import os
import time
TensorFlow makes the works of Shakespeare available as a downloadable dataset. If you're working in an online environment such as Kaggle, make sure you're connected to the internet.
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
The data needs to be decoded as UTF-8.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))
[Output]:
Length of text: 1115394 characters
That's a lot of data to work with!
Let's see what the first 250 characters look like:
print(text[:250])
First, let's see how many distinct characters the file contains:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))
[Output]:
65 unique characters
Before training, the strings need to be mapped to a numerical representation.
Below we create two lookup tables: one that maps characters to numbers, and one that maps numbers back to characters.
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])
Let's take a look at the mapping:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')
[Output]:
{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '3' :   9,
  ':' :  10,
  ...
}
Every distinct character now has its own number.
Let's see how the mapping handles the first two words of the text, 'First Citizen':
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))
These words are converted into a vector of numbers, which can easily be converted back to text with the integer-to-character lookup.
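For example, something like this round-trips the first word back into text using the idx2char array defined above:

# Map the first 13 integers back to characters and join them into a string
print(''.join(idx2char[text_as_int[:13]]))  # -> First Citizen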
Given a sequence of characters, the model should ideally predict the most likely next character.
The text will be split into sequences, with each input sequence containing seq_length characters from the text.
The target for any input sequence is the same sequence shifted one character to the right.
For example, given the input "Hell", the target would be "ello", together forming the word "Hello".
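To make the shift concrete, here is a tiny standalone illustration in plain Python (not part of the training pipeline):

# The input drops the last character; the target drops the first.
sample = "Hello"
print(sample[:-1], "->", sample[1:])  # prints: Hell -> ello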
First, we can use TensorFlow's tf.data.Dataset.from_tensor_slices function to turn the text vector into a stream of character indices.
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()])
[Output]:
F
i
r
s
t
The batch method groups these individual characters into sequences of a fixed size, forming chunks of the text.
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))
[Output]:
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
For each sequence, we duplicate and shift it using the map method to form an input and a target.
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
The dataset now holds the input and target pairs we want.
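The printout below can be produced with a short loop like this (a sketch along the lines of the TensorFlow text-generation tutorial; it also defines input_example and target_example, which the step-by-step printout further down relies on):

# Take one (input, target) example from the dataset and show it as text
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

[Output]: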
Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
Each index of these vectors is processed as one time step: for the input at step 0, the model receives the numeric index for 'F' and tries to predict 'i' as the next character. At the next time step it does the same thing, but the RNN considers not only the previous steps, but also the character it has just predicted.
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
[Output]:
Step 0
input: 18 ('F')
expected output: 47 ('i')
Step 1
input: 47 ('i')
expected output: 56 ('r')
Step 2
input: 56 ('r')
expected output: 57 ('s')
Step 3
input: 57 ('s')
expected output: 58 ('t')
Step 4
input: 58 ('t')
expected output: 1 (' ')
TensorFlow's tf.data can be used to split the text into more manageable sequences, but first the data needs to be shuffled and packed into batches.
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset
[Output]:
<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
Finally, we can build the model. Let's first set a few important variables:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024
The model will have an embedding (input) layer that maps each character's index to a vector of embedding_dim dimensions. It will have a GRU layer (which could be replaced with an LSTM layer) with units = rnn_units. Finally, the output layer will be a standard fully connected layer with vocab_size outputs.
The function below lets us build the model quickly and cleanly.
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
Assemble the model architecture by calling the function.
model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)
Let's summarize the model to see how many parameters it has.
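The table below comes from Keras's built-in model.summary():

model.summary()

[Output]: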
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (64, None, 256) 16640 _________________________________________________________________ gru (GRU) (64, None, 1024) 3938304 _________________________________________________________________ dense (Dense) (64, None, 65) 66625 ================================================================= Total params: 4,021,569 Trainable params: 4,021,569 Non-trainable params: 0 _________________________________________________________________
Four million parameters! We'll want to train this for a while.
The problem can now be treated as a classification problem.
Given the previous RNN state and the input at the current time step, predict the class of the next character.
Accordingly, we attach a sparse categorical cross-entropy loss function and the Adam optimizer.
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Take one batch to sanity-check the untrained model's loss
# (this example batch is assumed here; it was implied but not shown in the original post)
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)

example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss: ", example_batch_loss.numpy().mean())

model.compile(optimizer='adam', loss=loss)
[Output]:
Prediction shape: (64, 100, 65) # (batch_size, sequence_length, vocab_size)
scalar_loss: 4.1746616
Training the model, especially on a large dataset like the plays of Shakespeare, takes a long time. Ideally, we don't want to retrain it every time we need a prediction. tf.keras.callbacks.ModelCheckpoint saves the weights at certain checkpoints during training to files that can later be loaded into a fresh model. This also comes in handy if training is interrupted for any reason.
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
EPOCHS = 30

history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
Training should take about six hours. For faster but less impressive results, the number of epochs can be reduced to 10 (anything below 5 will produce pure garbage).
Restore the weights from the latest checkpoint:
tf.train.latest_checkpoint(checkpoint_dir)
With these weights, we can rebuild the model:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))
The text generation step:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    # Low temperatures results in more predictable text.
    # Higher temperatures results in more surprising text.
    # Experiment to find the best setting.
    temperature = 1.0

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)

        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (start_string + ''.join(text_generated))
Finally, given a start string, we can generate some interesting text.
Now, enjoy two plays written by RNNs, one trained for 10 epochs and the other for 30.
print(generate_text(model, start_string=u"ROMEO: "))
ROMEO: how I, away too put That you shall have thieffort, are but love.
JULIET: Go, fight, sir: we say ‘Ay,’ and alack to stand and not to go to; And washt us him to-domm. Ay, my ows young; a man hear from his monsher to thee.
KING RICHARD III: Come, cease. O broteld the costime’s deforment! Thou wilt was quite.
PAULINA: I would you say the hour! Ah, hole for your company: But, good my lord; we have a king, of peace?
BALTHASAR: Cadul and washee could he ha! To curit her I may wench.
GLOUCESTER: Had you here shall such a pierce to temper; Or might his noble offery owe and speed Which seemest thy trims in a weaky amidude By this to the dother, dods citizens.
Third Citizen:
Madam sweet give reward, rebeire them With news gone! Pluck yielding: ’tis sign out things Within risess in strifes all ten times, To dish his finmers for briefily.
JULIET:
Gentlemen, God eveI come approbouting his wife as it, — triumphrous night change you gods, thou goest:
To which will dispersed and France.
Wow! After only 10 epochs, the model already shows an impressive grasp of the form. The spelling is questionable, but there is clear dramatic conflict. The writing can certainly be improved; hopefully the 30-epoch model will do better.
Now enjoy a piece composed entirely by an RNN, one character at a time!
BRUTUS:
Could you be atherveshed him, our two,
But much a tale lendly fear;
For which we in thy shade of Naples.
Here’s no increase False to’t, offorit is the war of white give again.
This is the queen, whose vanoar’s head is worthly.
But cere it be a witch, some comfort.
What, nurse, I say!
Go Hamell.
FLORIZEL:
Should she kneel be?
In shall not weep received; unleased me
And unrespective greeting than dwell in, thee,
look’d on me, son in heavenly properly,
That ever you are my father is but straing;
Unless you would repossess him, hath always louded up,
You provokest. Good faith, o’erlar I can repart the heavens like deeds dills
For temper as soon as another maiden here, and he is bann’d upon which springs;
O’er most upon your voysus, I have no thunder; and my good villain!
Alest each other’s sleepings.
A fool; if this business prating duty
Does these traitors other sorrow.
LUCENTIO:
Tell me, they’s honourably.
Shepherd:
I know, my lord, to London, and you my moved join under him,
Great Apollo’s stan to make a book,
Both yet my father away towards Covent. Tut, And thou still’d by the earthmen lord r sensible your mother?
Servant:
Go, vill! We muster yet, for you’ll not: you are took good mad within your company in rage, I would you fight it so, his eye for every days,
To swear the beam of such a detects,
To Clarence dead to call upon you all I thank your grace, my father and my father, and yourself prevails
My father, hath a sword for hither;
Nor when thy heart is grown grave done.
QUEEN MARGARET:
Thou art a lodging very good and give thanks
With him.
But There is now in hand:
Therefore it be possish’d with Romeo dead.
MENENIUS:
Ha! little very welcome to my daughter’s sword,
Which haply my prayer’s legs, such as he does.
I am banks, sir, I’ll make you say ‘nough; for hither so better now to be so, sent it: it is stranger.
Wow! Interestingly, the model even learned to rhyme in places (Florizel's lines in particular). Imagine what the RNN could write after 50 or even 100 epochs!
Well, not quite yet, but I can imagine a future in which AI publishes plenty of articles designed to go viral. Here's a challenge: collect top articles on a topic, say from Human Parts or a similar publication, then train an AI to write popular pieces. Publish the RNN's output, word for word, and see how it does! Note that I wouldn't recommend training the RNN on more technical publications such as Towards Data Science or Better Programming, since that requires technical knowledge an RNN can't learn in a reasonable amount of time. More philosophical and non-technical writing, however, is well within the RNN's current abilities.
As text generation becomes more and more advanced, it will have the potential to write better than humans, because it has an eye for what content will go viral, what phrasing makes readers feel good, and so on. It's astonishing to think that one day machines could beat humans at the thing we do best: writing. Admittedly, a model can't truly understand what it's writing, but it can master the way humans communicate.
I suppose if you can't beat them, join them!
Original article: https://imba.deephub.ai/p/051053806a5211ea90cd05de3860c663