Deep Learning with Python: Study Notes (5)

Date: 2020-05-04


This section covers deep learning for text and sequences.

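The two fundamental deep learning algorithms for sequence processing are recurrent neural networks (RNNs) and 1D convolutional neural networks (1D convnets).
Like all other neural networks, deep learning models do not take raw text as input; they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. It can be done in several ways: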

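Segment the text into words, and transform each word into a vector.
Segment the text into characters, and transform each character into a vector.
Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.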

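The units into which the text is broken down (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. Every text-vectorization process applies some tokenization scheme and then associates numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into deep neural networks.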
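As a rough illustration of the tokenization step described above, the sketch below splits one sentence into word tokens and character tokens, then maps each distinct word to an integer index. The helper names `word_tokens` and `char_tokens` and the regular expression are assumptions made for this example, not part of Keras or of the original notes.

```python
import re

def word_tokens(text):
    # Lowercase the text and keep runs of letters, digits, and apostrophes as word tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

def char_tokens(text):
    # Every character, including spaces and punctuation, becomes a token.
    return list(text)

sample = "The cat sat on the mat."
words = word_tokens(sample)
chars = char_tokens(sample)

# Associate each distinct word token with an integer index.
# Index 0 is left unused here (an assumption; it is often reserved for padding).
vocab = {}
for w in words:
    if w not in vocab:
        vocab[w] = len(vocab) + 1
encoded = [vocab[w] for w in words]

print(words)
print(encoded)
print(len(chars))
```

The integer indices are only an intermediate form; a vector still has to be associated with each index, which is what one-hot encoding and embeddings do.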

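An n-gram is a group of N (or fewer) consecutive words extracted from a sentence. The same concept also applies to characters instead of words.
Decomposing "The cat sat on the mat" into its set of 2-grams gives
{"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}
Decomposing it into its set of 3-grams gives
{"The", "The cat", "cat", "cat sat", "The cat sat",
"sat", "sat on", "on", "cat sat on", "on the", "the",
"sat on the", "the mat", "mat", "on the mat"}
Such sets are called a bag-of-2-grams and a bag-of-3-grams, respectively. The term "bag" here means that we are dealing with a set of tokens, with no particular order. This family of tokenization methods is called bag-of-words. Because bag-of-words does not preserve order, it tends to be used in shallow language-processing models rather than in deep learning models.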
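The two sets above can be reproduced with a few lines of plain Python. This is a minimal sketch assuming whitespace tokenization and case-sensitive tokens (so "The" and "the" are distinct, as in the sets above); `bag_of_ngrams` is a hypothetical helper name.

```python
def bag_of_ngrams(text, n):
    """Return the set of all groups of 1 to n consecutive words in text."""
    words = text.split()
    bag = set()
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            bag.add(" ".join(words[start:start + size]))
    return bag

sentence = "The cat sat on the mat"
bag2 = bag_of_ngrams(sentence, 2)
bag3 = bag_of_ngrams(sentence, 3)

print(sorted(bag2))
print(len(bag2), len(bag3))
```

Because the result is a `set`, word order across the sentence is discarded, which is exactly the property that makes this a "bag".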

Ways to associate a vector with a token
There are two main ones: one-hot encoding of tokens, and token embedding (typically used only for words, in which case it is called word embedding).

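One-hot encoding is the most common and most basic way to turn a token into a vector.

It associates a unique integer index with every word, then turns this integer index i into a binary vector of size N (the size of the vocabulary). The vector is all zeros except for the i-th entry, which is 1. One-hot encoding can also be done at the character level.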
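Before turning to the Keras utilities, the procedure just described can be written out by hand with NumPy. This is a toy sketch assuming naive whitespace splitting (so punctuation stays attached to words) and a cap of 10 words per sample; both are simplifications chosen for the example.

```python
import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Build the word index. Index 0 is deliberately left unassigned.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10  # only the first 10 words of each sample are encoded
vocab_size = max(token_index.values()) + 1

# results[i, j] is the one-hot vector of the j-th word of the i-th sample.
results = np.zeros((len(samples), max_length, vocab_size))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.0

print(results.shape)
```

Each encoded word position holds exactly one 1, and positions beyond the end of a sample remain all zeros.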

Keras one-hot encoding demo

from keras.preprocessing.text import Tokenizer


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Only consider the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)
# Build the word index
tokenizer.fit_on_texts(samples)
# Recover the word index that was computed
word_index = tokenizer.word_index
print(word_index)
# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)
print("Index sequences:", sequences)
# Map the index sequences back to text
text = tokenizer.sequences_to_texts(sequences)
print("Recovered text:", text)
# Get the one-hot binary representation, shape (len(samples), num_words)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
# In binary mode every entry is 0 or 1, so the sum is the number of ones
one_num = int(one_hot_results.sum())
print("Number of ones:", one_num)
print(one_hot_results)


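A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the number of unique tokens in your vocabulary is too large to handle explicitly.

Instead of assigning an explicit index to each word, it hashes words into vectors of fixed size, typically with a very lightweight hashing function.

The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data. The drawback is that it is susceptible to hash collisions: two different words may end up with the same hash.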
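A minimal sketch of the hashing trick follows. It assumes an MD5-based hash from `hashlib`, chosen here because Python's built-in `hash()` on strings varies between processes; the function names and the vector size of 1,000 are illustrative choices, not something the notes prescribe.

```python
import hashlib
import numpy as np

def stable_hash(word):
    # hashlib yields the same value in every process, unlike the built-in hash() on str.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def hashed_one_hot(samples, dimensionality=1000, max_length=10):
    # No word index is kept: the slot for each word is computed on the fly.
    results = np.zeros((len(samples), max_length, dimensionality))
    for i, sample in enumerate(samples):
        for j, word in enumerate(sample.split()[:max_length]):
            index = stable_hash(word) % dimensionality
            results[i, j, index] = 1.0
    return results

samples = ["The cat sat on the mat.", "The dog ate my homework."]
encoded = hashed_one_hot(samples)
print(encoded.shape)
```

If the vocabulary holds far more than `dimensionality` distinct words, many of them will share a slot, and the model cannot tell those words apart. Collisions become much less likely when the hashing space is much larger than the number of unique tokens.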

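Word embeddings
The vectors obtained through one-hot encoding are binary, sparse, and very high-dimensional (the dimensionality equals the number of words in the vocabulary), whereas word embeddings are low-dimensional floating-point vectors. Unlike the word vectors obtained through one-hot encoding, word embeddings are learned from data. Common word-embedding dimensionalities are 256, 512, or 1,024 (when dealing with very large vocabularies). By contrast, one-hot encoded word vectors are usually 20,000-dimensional or more. Word embeddings therefore pack more information into far fewer dimensions.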

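There are two ways to obtain word embeddings:

Learn word embeddings jointly with the main task (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn them in the same way you learn the weights of a neural network.
Load into your model word embeddings that were precomputed on a different machine learning task than the one you are trying to solve. These are called pretrained word embeddings.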

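Learning word embeddings with the Embedding layer
Word embeddings are meant to map human language into a geometric space. We would like the geometric distance between any two word vectors to relate to the semantic distance between the two words, and we may also want specific directions in the embedding space to be meaningful.
The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length). It can embed sequences of variable lengths, but all sequences within one batch must have the same length.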
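Conceptually, the Embedding layer can be thought of as a lookup table from integer indices to dense vectors. The NumPy sketch below illustrates the shapes involved; it is not how Keras implements the layer. The helper `pad_to_length` is a hypothetical stand-in that mimics the default behaviour of Keras `pad_sequences` (zeros are added at the front, and overly long sequences keep their last `maxlen` items).

```python
import numpy as np

def pad_to_length(sequences, maxlen):
    """Left-pad with 0 and keep only the last maxlen items of each sequence."""
    batch = np.zeros((len(sequences), maxlen), dtype="int64")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:
            batch[row, -len(trimmed):] = trimmed
    return batch

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4
# Random initial vectors, as when embeddings are learned from scratch.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

sequences = [[3, 1, 4], [1, 5, 9, 2, 6]]       # variable-length index lists
batch = pad_to_length(sequences, maxlen=4)     # shape (samples, sequence_length)
embedded = embedding_matrix[batch]             # shape (samples, sequence_length, embedding_dim)

print(batch)
print(embedded.shape)
```

Each integer in `batch` selects one row of `embedding_matrix`, so a 2D integer tensor becomes a 3D floating-point tensor. During training, it is the rows of that matrix that get adjusted by backpropagation.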

A simple demo

from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding
import matplotlib.pyplot as plt


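# Number of words to consider as features
max_features = 10000
# Cut the reviews off after this many words
maxlen = 20
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features, path='E:\\study\\dataset\\imdb.npz')
# Turn the lists of integers into a 2D integer tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
model = Sequential()
# After the Embedding layer, the activations have shape (samples, maxlen, 8)
model.add(Embedding(max_features, 8, input_length=maxlen))
# Flatten the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 8)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)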


acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()


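When so little training data is available that you cannot learn a task-specific word embedding from it alone, you can load embedding vectors from a precomputed embedding space instead of learning word embeddings jointly with the problem you want to solve. Various precomputed word-embedding databases can be downloaded and used in a Keras Embedding layer. word2vec is one of them; another popular one is GloVe (Global Vectors for Word Representation).

The rationale is the same as for using pretrained convnets: you do not have enough data to learn truly powerful features on your own, but the features you need are expected to be fairly generic, such as common visual features or semantic features.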
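To make the loading step concrete, here is a toy sketch that runs without downloading anything. It assumes the plain-text layout used by the GloVe files, one word per line followed by its vector components separated by spaces. The two sample lines, their values, and the `word_index` dictionary are made up for illustration.

```python
import io
import numpy as np

# Two made-up lines in the "word v1 v2 ..." layout (real files have many more values per line).
fake_glove = io.StringIO("cat 0.1 0.2 0.3\nmat 0.4 0.5 0.6\n")

embeddings_index = {}
for line in fake_glove:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical tokenizer output: word -> integer index (0 is unused).
word_index = {"the": 1, "cat": 2, "mat": 3}
embedding_dim = 3

# Row i of the matrix holds the pretrained vector of the word with index i.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words missing from the file stay all-zero

print(embedding_matrix)
```

A matrix built this way has the shape that an Embedding layer expects for its weights, which is why it can be passed to `set_weights` on that layer.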

IMDB review sentiment classification demo, using pretrained GloVe word embeddings

import os
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
import matplotlib.pyplot as plt


imdb_dir = 'E:\\study\\dataset\\aclImdb'
train_dir = os.path.join(imdb_dir, 'train')
labels = []
texts = []
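for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # Read as UTF-8 explicitly instead of relying on the platform default codec
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            # Negative reviews get label 0, positive reviews get label 1
            labels.append(0 if label_type == 'neg' else 1)
# Tokenize the text of the raw IMDB data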
maxlen = 100
training_samples = 200
validation_samples = 10000
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
# Shuffle the data, because the samples are ordered (all negative first, then all positive)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

# Parse the GloVe word-embeddings file
glove_dir = 'E:\\study\\models\\glove.6B'

embeddings_index = {}
# Each line holds a word followed by its vector components, separated by spaces
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

# Prepare the GloVe word-embeddings matrix of shape (max_words, embedding_dim)
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# Model definition
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# Load the pretrained word embeddings into the Embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
# Training and evaluation
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

# Tokenize the test set data
test_dir = os.path.join(imdb_dir, 'test')
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            # Read as UTF-8 explicitly instead of relying on the platform default codec
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            # Negative reviews get label 0, positive reviews get label 1
            labels.append(0 if label_type == 'neg' else 1)
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)
# Evaluate the model on the test set
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

The data took too long to download, so I gave up on running this one, lol.
