The previous posts covered PCA and SVD for dimensionality reduction. This one covers another approach: autoencoder-based dimensionality reduction (autoencoding).
An autoencoder is a data-compression algorithm made up of two components: an encoder and a decoder. The network structure looks like this:
First, a caveat: an autoencoder is not an image-compression tool. It only compresses the dimensionality of the data, so that the reduced dimensions can still represent the original data. Image compression, by contrast, discards pixel information so that the result still *looks* like the original image. Looking alike and being representative of the data are two different things, and not entirely the same. As an example, in the image below, are the upper and lower color blocks the same color? To a computer, they are.
So what are autoencoders good for? According to Baidu Baike, they are applied to dimensionality reduction and anomaly detection, and autoencoders built with convolutional layers are used in computer-vision tasks such as image denoising and neural style transfer. Here, I only care about dimensionality reduction.
The data is again the user data from the KTV app: 3748 users and 16692 song records.
```python
song_hot_matrix.shape  # (3748, 16692)
```
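For context, `song_hot_matrix` is a user-by-song multi-hot matrix: each row is a user, each column a song, with a 1 where the user sang that song. A minimal sketch of how such a matrix can be built (the `plays` records and variable names here are made up for illustration, not the post's actual preprocessing code):

```python
import numpy as np

# Hypothetical play records: user -> songs they sang (illustrative only)
plays = {
    "u1": ["一路向北", "菊花台"],
    "u2": ["小苹果"],
    "u3": ["一路向北", "小苹果", "平凡之路"],
}

# Map each distinct song to a column index
songs = sorted({s for song_list in plays.values() for s in song_list})
song_index = {s: i for i, s in enumerate(songs)}

# Fill the multi-hot matrix: one row per user, one column per song
hot_matrix = np.zeros((len(plays), len(songs)), dtype=np.float32)
for row, (_, song_list) in enumerate(sorted(plays.items())):
    for s in song_list:
        hot_matrix[row, song_index[s]] = 1.0

print(hot_matrix.shape)  # (3, 4)
```

The real matrix is built the same way, just over 3748 users and 16692 songs.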
The end goal of training is to predict whether a user is male or female.
```python
decades_hot_matrix.shape  # (3748, 2)
```
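The two columns come from one-hot encoding each user's gender. A quick sketch of that encoding (with made-up labels; the real post uses its own encoder class):

```python
import numpy as np

# Hypothetical gender labels for three users (illustrative only)
labels = ["男", "女", "男"]

# Map each class to a column, then pick rows of the identity matrix
classes = sorted(set(labels))  # ['女', '男']
hot = np.eye(len(classes), dtype=np.float32)[[classes.index(l) for l in labels]]

print(hot.shape)  # (3, 2) -- one row per user, one column per gender
```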
The encoding dimension here is 500. I also tried 300, and the accuracy was about the same.
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # disable the GPU

from keras.layers import Input, Dense
from keras.models import Model
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(
    song_hot_matrix, decades_hot_matrix, test_size=0.2, random_state=0)

encoding_dim = 500
input_matrix = Input(shape=(song_hot_matrix.shape[1],))
encoded = Dense(encoding_dim, activation='relu')(input_matrix)
decoded = Dense(song_hot_matrix.shape[1], activation='sigmoid')(encoded)

autoencoder = Model(input_matrix, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy',
                    metrics=['accuracy'])

# Train: the autoencoder learns to reconstruct its own input
history = autoencoder.fit(train_X, train_X,
                          epochs=50,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(test_X, test_X))

# Extract the encoder half as a standalone model
encoder = Model(input_matrix, encoded)
```
Result:
```
2998/2998 [==============================] - 2s 679us/step - loss: 0.0092 - accuracy: 0.9984 - val_loss: 0.0316 - val_accuracy: 0.9982
Epoch 49/50
2998/2998 [==============================] - 2s 675us/step - loss: 0.0091 - accuracy: 0.9984 - val_loss: 0.0313 - val_accuracy: 0.9982
Epoch 50/50
2998/2998 [==============================] - 2s 694us/step - loss: 0.0090 - accuracy: 0.9984 - val_loss: 0.0312 - val_accuracy: 0.9982
```
Judging by the reconstruction loss, this looks decent.
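For comparison with the PCA/SVD posts: the autoencoder above learns a *nonlinear* compression (via the relu layer), while PCA gives the best *linear* one. A quick linear baseline via truncated SVD, on random stand-in data rather than the actual KTV matrix, can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20)).astype(np.float32)  # stand-in for song_hot_matrix

k = 5                                     # target dimension (500 in the post)
Xc = X - X.mean(axis=0)                   # center the data, as PCA requires
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_em = Xc @ Vt[:k].T                      # linear "encoding", shape (100, 5)
X_rec = X_em @ Vt[:k] + X.mean(axis=0)    # linear "decoding" / reconstruction

print(X_em.shape)  # (100, 5)
```

If the nonlinear autoencoder cannot beat this baseline's reconstruction error, the extra training cost isn't buying anything.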
```python
from keras.models import Sequential
from keras.layers import Dense, Activation

# Encode the training and test data
train_X_em = encoder.predict(train_X)
test_X_em = encoder.predict(test_X)

# Train a logistic-regression-style model to predict gender
model = Sequential()
model.add(Dense(input_dim=train_X_em.shape[1], units=train_Y.shape[1]))
model.add(Activation('softmax'))

# Pick a loss function and an optimizer
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(train_X_em, train_Y,
          epochs=250,
          batch_size=256,
          shuffle=True,
          validation_data=(test_X_em, test_Y))
```
Result:
```
2998/2998 [==============================] - 0s 13us/step - loss: 0.4151 - accuracy: 0.8266 - val_loss: 0.4127 - val_accuracy: 0.8253
Epoch 248/250
2998/2998 [==============================] - 0s 13us/step - loss: 0.4149 - accuracy: 0.8195 - val_loss: 0.4180 - val_accuracy: 0.8413
Epoch 249/250
2998/2998 [==============================] - 0s 13us/step - loss: 0.4163 - accuracy: 0.8225 - val_loss: 0.4131 - val_accuracy: 0.8293
Epoch 250/250
2998/2998 [==============================] - 0s 13us/step - loss: 0.4152 - accuracy: 0.8299 - val_loss: 0.4142 - val_accuracy: 0.8293
```
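The single Dense + softmax classifier is just multinomial logistic regression. Its entire forward pass can be written out in numpy (the weights here are random stand-ins, not the trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(500, 2)).astype(np.float32)  # stand-in for learned weights
b = np.zeros(2, dtype=np.float32)                 # stand-in for learned bias
x = rng.random((1, 500)).astype(np.float32)       # one encoded user, 500-dim

# Dense layer: logits = x @ W + b
logits = x @ W + b

# Softmax (max-subtracted for numerical stability): probabilities over 2 classes
probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()

print(probs.shape)  # (1, 2) -- the two gender-class probabilities sum to 1
```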
```python
def pred(song_list=[]):
    blong_hot_matrix = song_label_encoder.encode_hot_dict({"bblong": song_list}, True)
    blong_hot_matrix = encoder.predict(blong_hot_matrix)
    y_pred = model.predict_classes(blong_hot_matrix)
    return user_decades_encoder.decode_list(y_pred)

print(pred(["一路向北", "暗香", "菊花台"]))
print(pred(["不要说话", "平凡之路", "李白"]))
print(pred(["满足", "被风吹过的夏天", "龙卷风"]))
print(pred(["情人", "再见", "无赖", "离人", "你的样子"]))
print(pred(["小情歌", "我好想你", "无与伦比的美丽"]))
print(pred(["忐忑", "最炫民族风", "小苹果"]))
print(pred(["青春修炼手册", "爱出发", "宠爱", "魔法城堡", "样"]))
```
Result:
```
['男']
['男']
['男']
['男']
['女']
['男']
['女']
```

(男 = male, 女 = female)
This post used another data-compression technique, the autoencoder, to compress the data. Judging from the results, 500-dimensional features are enough to represent the overall characteristics of the data.