Semi-supervised learning with sklearn

Date: 2019-12-07

Tags: semi-supervised learning

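Author: 炼己者
You are welcome to visit my Jianshu page and my personal blog.
Everything on this blog is for learning, research, and sharing. If you want to repost anything, please contact me, credit the author and the source, and keep it non-commercial. Thank you!
---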

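Abstract: Semi-supervised learning matters. Why? Because manual labeling is expensive. The data you get in competitions is already labeled, but suppose your boss hands you several million unlabeled records and asks for a classifier. What do you do? You cannot wait for everything to be labeled before training a model, so you need to know semi-supervised learning algorithms.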

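A word of warning first: the semi-supervised models in sklearn cannot handle large datasets. In my experiments, anything above roughly 15,000 samples raised a MemoryError, which is the error I hate most. I have not found a fix yet, so if your dataset is small, treat this as something to play with. If you are interested, you can also read this article: text classification with semi-supervised algorithms.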
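One likely reason for the MemoryError: with the default RBF kernel, the label propagation models in sklearn build a dense n_samples x n_samples affinity matrix. The arithmetic below is a rough sketch, assuming float64 entries. The helper name dense_matrix_gib is made up for illustration, and intermediate copies can push the real peak memory several times higher.

```python
def dense_matrix_gib(n_samples, bytes_per_entry=8):
    """Size in GiB of one dense n_samples x n_samples matrix (float64 by default)."""
    return n_samples ** 2 * bytes_per_entry / 2 ** 30


for n in (330, 15_000, 1_000_000):
    print(f"{n:>9} samples -> {dense_matrix_gib(n):,.3f} GiB per dense matrix")
```

The 330 samples used in this post are tiny by this measure, while a single matrix for 15,000 samples already needs well over 1 GiB.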

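What can you do about a MemoryError? Documentation as thorough as sklearn's naturally covers this; see the sklearn guide on handling large datasets. You can train incrementally with partial_fit, but unfortunately only some algorithms support it, and the semi-supervised ones do not.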
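For reference, this is roughly what incremental training with partial_fit looks like. It is a minimal sketch using SGDClassifier, one of the estimators that does support partial_fit. It is fully supervised, so it does not solve the semi-supervised problem; the batch size of 200 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0            # pixel values range from 0 to 16
classes = np.unique(y)  # partial_fit needs the full class list on the first call

clf = SGDClassifier(random_state=0)
batch_size = 200        # arbitrary; only one batch is processed at a time
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

print("training accuracy: %.3f" % clf.score(X, y))
```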

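Now to the main topic. I recently read the semi-supervised learning examples in the sklearn documentation. There are three of them, and here I mainly want to share the third one: digit recognition with semi-supervised learning.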

I. Understanding the dataset

First, let's look at the characteristics of this dataset.
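The figure that originally illustrated the dataset appears to be missing here. A minimal sketch for inspecting it yourself, assuming the load_digits dataset that the code later in this post uses:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
print(digits.data.shape)           # one row of 64 pixel values per digit
print(digits.images.shape)         # the same digits as 8x8 images
print(np.unique(digits.target))    # the class labels
print(np.bincount(digits.target))  # number of samples per class
```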

II. Walking through the code

The official sklearn example: digit recognition with semi-supervised learning

Let's look at the workflow:

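There are 330 points in total, all of them already labeled. We set the labels of 320 of them to -1 so that we can pretend those 320 points are unlabeled.

Train a label propagation model with only 10 labeled points.

Then, from all the data, pick the five points the model is most uncertain about and add them (with their correct labels) to the original 10 points.

Next, train with 15 labeled points (the original 10 plus 5 new ones).

Repeat this process four times to end up training a model with 30 labeled points.

You can raise this number above 30 by changing max_iterations.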

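That is the workflow of the sklearn example, which may be a little confusing.
A real task would look like this. Suppose we have a dataset of 330 digits. The first 10 are known and already labeled; the remaining 320 are unknown and have to be predicted.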

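First, feed all 330 samples into the semi-supervised algorithm, train the model, and predict the 320 missing labels.

Then use some method (see the code below) to find the 5 samples among those 320 that the model is most uncertain about, label them by hand, and add them to the 10 labeled samples. Now there are 15 known samples.

Repeat this loop a few times and the labeled set grows, so the classifier should get better as well.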
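The real-world workflow described above can be sketched as a short loop. This is only an illustration, not the official example: ask_annotator is a hypothetical stand-in for the human who labels the uncertain samples (here it simply looks up the known answer), and three rounds is an arbitrary choice.

```python
import numpy as np
from scipy import stats
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading


def ask_annotator(index, y_true):
    """Hypothetical stand-in for a human annotator: looks up the known label."""
    return y_true[index]


digits = datasets.load_digits()
X, y_true = digits.data[:330], digits.target[:330]

y_train = np.full(len(X), -1)   # -1 means "unlabeled"
y_train[:10] = y_true[:10]      # pretend only the first 10 labels are known

for round_number in range(3):
    model = LabelSpreading(gamma=0.25, max_iter=5)
    model.fit(X, y_train)

    unlabeled = np.flatnonzero(y_train == -1)
    # Entropy of each sample's predicted class distribution: higher = less certain
    entropies = stats.entropy(model.label_distributions_.T)
    most_uncertain = unlabeled[np.argsort(entropies[unlabeled])[::-1][:5]]

    accuracy = np.mean(model.transduction_[unlabeled] == y_true[unlabeled])
    print("round %d: %d labeled, accuracy on unlabeled samples %.3f"
          % (round_number, len(X) - len(unlabeled), accuracy))

    # "Manually" label the 5 most uncertain samples for the next round
    for index in most_uncertain:
        y_train[index] = ask_annotator(index, y_true)
```

In a real project the call to ask_annotator is where a person would look at the sample and type in its label.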

1. Import the packages

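import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import datasets
# Alias the package as label_propagation: newer scikit-learn releases no longer expose
# the sklearn.semi_supervised.label_propagation submodule, but LabelSpreading is available
# on the package itself in old and new versions alike.
from sklearn import semi_supervised as label_propagation
from sklearn.metrics import classification_report, confusion_matrix

# Also add the import below; without it I got an error
from scipy.sparse.csgraph import *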

2. Load the dataset

digits = datasets.load_digits()
rng = np.random.RandomState(0)
# indices holds the numbers 0 to 1796 in shuffled order
indices = np.arange(len(digits.data))
rng.shuffle(indices)

# Take the first 330 digits to play with
X = digits.data[indices[:330]]
y = digits.target[indices[:330]]
images = digits.images[indices[:330]]

n_total_samples = len(y)  # 330
n_labeled_points = 10  # 10 labeled samples
max_iterations = 5  # run 5 iterations

unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]  # the 320 unlabeled samples
f = plt.figure()  # used for plotting

3. Train the model and plot

I recommend printing out anything you do not understand to see what it means, for example:

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("no unlabeled items left to label")  # everything has been labeled
        break
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1  # mark all unlabeled samples (the last 320) as -1

    lp_model = label_propagation.LabelSpreading(gamma=0.25, max_iter=5)  # train the model
    lp_model.fit(X, y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]  # predicted labels
    true_labels = y[unlabeled_indices]  # true labels

    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print("Predicted labels")
    print(predicted_labels)
    print("True labels")
    print(true_labels)
    print('----------------------------------------------')

Comparing the two shows that the predicted labels cover only 7 classes rather than 10.

The reason is that the 10 samples we trained on at the start cover only 7 classes, so the model can only predict those 7 classes for the other 320 samples.

Predicted labels
[2 8 6 6 6 6 1 9 5 8 8 2 8 7 7 6 7 9 2 9 7 7 6 8 9 1 8 1 9 1 1 6 7 7 9 9 7
 6 2 1 9 6 7 9 9 9 9 1 6 9 9 2 8 7 2 9 2 6 9 1 8 9 5 1 2 1 2 2 9 7 2 8 6 9
 9 8 7 5 1 2 9 9 8 1 7 7 1 1 6 1 5 9 2 6 8 9 2 1 7 7 9 7 8 9 7 5 8 2 1 9 2
 9 8 1 1 7 9 6 1 5 8 9 9 6 9 9 5 7 9 6 2 8 6 9 6 1 5 1 5 9 9 1 8 9 6 1 8 9
 1 7 6 7 6 5 6 9 8 8 9 8 6 1 9 7 2 6 8 8 6 7 1 9 6 9 9 8 9 8 9 7 7 9 7 8 9
 7 8 9 6 7 5 9 1 7 6 1 9 8 9 9 9 9 2 1 1 2 1 1 1 9 2 1 9 8 7 6 1 8 8 1 6 9
 9 6 9 2 2 9 7 6 1 1 9 7 2 7 8 6 6 7 5 2 8 7 2 7 9 5 7 9 9 2 6 5 9 7 1 8 8
 9 8 6 7 6 9 2 6 1 8 8 1 6 7 5 2 1 5 8 2 1 6 9 1 5 7 9 1 6 2 9 9 1 2 2 9 9
 6 9 7 2 9 7 5 8 6 7 8 2 8 7 9 7 2 6 5 1 5 1 9 8]
True labels
[2 8 6 6 6 6 1 0 5 8 8 7 8 4 7 5 4 9 2 9 4 7 6 8 9 4 3 1 0 1 8 6 7 7 1 0 7
 6 2 1 9 6 7 9 0 0 5 1 6 3 0 2 3 4 1 9 2 6 9 1 8 3 5 1 2 8 2 2 9 7 2 3 6 0
 5 3 7 5 1 2 9 9 3 1 7 7 4 8 5 8 5 5 2 5 9 0 7 1 4 7 3 4 8 9 7 9 8 2 6 5 2
 5 8 4 8 7 0 6 1 5 9 9 9 5 9 9 5 7 5 6 2 8 6 9 6 1 5 1 5 9 9 1 5 3 6 1 8 9
 8 7 6 7 6 5 6 0 8 8 9 8 6 1 0 4 1 6 3 8 6 7 4 5 6 3 0 3 3 3 0 7 7 5 7 8 0
 7 8 9 6 4 5 0 1 4 6 4 3 3 0 9 5 9 2 1 4 2 1 6 8 9 2 4 9 3 7 6 2 3 3 1 6 9
 3 6 3 2 2 0 7 6 1 1 9 7 2 7 8 5 5 7 5 2 3 7 2 7 5 5 7 0 9 1 6 5 9 7 4 3 8
 0 3 6 4 6 3 2 6 8 8 8 4 6 7 5 2 4 5 3 2 4 6 9 4 5 4 3 4 6 2 9 0 1 7 2 0 9
 6 0 4 2 0 7 9 8 5 4 8 2 8 4 3 7 2 6 9 1 5 1 0 8]
----------------------------------------------
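The class-coverage explanation above can be checked directly. The sketch below repeats the setup from this post in a self-contained form and compares the classes present in the 10 labeled samples with the classes the fitted model knows about (classes_) and the classes it actually predicts (transduction_).

```python
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading

digits = datasets.load_digits()
rng = np.random.RandomState(0)
indices = np.arange(len(digits.data))
rng.shuffle(indices)
X = digits.data[indices[:330]]
y = digits.target[indices[:330]]

y_train = np.copy(y)
y_train[10:] = -1  # keep only the first 10 labels

lp_model = LabelSpreading(gamma=0.25, max_iter=5).fit(X, y_train)

print("classes among the 10 labeled samples:", np.unique(y[:10]))
print("classes the model knows about:       ", lp_model.classes_)
print("classes it predicts for the rest:    ", np.unique(lp_model.transduction_[10:]))
```

A class that never appears among the labeled samples cannot appear in classes_, so the model has no way to predict it.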

3.1 Full code

You can also check the final printed output on the official site:
Digit recognition with semi-supervised learning

for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("no unlabeled items left to label")  # everything has been labeled
        break
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1  # mark all unlabeled samples (initially the last 320) as -1

    lp_model = label_propagation.LabelSpreading(gamma=0.25, max_iter=5)  # train the model
    lp_model.fit(X, y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]  # predicted labels
    true_labels = y[unlabeled_indices]  # true labels

    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print("iteration %i %s" % (i, 70 * "_"))  # print the iteration number
    print("Label Spreading model: %d labeled & %d unlabeled (%d total)"
          % (n_labeled_points, n_total_samples - n_labeled_points, n_total_samples))

    print(classification_report(true_labels, predicted_labels))

    print("Confusion matrix")
    print(cm)

    # Compute the entropy of the transduced label distributions.
    # lp_model.label_distributions_ holds the categorical distribution for each item.
    pred_entropies = stats.entropy(lp_model.label_distributions_.T)

    # Select the indices of the 5 digits the classifier is most uncertain about:
    # compute every entropy (the uncertainty), then keep the 5 unlabeled samples with the largest entropy.
    # np.argsort(A) returns the indices that would sort A (see section 3.2);
    # [::-1] reverses them so that the largest entropy comes first.
    # np.in1d tests whether each value of one array is present in another and returns
    # a boolean array (see section 3.3).
    uncertainty_index = np.argsort(pred_entropies)[::-1]
    uncertainty_index = uncertainty_index[
        np.in1d(uncertainty_index, unlabeled_indices)][:5]  # change 5 to pick more or fewer samples per round; they all join the labeled set

    # Keep track of the indices we get labels for (integer dtype, as np.delete requires)
    delete_indices = np.array([], dtype=int)

    # Visualize the results of the first 5 iterations
    if i < 5:
        f.text(.05, (1 - (i + 1) * .183),
               'model %d\n\nfit with\n%d labels' %
               ((i + 1), i * 5 + 10), size=10)
    for index, image_index in enumerate(uncertainty_index):
        # image_index is the index of one of the 5 most uncertain samples
        # index runs from 0 to 4
        image = images[image_index]

        # Visualize the results of the first 5 iterations
        if i < 5:
            sub = f.add_subplot(5, 5, index + 1 + (5 * i))
            sub.imshow(image, cmap=plt.cm.gray_r)
            sub.set_title("predict: %i\ntrue: %i" % (
                lp_model.transduction_[image_index], y[image_index]), size=10)
            sub.axis('off')

        # Remove the 5 uncertain samples from the unlabeled set.
        # np.where takes a condition and returns the indices where it holds.
        delete_index, = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))

    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    # n_labeled_points grows by the number of uncertain samples that were just labeled
    n_labeled_points += len(uncertainty_index)

f.suptitle("Active learning with label propagation.\nRows show 5 most "
           "uncertain labels to learn with the next model")
plt.subplots_adjust(left=0.12, bottom=0.03, right=0.9, top=0.8, wspace=0.2, hspace=0.45)
plt.show()
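The entropy step in the code above is easier to follow on a toy example. The distributions below are made-up values, not model output; the point is only that stats.entropy works column by column (hence the transpose) and that a flatter distribution yields a larger entropy.

```python
import numpy as np
from scipy import stats

# Each row is one sample's predicted class distribution (3 classes, made-up values)
label_distributions = np.array([
    [1.0, 0.0, 0.0],    # fully confident
    [0.7, 0.2, 0.1],    # fairly confident
    [1/3, 1/3, 1/3],    # maximally uncertain
])

# stats.entropy computes along axis 0, so transpose to get one value per sample
entropies = stats.entropy(label_distributions.T)
print(entropies)
```

The first sample has entropy 0, the last one has the maximum possible value for three classes, ln 3.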

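3.2 The numpy.argsort() function

Returns, for each element of the sorted result, its index in the original array.

import numpy as np
B = np.array([[4, 2, 3, 55], [5, 6, 37, 8], [-7, 68, 9, 0]])
print('B:')
print(B)

print('')
print('Default output')
print(np.argsort(B))  # by default, sorts each row: the indices point to that row's elements in ascending order

Here is the printed result:

B:
[[ 4  2  3 55]
 [ 5  6 37  8]
 [-7 68  9  0]]

Default output
[[1 2 0 3]
 [0 1 3 2]
 [0 3 2 1]]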
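The active learning loop needs the largest values first, which is why the argsort result is reversed with [::-1]. A small sketch on a 1-D array of made-up scores:

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.1, 0.5])
ascending = np.argsort(scores)         # indices from smallest to largest score
descending = np.argsort(scores)[::-1]  # reversed: largest score first
print(ascending)               # [2 0 3 1]
print(descending)              # [1 3 0 2]
print(scores[descending][:2])  # the two largest scores: [0.9 0.5]
```

Note that a step of 1, as in [::1], leaves the order unchanged, so it would select the smallest values instead.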

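3.3 The np.in1d() function

Tests whether each value of one array is present in another array and returns a boolean array. (Newer NumPy versions recommend np.isin, which gives the same result here.)

values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])

Here is the printed result:

array([ True, False, False,  True,  True, False,  True])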
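In the active learning loop this membership test acts as a filter: it keeps the ranking order but drops indices that are already labeled. A minimal sketch with made-up indices, using np.isin (the newer equivalent of np.in1d for this purpose):

```python
import numpy as np

ranked = np.array([7, 2, 9, 4, 1])  # sample indices, most uncertain first (made-up)
unlabeled = np.array([1, 2, 4])     # indices that still lack a label

mask = np.isin(ranked, unlabeled)   # True where the ranked index is still unlabeled
top_two = ranked[mask][:2]          # keep the ranking order, drop labeled indices
print(mask)     # [False  True False  True  True]
print(top_two)  # [2 4]
```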

III. Summary

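This time my main goal was to use semi-supervised learning for NLP text classification. I saw that sklearn happened to include these algorithms and wanted to try them, but they could not handle that much data, so the attempt counts as a failure. Even so, I feel I learned a lot from it. I will write a blog post about it later; the text processing in particular taught me a great deal, after plenty of detours. Next I will keep exploring how to do text classification with little labeled data.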