This model was first applied to medical diagnosis, where the different values of the class variable represent the different diseases a patient might have, and the evidence variables represent symptoms, lab results, and so on. For simple diagnostic tasks the naive Bayes model indeed worked very well, sometimes even outperforming human experts. In deeper applications, however, physicians found that for more complex diseases (those expressed through multiple interacting causes and symptoms) the model performed rather poorly.
After analysis, data scientists concluded that the root cause is that the model makes strong assumptions that are usually untrue in practice, most notably that all evidence variables are conditionally independent given the class.
Such models were usable for medical diagnosis because their small number of interpretable parameters is easy to elicit from experts, and the early machine-assisted diagnosis systems were indeed built on this technique.
Deeper practice later showed, however, that the strong assumptions behind such a model reduce its diagnostic accuracy; in particular, by "over-counting" certain pieces of evidence, the model easily overestimates the influence of certain features.
For example, "hypertension" and "obesity" are two strong indicators of heart disease, but the two symptoms are highly correlated: hypertension usually goes hand in hand with obesity. When the naive Bayes formula is applied, the evidence along this dimension is effectively counted twice because the likelihood terms are multiplied, as in the formula below:
P(heart disease | hypertension, obesity) = P(hypertension | heart disease) × P(obesity | heart disease) × P(heart disease) / P(hypertension, obesity)
Because "hypertension" and "obesity" are strongly correlated, it is easy to see that the product in the numerator grows at a higher rate than the joint distribution in the denominator. So as more such terms are multiplied into the numerator, the posterior probability keeps increasing; yet since the added features carry no new information, this inflation of the posterior actually degrades the model's predictive performance.
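A minimal numeric sketch of this double-counting effect, using hand-picked Gaussian likelihood parameters (invented for illustration, not fitted from any clinical data): duplicating a piece of evidence that is already accounted for pushes the posterior further up even though no new information has been added.

# -*- coding: utf-8 -*-
# A minimal sketch of evidence double-counting in naive Bayes.
# The Gaussian likelihood parameters below are invented for illustration only.
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def posterior(evidence, prior_disease=0.5):
    # evidence: list of (observed_value, mean_if_disease, mean_if_healthy)
    like_d, like_h = prior_disease, 1.0 - prior_disease
    for x, mu_d, mu_h in evidence:
        like_d *= gauss_pdf(x, mu_d)
        like_h *= gauss_pdf(x, mu_h)
    return like_d / (like_d + like_h)

bp = (1.5, 1.0, 0.0)        # "hypertension" evidence: informative on its own
bp_copy = (1.5, 1.0, 0.0)   # a perfectly correlated copy of the same evidence

print(posterior([bp]))           # evidence counted once
print(posterior([bp, bp_copy]))  # the duplicated evidence inflates the posterior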
In fact, practitioners have found that the diagnostic performance of naive Bayes models can degrade as more features are added, and this degradation is usually attributed to violations of the strong conditional independence assumption.
I call this phenomenon "over-featuring". It is common in engineering practice, and when it is not effectively avoided it can significantly degrade a model's generalization and predictive performance. In this article we support this claim through experiments and analysis.
These four features can be used to predict which of the three species (iris-setosa, iris-versicolour, iris-virginica) an iris flower belongs to.
Let us first discuss the under-featuring case. Our dataset has four feature dimensions, and all four are highly correlated with the target; in other words, all four are information-rich features:
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

if __name__ == '__main__':
    # naive Bayes classifier
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vectors: first 80% for training, last 20% for testing
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # labels
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # the original 4 feature columns
    X_train_vec = X_train[:, :4]
    X_test_vec = X_test[:, :4]

    # Pearson correlation between each feature column and the target
    print("Pearson Relevance X[0]: ", np.corrcoef(X_train_vec[:, 0], Y_train)[0, 1])
    print("Pearson Relevance X[1]: ", np.corrcoef(X_train_vec[:, 1], Y_train)[0, 1])
    print("Pearson Relevance X[2]: ", np.corrcoef(X_train_vec[:, 2], Y_train)[0, 1])
    print("Pearson Relevance X[3]: ", np.corrcoef(X_train_vec[:, 3], Y_train)[0, 1])
The Pearson correlations of all four features with the target exceed 0.5.
Now let us train naive Bayes models using only 1, 2, 3, and 4 features respectively, and compare their generalization and predictive performance:
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle


def model_tain_and_test(feature_cn):
    # keep only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes classifier
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vectors: first 80% for training, last 20% for testing
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # labels
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # test generalization and prediction with 1..4 features
    model_tain_and_test(1)
    model_tain_and_test(2)
    model_tain_and_test(3)
    model_tain_and_test(4)
As we can see, with only one feature the prediction accuracy on the test set is just 33.3%; as the number of features increases, the test accuracy increases as well.
Viewed from the perspective of a Bayesian network, the naive Bayes model has the following structure,
The Xi nodes correspond to features. Every Xi node added to the network changes the probabilistic inference about the Class node; the more (informative) Xi there are, the more accurate the inference.
This is also easy to understand from an information-theoretic point of view: we can regard P(Class | Xi) as information transfer that lowers a conditional entropy; the more information we provide, the lower, in principle, the remaining uncertainty about Class.
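A small sketch of this view, using sklearn's mutual_info_classif to estimate how much uncertainty about the class each iris feature removes (the random_state value is an arbitrary choice of mine):

# -*- coding: utf-8 -*-
# Estimate the mutual information I(Class; Xi) between each iris feature and the label.
# Higher values mean the feature removes more of the conditional entropy H(Class | Xi).
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
mi = mutual_info_classif(iris.data, iris.target, random_state=0)
for name, value in zip(iris.feature_names, mi):
    print("{0}: {1:.3f}".format(name, value))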
At this point we can draw the following conclusion:
During feature engineering we must pay special attention to the description integrity problem: when the feature dimensions are incomplete, no amount of additional data will substantially improve the model. The probabilistic completeness of a sample set has to be guaranteed from two sides, "feature completeness" and "data completeness"; at bottom, both come down to information completeness.
Now, on top of the original four feature dimensions, we keep adding new useless features, i.e., features that have very low correlation with the target.
# -*- coding: utf-8 -*-
import random

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle


def feature_expend(feature_vec):
    # colum_1 * colum_2
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    # 4 random noise columns drawn from colum_1
    for _ in range(4):
        feature_vec = np.hstack((feature_vec, np.array([[random.uniform(.0, i)] for i in feature_vec[:, 0]])))
    # 4 random noise columns drawn from colum_2
    for _ in range(4):
        feature_vec = np.hstack((feature_vec, np.array([[random.uniform(.0, i)] for i in feature_vec[:, 1]])))
    return feature_vec


def model_tain_and_test(X_train, X_test, Y_train, Y_test, feature_cn):
    # keep only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes classifier
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vectors: first 80% for training, last 20% for testing
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # labels
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # expand features
    X_train = feature_expend(X_train)
    X_test = feature_expend(X_test)
    print("X_test: ", X_test)

    # show the Pearson correlation of every feature column with the target
    for i in range(len(X_train[0])):
        print("Pearson Relevance X[{0}]: ".format(i), np.corrcoef(X_train[:, i], Y_train)[0, 1])

    model_tain_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]))
We used the random function to simulate useless new features. As we can see, the useless features not only fail to help the model, they actually degrade its performance.
At this point we can draw the following conclusion:
More features are not always better. A machine learning model is not a washing machine: you cannot simply dump every feature in, fold your hands, and hope the model works its magic and automatically picks out the useful ones. Of course, techniques such as dropout and regularization do help model performance, and in essence they too work by suppressing some features, thereby mitigating the damage that junk features do to the model.
Automated feature engineering ("autoFeature") techniques may well mature in the future, but as data science practitioners we still have to understand the purpose of feature engineering ourselves.
So-called "feature processing" means applying linear transformations (stretching and rotation) to the original features to obtain new features, for example:
Essentially, the hidden layers of a deep neural network can be viewed as exactly this kind of feature processing; the slight difference is that the activation functions in a deep network add a nonlinear warp, but the underlying idea is the same.
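A minimal NumPy sketch of this view (the weight matrix W and bias b below are random placeholders, not a trained network): each hidden unit is a linear combination of the original features followed by a nonlinearity, i.e., a machine-generated "new feature".

# -*- coding: utf-8 -*-
# One hidden layer viewed as feature processing: linear transform + nonlinear warp.
# W and b are random placeholders here; in a real network they are learned.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                          # the original 4 features

rng = np.random.RandomState(0)
W = rng.randn(4, 8)                    # linear transform: stretch/rotate into 8 new directions
b = rng.randn(8)

hidden = np.maximum(0, X.dot(W) + b)   # ReLU provides the nonlinear warping
print(hidden.shape)                    # (150, 8): 8 machine-generated feature columns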
The next question, then, is: does feature processing affect model performance?
The precise answer is that the impact of feature processing depends on how correlated the new features are with the target, and on the share of bad features among all features.
Let us unpack that statement through a few experiments. Below I simulate several typical scenarios and then summarize the conclusions:
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle


def feature_expend(feature_vec):
    # colum_1 * colum_2
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    # colum_1 / colum_2
    # feature_vec = np.hstack((feature_vec, np.divide(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    # colum_3 * colum_4
    # feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 2], feature_vec[:, 3]).reshape(-1, 1)))
    # colum_4 * colum_1
    # feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 3], feature_vec[:, 0]).reshape(-1, 1)))
    # colum_1 ^ 2
    # feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 0], feature_vec[:, 0]).reshape(-1, 1)))
    # colum_2 ^ 2
    # feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 1], feature_vec[:, 1]).reshape(-1, 1)))
    # colum_3 ^ 2
    # feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 2], feature_vec[:, 2]).reshape(-1, 1)))
    # colum_4 ^ 2
    # feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 3], feature_vec[:, 3]).reshape(-1, 1)))
    return feature_vec


def model_tain_and_test(X_train, X_test, Y_train, Y_test, feature_cn):
    # keep only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes classifier
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vectors: first 80% for training, last 20% for testing
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # labels
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # expand features
    X_train = feature_expend(X_train)
    X_test = feature_expend(X_test)
    print("X_test: ", X_test)

    # show the Pearson correlation of every feature column with the target
    for i in range(len(X_train[0])):
        print("Pearson Relevance X[{0}]: ".format(i), np.corrcoef(X_train[:, i], Y_train)[0, 1])

    # original 4 features only vs. with the extra "colum_1 * colum_2" feature
    model_tain_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]) - 1)
    model_tain_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]))
In the code above we add one new feature dimension, "colum_1 * colum_2", and print its Pearson correlation with the target. The correlation is only about 0.15, so it is a poor feature; moreover, this bad feature makes up 1/5 of all features, which is not a small share.
As a result, the model's detection performance suffers and drops in this scenario. The reason was explained earlier: through the multiplicative effect, the bad feature distorts the final probability value.
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle


def feature_expend(feature_vec):
    # colum_1 * colum_2, colum_1 / colum_2, colum_3 * colum_4, colum_4 * colum_1
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    feature_vec = np.hstack((feature_vec, np.divide(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 2], feature_vec[:, 3]).reshape(-1, 1)))
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 3], feature_vec[:, 0]).reshape(-1, 1)))
    # colum_1 ^ 2, colum_2 ^ 2, colum_3 ^ 2, colum_4 ^ 2
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 0], feature_vec[:, 0]).reshape(-1, 1)))
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 1], feature_vec[:, 1]).reshape(-1, 1)))
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 2], feature_vec[:, 2]).reshape(-1, 1)))
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 3], feature_vec[:, 3]).reshape(-1, 1)))
    return feature_vec


def model_tain_and_test(X_train, X_test, Y_train, Y_test, feature_cn):
    # keep only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes classifier
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vectors: first 80% for training, last 20% for testing
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # labels
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # expand features
    X_train = feature_expend(X_train)
    X_test = feature_expend(X_test)
    print("X_test: ", X_test)

    # show the Pearson correlation of every feature column with the target
    for i in range(len(X_train[0])):
        print("Pearson Relevance X[{0}]: ".format(i), np.corrcoef(X_train[:, i], Y_train)[0, 1])

    model_tain_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]))
In this scenario the bad feature "colum_1 * colum_2" is still present, but unlike the previous scenario, all of the other newly added features are good ones (all highly correlated with the target).
By the simple logic of multiplicative factors, the influence of this bad feature on the final probability is "diluted", which reduces its impact on model performance.
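A tiny numeric sketch of this dilution effect, with invented per-feature likelihood ratios (they are not taken from the iris experiment): as more informative factors enter the product, the damage done by the single noisy factor shrinks.

# -*- coding: utf-8 -*-
# A tiny numeric sketch of the "dilution" effect.
# The per-feature likelihood ratios below are invented for illustration only.
good = 0.9 / 0.1   # likelihood ratio of an informative feature (points to the true class)
bad = 0.4 / 0.6    # likelihood ratio of a noisy feature (points the wrong way)

for n_good in (1, 3, 8):
    with_bad = bad * good ** n_good
    without_bad = good ** n_good
    # posterior of the true class under a uniform prior
    p_with = with_bad / (1.0 + with_bad)
    p_without = without_bad / (1.0 + without_bad)
    print(n_good, round(p_without - p_with, 4))   # the gap caused by the bad feature shrinks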
At this point we can draw the following conclusion:
The hidden layers of a deep neural network massively increase the number of features. In essence, through matrix linear transformations and nonlinear activation functions, a deep network produces a huge number of combined feature dimensions. We can imagine that among them there are bound to be good features (high correlation) as well as bad ones (low correlation).
One thing we can be fairly sure of, however, is that good features are far more likely to appear than bad ones, because all of the derived features stem from the good features at the input layer (an idea reminiscent of genetic evolution). So when enough new features are added, the influence of the good features will, probabilistically speaking, far outweigh that of the bad ones and cancel out the harm the bad features do to model performance. This is one of the reasons deep neural networks adapt so well.
To put it colloquially: if you have an ox cleaver, why not use it to kill a chicken? The benefit of killing a chicken with an ox cleaver is that no matter whether a chicken or an ox shows up, you can adaptively make sure it gets handled.
Redundant features and over-featuring are not rare in machine learning models, and they take different forms in different models, for example:
Here are some guiding principles I have distilled from engineering practice: