In machine learning, we usually split the data into a training set and a test set: the model is trained on the training set and then evaluated on the test set. The goal of training is to obtain a model that makes accurate predictions in later, real-world use, so we want it to perform well on future data, not just on the training set. A model that focuses too much on the training set ends up memorizing it wholesale instead of learning the underlying structure of the data. Such a model can fit the training set extremely well, yet it has no judgment on data it has never memorized. This is like a student who memorizes answers to exercises without understanding how to solve them; that way of studying does not lead to good results in real work.
To tell whether a model has merely memorized the training data or has actually learned its underlying structure, we use the test set to check whether the model can make accurate predictions on data it has never seen.
from sklearn.model_selection import train_test_split
from numpy import random

random.seed(2)
X = random.random(size=(12, 4))
y = random.random(size=(12, 1))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print('X_train:\n')
print(X_train)
print('\ny_train:\n')
print(y_train)
print('\nX_test:\n')
print(X_test)
print('\ny_test:\n')
print(y_test)
X_train:

[[ 0.4203678   0.33033482  0.20464863  0.61927097]
 [ 0.22030621  0.34982629  0.46778748  0.20174323]
 [ 0.12715997  0.59674531  0.226012    0.10694568]
 [ 0.4359949   0.02592623  0.54966248  0.43532239]
 [ 0.79363745  0.58000418  0.1622986   0.70075235]
 [ 0.13457995  0.51357812  0.18443987  0.78533515]
 [ 0.64040673  0.48306984  0.50523672  0.38689265]
 [ 0.50524609  0.0652865   0.42812233  0.09653092]
 [ 0.85397529  0.49423684  0.84656149  0.07964548]]

y_train:

[[ 0.95374223]
 [ 0.02720237]
 [ 0.40627504]
 [ 0.53560417]
 [ 0.06714437]
 [ 0.08209492]
 [ 0.24717724]
 [ 0.8508505 ]
 [ 0.3663424 ]]

X_test:

[[ 0.29965467  0.26682728  0.62113383  0.52914209]
 [ 0.96455108  0.50000836  0.88952006  0.34161365]
 [ 0.56714413  0.42754596  0.43674726  0.77655918]]

y_test:

[[ 0.54420816]
 [ 0.99385201]
 [ 0.97058031]]
sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner.
Parameters:

- *arrays : sequence of indexables with same length / shape[0]. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- test_size : float, int, or None (default is None). If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train_size is also None, test_size is set to 0.25.
- train_size : float, int, or None (default is None). If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
- random_state : int or RandomState. Pseudo-random number generator state used for random sampling.
- stratify : array-like or None (default is None). If not None, data is split in a stratified fashion, using this as the class labels.

Returns:

- splitting : list, length = 2 * len(arrays). List containing train-test split of inputs. New in version 0.16: if the input is sparse, the output will be a scipy.sparse.csr_matrix; otherwise, the output type is the same as the input type.
Reference
A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier.

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | a | b |
| Actual Positive | c | d |

The entries in the confusion matrix have the following meaning in the context of our study:
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
Several standard terms have been defined for the 2 class matrix:

The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:
$$AC=\frac{a+d}{a+b+c+d}$$性能
The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:
$$Recall=\frac{d}{c+d}$$
The precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:
$$P=\frac{d}{b+d}$$
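As a quick check of the three formulas, they can be evaluated directly from the four counts; the numbers below are made up purely for illustration.

# Hypothetical counts, chosen only for illustration: a = TN, b = FP, c = FN, d = TP
a, b, c, d = 50, 10, 5, 35

accuracy  = (a + d) / (a + b + c + d)
recall    = d / (c + d)          # true positive rate
precision = d / (b + d)

print(accuracy, recall, precision)   # 0.85 0.875 0.777...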
The accuracy determined using equation 1 may not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998).
Accuracy is not an appropriate measure when negative cases far outnumber positive cases, because accuracy can still be very high even when the number of true positives is zero.
Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), and F-Measure (Lewis and Gale, 1994).
$$\text{g-mean}=\sqrt{R\cdot P}$$
$$F_{\beta}=\frac{(\beta^2+1)PR}{\beta^2P+R}$$
The F1 score is simply the F-measure in the special case $\beta = 1$.
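A minimal sketch of these measures with scikit-learn; the labels are made up, and fbeta_score with beta=1 reduces to f1_score:

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# made-up labels, purely for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)            # 0.75
r = recall_score(y_true, y_pred)               # 0.75
print(f1_score(y_true, y_pred))                # F-measure with beta = 1
print(fbeta_score(y_true, y_pred, beta=1))     # identical to f1_score
print((p * r) ** 0.5)                          # g-mean as defined above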
sklearn.metrics.confusion_matrix: Compute confusion matrix to evaluate the accuracy of a classification.
By definition a confusion matrix C is such that $C_{i, j}$ is equal to the number of observations known to be in group i but predicted to be in group j.
Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.
Read more in the User Guide.
Parameters:

- y_true : array, shape = [n_samples]. Ground truth (correct) target values.
- y_pred : array, shape = [n_samples]. Estimated targets as returned by a classifier.
- labels : array, shape = [n_classes], optional. List of labels to index the matrix. This may be used to reorder or select a subset of labels. If none is given, those that appear at least once in y_true or y_pred are used in sorted order.
- sample_weight : array-like of shape = [n_samples], optional. Sample weights.

Returns:

- C : array, shape = [n_classes, n_classes]. Confusion matrix.
Examples
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0]  # y_pred was missing in the original; these assumed values reproduce the output below
confusion_matrix(y_true, y_pred)
array([[2, 1],
       [1, 2]])
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"] y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"] confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])
ROC curve (receiver operating characteristic): each point on the ROC curve reflects the sensitivity to the same signal stimulus at a particular threshold.

x-axis: false positive rate (FPR), the proportion of all negative instances that are classified as positive; equal to 1 - Specificity.

y-axis: true positive rate (TPR), i.e. Sensitivity, the proportion of all positive instances that are covered (correctly identified as positive).

For a binary classification problem, every instance is either positive or negative. When predictions are made, four situations can occur:

TP: the number of correctly identified positives.
If an instance is positive and is predicted as positive, it is a true positive (TP).

FN: misses, the number of positives for which no correct match was found.
If an instance is positive but is predicted as negative, it is a false negative (FN).

FP: false alarms, the number of incorrect positive matches.
If an instance is negative but is predicted as positive, it is a false positive (FP).

TN: the number of correctly rejected non-matches.
If an instance is negative and is predicted as negative, it is a true negative (TN).

The contingency table is as follows, where 1 denotes the positive class and 0 the negative class:

| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actual 1 | TP | FN |
| Actual 0 | FP | TN |

From this table, the formulas for the two axes follow:

(1) True positive rate (TPR): TP/(TP+FN), the proportion of all actual positive instances that the classifier predicts as positive; this is the Sensitivity.

(2) False positive rate (FPR): FP/(FP+TN), the proportion of all actual negative instances that the classifier predicts as positive; equal to 1 - Specificity.

(3) True negative rate (TNR): TN/(FP+TN), the proportion of all actual negative instances that the classifier predicts as negative; TNR = 1 - FPR. This is the Specificity.
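A minimal sketch of these three rates computed from made-up binary labels (1 = positive, 0 = negative):

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # made-up ground truth
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # made-up predictions

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

TPR = TP / (TP + FN)    # sensitivity
FPR = FP / (FP + TN)    # 1 - specificity
TNR = TN / (FP + TN)    # specificity, equals 1 - FPR
print(TPR, FPR, TNR)    # 0.75 0.25 0.75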
Suppose we use a logistic regression classifier, which outputs, for each instance, the probability that it is positive. We then set a threshold, say 0.6: instances with probability greater than or equal to 0.6 are classified as positive, the rest as negative. This yields one (FPR, TPR) pair, i.e. one point in the plane. As the threshold is gradually lowered, more and more instances are classified as positive, but these predicted positives also include more and more actual negatives, so TPR and FPR increase together. At the largest threshold the corresponding point is (0, 0); at the smallest threshold it is (1, 1).

In the figure below, the solid line in panel (a) is the ROC curve, and every point on it corresponds to one threshold.

x-axis FPR: 1 - TNR, i.e. 1 - Specificity. The larger the FPR, the more actual negatives there are among the predicted positives.

y-axis TPR: Sensitivity (positive-class coverage). The larger the TPR, the more actual positives there are among the predicted positives.

The ideal target is TPR = 1 and FPR = 0, i.e. the point (0, 1). The closer the ROC curve gets to (0, 1) and the further it departs from the 45-degree diagonal, the better; larger Sensitivity and Specificity mean better performance.

For a specific classifier and test set we obviously obtain only one classification result, i.e. a single pair of FPR and TPR values. To draw a curve we actually need a whole series of FPR and TPR values. How do we get them? Recall Wikipedia's definition of the ROC curve: a plot of a binary classifier's performance as its discrimination threshold is varied.

The key is the phrase "as its discrimination threshold is varied". How should we understand this "discrimination threshold"? We have been ignoring an important capability of classifiers: probability output, i.e. how strongly the classifier believes a sample belongs to the positive (or negative) class. By digging into the inner workings of each classifier we can always find some way to obtain a probability output, usually by mapping a real-valued score into the interval (0, 1).

Assume we already have the probability output (probability of being positive) for every sample; how do we then vary the "discrimination threshold"? We sort the test samples by their predicted probability of being positive, from largest to smallest. The figure below shows an example with 20 test samples: the "Class" column gives each sample's true label (p for positive, n for negative), and "Score" is its predicted probability of being positive.

Next, going from the highest "Score" to the lowest, we take each value in turn as the threshold: a test sample is classified as positive if its probability of being positive is greater than or equal to the threshold, and as negative otherwise. For example, for the 4th sample in the figure the "Score" is 0.6, so samples 1, 2, 3 and 4 are classified as positive (their scores are all at least 0.6) and all other samples as negative. Each distinct threshold gives one pair of FPR and TPR values, i.e. one point on the ROC curve. In this way we obtain 20 pairs of FPR and TPR values, which are plotted as the ROC curve in the figure below:

Setting the threshold to 1 and to 0 gives the two points (0, 0) and (1, 1) on the ROC curve. Connecting all the (FPR, TPR) pairs produces the ROC curve; the more threshold values there are, the smoother the curve.

In fact, we do not necessarily need the probability that each test sample is positive; a "score" assigned by the classifier to each sample is enough (the score does not have to lie in (0, 1)). The higher the score, the more confident the classifier is that the sample is positive, and each score value is used in turn as a threshold. I find it easier to understand when the scores are converted to probabilities.
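The threshold sweep described above can be sketched in a few lines; the labels and scores below are made up, and sklearn.metrics.roc_curve performs the same sweep (and also returns the thresholds it used):

import numpy as np

y     = np.array([1, 1, 0, 1, 0, 0, 1, 0])                    # made-up labels, 1 = positive
score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])  # made-up classifier scores

points = []
for thr in sorted(score, reverse=True):        # use each score value as the threshold
    pred = (score >= thr).astype(int)
    tpr = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)
    fpr = np.sum((pred == 1) & (y == 0)) / np.sum(y == 0)
    points.append((fpr, tpr))

print(points)   # one (FPR, TPR) point per threshold; plotting them gives the ROC curve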
AUC (Area Under the Curve): the area under the ROC curve, a value between 0 and 1. As a single number, AUC gives a direct assessment of classifier quality: the larger, the better.

The AUC is itself a probability: if you randomly pick one positive sample and one negative sample, the AUC is the probability that the classifier's score ranks the positive sample above the negative one. The larger the AUC, the more likely the classifier is to rank positives ahead of negatives, and hence the better it separates the classes.
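This ranking interpretation can be verified numerically: count the fraction of (positive, negative) pairs in which the positive sample receives the higher score, and compare with sklearn's roc_auc_score. Labels and scores are again made up:

import numpy as np
from sklearn.metrics import roc_auc_score

y     = np.array([1, 1, 0, 1, 0, 0, 1, 0])
score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1])

pos, neg = score[y == 1], score[y == 0]
# fraction of (positive, negative) pairs ranked correctly (ties count as 1/2)
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(np.mean(pairs))           # pairwise ranking estimate of the AUC
print(roc_auc_score(y, score))  # the same value from sklearn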
With so many metrics already available, why use ROC and AUC at all? Because the ROC curve has a very useful property: it remains unchanged when the distribution of positive and negative samples in the test set changes. Real data sets frequently exhibit class imbalance, i.e. a large gap between the numbers of positive and negative samples, and the positive/negative ratio in the test data may also shift over time. The figure below compares ROC curves with Precision-Recall curves:

In the figure above, (a) and (c) are ROC curves, while (b) and (d) are Precision-Recall curves.

(a) and (b) show the classifier's results on the original test set (balanced positive/negative distribution); (c) and (d) show its results after the number of negative samples in the test set is increased to 10 times the original. It is clear that the ROC curves remain essentially unchanged, while the Precision-Recall curves change considerably.
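This robustness to class imbalance can be illustrated with a small simulation: keep the score distributions of the two classes fixed, multiply the number of negatives by 10, and compare the ROC AUC with the average precision (the area under the Precision-Recall curve). The score distributions below are made up:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.RandomState(0)

def evaluate(n_pos, n_neg):
    # positives score higher on average; both score distributions stay fixed
    scores = np.concatenate([rng.normal(1.0, 1.0, n_pos),
                             rng.normal(0.0, 1.0, n_neg)])
    labels = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

print(evaluate(1000, 1000))    # balanced test set
print(evaluate(1000, 10000))   # 10x more negatives: ROC AUC barely moves, average precision drops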
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.5)

# SVM with RBF kernel, gamma=1
clf = svm.SVC(kernel='rbf', C=1, gamma=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=2)
print('gamma=1 AUC= ', metrics.auc(fpr, tpr))

# SVM with RBF kernel, gamma=10
clf = svm.SVC(kernel='rbf', C=1, gamma=10).fit(X_train, y_train)
y_pred_rbf = clf.predict(X_test)
fpr_rbf, tpr_rbf, thresholds_rbf = metrics.roc_curve(y_test, y_pred_rbf, pos_label=2)
print('gamma=10 AUC= ', metrics.auc(fpr_rbf, tpr_rbf))

# 3-nearest-neighbours classifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_knn = neigh.predict(X_test)
fpr_knn, tpr_knn, thresholds_knn = metrics.roc_curve(y_test, y_pred_knn, pos_label=2)
print('knn AUC= ', metrics.auc(fpr_knn, tpr_knn))

plt.figure()
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='gamma=1')
plt.plot(fpr_rbf, tpr_rbf, label='gamma=10')
plt.plot(fpr_knn, tpr_knn, label='knn')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
gamma=1 AUC=  0.927469135802
gamma=10 AUC=  0.927469135802
knn AUC=  0.936342592593
According to the plot above, the SVM with gamma=10 is noticeably better than with gamma=1.
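The longer example below is scikit-learn's "Feature transformations with ensembles of trees" demo; it compares ROC curves for random-tree embeddings, random forests and gradient-boosted trees, each optionally followed by a logistic regression trained on the one-hot-encoded leaf indices: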
# Author: Tim Head <betatim@gmail.com>
#
# License: BSD 3 clause

import numpy as np
np.random.seed(10)

import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

n_estimator = 10
X, y = make_classification(n_samples=80000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# It is important to train the ensemble of trees on a different subset
# of the training data than the linear regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train,
                                                            y_train,
                                                            test_size=0.5)

# Unsupervised transformation based on totally random trees
rt = RandomTreesEmbedding(max_depth=3, n_estimators=n_estimator,
                          random_state=0)
rt_lm = LogisticRegression()
pipeline = make_pipeline(rt, rt_lm)
pipeline.fit(X_train, y_train)
y_pred_rt = pipeline.predict_proba(X_test)[:, 1]
fpr_rt_lm, tpr_rt_lm, _ = roc_curve(y_test, y_pred_rt)

# Supervised transformation based on random forests
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf_enc = OneHotEncoder()
rf_lm = LogisticRegression()
rf.fit(X_train, y_train)
rf_enc.fit(rf.apply(X_train))
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(X_test)))[:, 1]
fpr_rf_lm, tpr_rf_lm, _ = roc_curve(y_test, y_pred_rf_lm)

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)

y_pred_grd_lm = grd_lm.predict_proba(
    grd_enc.transform(grd.apply(X_test)[:, :, 0]))[:, 1]
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)

# The gradient boosted model by itself
y_pred_grd = grd.predict_proba(X_test)[:, 1]
fpr_grd, tpr_grd, _ = roc_curve(y_test, y_pred_grd)

# The random forest model by itself
y_pred_rf = rf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rt_lm, tpr_rt_lm, label='RT + LR')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_rf_lm, tpr_rf_lm, label='RF + LR')
plt.plot(fpr_grd, tpr_grd, label='GBT')
plt.plot(fpr_grd_lm, tpr_grd_lm, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()
Prediction error can also be measured directly with the mean absolute error (MAE) and the mean squared error (MSE). The absolute value function is not differentiable at zero, which makes it awkward to use in gradient descent, so the MSE is often used instead.
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.5)
clf = svm.SVC(kernel='rbf', C=1, gamma=1).fit(X_train, y_train)
y_pred = clf.predict(X_test)
error = metrics.mean_absolute_error(y_test, y_pred)
print('mean_absolute_error: ', error)
print('mean_square_error: ', metrics.mean_squared_error(y_test, y_pred))
mean_absolute_error:  0.04
mean_square_error:  0.04
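Finally, K-fold cross validation: instead of relying on a single train/test split, the data are divided into K folds, and each fold serves once as the test set while the remaining K-1 folds are used for training, giving a more stable performance estimate. A small example with sklearn's KFold: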
import numpy as np
from sklearn.model_selection import KFold

X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
kf = KFold(n_splits=10, random_state=3, shuffle=True)
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 5 6 7 8 9] [4]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 3 4 5 6 7 8] [9]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[1 2 3 4 5 6 7 8 9] [0]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 4 5 6 7 9] [8]
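In practice the fold indices are rarely iterated by hand; a minimal sketch of the more common route, passing the KFold object to cross_val_score (the SVC model and its parameters here are only an illustrative choice):

from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
kf = KFold(n_splits=10, random_state=3, shuffle=True)

clf = svm.SVC(kernel='rbf', C=1, gamma=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=kf)   # default scoring for a classifier is accuracy
print(scores)          # one score per fold
print(scores.mean())   # averaged cross-validation accuracy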