A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k "folds":
1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
The passage above is quoted from the sklearn documentation's description of CV. It describes a general rule of cross-validation: when solving a real problem, we split the whole dataset into a train_set (e.g., 70%) and a test_set (30%), do cross-validation on the train_set, take the average, and only then use the test_set to measure the model's accuracy. We do not run cross-validation directly on the whole dataset (this was one of my own misconceptions about CV).
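As a minimal sketch of that workflow (the 70/30 split, the iris data, and the SVC settings here are illustrative choices, not part of the quoted docs; newer sklearn exposes these helpers under sklearn.model_selection):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 1. Hold out a final test set that cross-validation never touches.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2. Cross-validate on the training set only and average the scores.
clf = SVC(kernel='rbf', C=1)
print(cross_val_score(clf, X_train, y_train, cv=5).mean())

# 3. Only at the very end, fit on the training set and evaluate once
#    on the held-out test set.
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```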
k-fold

I originally did not plan to write about cross-validation, but I realized I had quite a few misconceptions about it myself, so I am writing it down; if anyone reads this, feel free to point out mistakes.
1. A model is trained using k-1 of the folds as training data;
2. the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
Premise: the whole dataset has been split into a training set D (70%) and a test set T (30%).
The quoted passage above is the entire k-fold procedure (only the training set D is involved at this point):
1. Split the entire training set D into k equal-sized subsets, then select k-1 of them as the training data and train a model, model1.
2. Use the remaining subset D_i as the validation set (it plays exactly the role of a test set) to measure model1's accuracy; for model evaluation measures, refer to the ones sklearn implements.
3. Repeat the process above k times, so that each subset is used as the validation set exactly once, then average the accuracies; that average is the accuracy attributed to the method.
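A minimal sketch of these three steps, assuming an SVC as the model and accuracy as the measure (KFold is shown under its newer sklearn.model_selection path):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for the training set D

kf = KFold(n_splits=5, shuffle=True, random_state=0)
accuracies = []
for train_idx, val_idx in kf.split(X):
    # Step 1: train on k-1 of the k subsets.
    model = SVC(kernel='rbf', C=1).fit(X[train_idx], y[train_idx])
    # Step 2: validate on the remaining subset D_i.
    accuracies.append(model.score(X[val_idx], y[val_idx]))

# Step 3: every subset served as validation exactly once; average.
print(sum(accuracies) / len(accuracies))
```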
Some may ask: which model weights θ does the averaged accuracy correspond to? That question calls for understanding what machine learning is for. The goal is not to find the weights of some particular model, but, for the actual problem at hand, to select a suitable model (say, a support vector machine) and suitable hyperparameters (say, the kernel, C, etc.). The averaged accuracy above is attached to that model + hyperparameter combination.
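For instance, a hedged sketch of choosing between two candidate values of C by their mean CV accuracy (the candidates and data are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The quantity being compared belongs to a (model, hyperparameter)
# choice, not to any single trained weight vector.
for C in (1, 100):
    mean_acc = cross_val_score(SVC(kernel='rbf', C=C), X, y, cv=5).mean()
    print("C=%d: mean CV accuracy %.3f" % (C, mean_acc))
```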
With k-fold understood, we can talk about GridSearch, because GridSearch defaults to 3-fold; without understanding cross-validation, it is hard to make sense of it.
What it is for

GridSearch exists to solve the hyperparameter tuning problem. For example, an SVM's common parameters include kernel, gamma, C, and so on. Tuning them by hand is too slow, and a hand-written loop can only run sequentially, not in parallel. Hence GridSearch: with it you can directly find the optimal parameters.
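The parallelism comes from GridSearchCV's n_jobs parameter; a minimal sketch (the grid and data are placeholders, and the train/test split is omitted for brevity):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# n_jobs=-1 evaluates the parameter combinations on all CPU cores in
# parallel, which a hand-written sequential tuning loop cannot do.
search = GridSearchCV(SVC(), {'C': [1, 10, 100]}, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```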
How the parameters are passed

The params are dicts; every combination of the fields inside each dict is fed into the classifier and run.
```python
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
```

How it evaluates

Once the parameters are passed in, the predictive power of the model under each parameter combination has to be assessed. GridSearch runs k-fold on the data, computes each combination's average accuracy, selects the best parameters, and returns them.
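To see exactly which combinations such a list of dicts expands into, sklearn's ParameterGrid (the same expansion GridSearchCV performs internally, shown here under the newer module path) can enumerate them:

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Each dict expands to the Cartesian product of its value lists:
# 2 * 4 = 8 rbf combinations plus 4 linear ones, 12 in total.
for params in ParameterGrid(tuned_parameters):
    print(params)
```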
Generally GridSearch does its k-fold only on the training set and does not use the test set. The test set is kept for the very end: once GridSearch has picked the best model, the test set is used to test that model's generalization ability.
Here is an example from the sklearn documentation:
```python
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images,
# to turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the grid search parameters
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# Set the model evaluation method; if this is unclear, see the
# k-fold section above
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    # Construct the GridSearchCV classifier, 5-fold
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    # k-fold only on the training set; this selects the best parameters
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    # Print the best parameters
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    # Test the best model's generalization ability on the test set
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
```

(Note that this example uses the pre-0.18 sklearn API: in newer versions train_test_split and GridSearchCV live in sklearn.model_selection, and grid_scores_ has been replaced by cv_results_.)

The example above follows the usual pattern. The SVC in it supports multi-class classification; by default it uses the one-vs-one (ovo) scheme. If you need to change that, you can set decision_function_shape='ovr'; see the SVC API documentation for details.
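A short hedged illustration of the ovo/ovr difference: with n classes, the one-vs-one decision function has n*(n-1)/2 columns, while one-vs-rest has n:

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

ovo = SVC(decision_function_shape='ovo').fit(X, y)
ovr = SVC(decision_function_shape='ovr').fit(X, y)

# 10 digit classes: ovo yields 10*9/2 = 45 pairwise columns, ovr yields 10.
print(ovo.decision_function(X[:1]).shape)  # (1, 45)
print(ovr.decision_function(X[:1]).shape)  # (1, 10)
```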
Points to note

1. Does GridSearch support multi-class classification?
GridSearch merely assembles the parameter combinations and feeds the data to the model in the k-fold fashion, then evaluates the model's accuracy. It is not itself a new classification method, so as long as the estimator you choose can handle multi-class problems, it works; the handwritten-digit recognition above is exactly such a problem. The model evaluation measure you choose must also suit multi-class problems: when you evaluate the model with roc_auc, you need to pay attention to the data format.
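Regarding the roc_auc caveat: the score is defined for binary targets, so one common approach for multi-class data, sketched here under that assumption, is to binarize the labels into one column per class:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Binarize the labels into an indicator matrix, one column per class,
# and score per-class decision values against it. (Scored on the
# training data here purely to illustrate the required shapes.)
clf = SVC(decision_function_shape='ovr').fit(X, y)
y_bin = label_binarize(y, classes=[0, 1, 2])   # shape (n_samples, 3)
scores = clf.decision_function(X)              # shape (n_samples, 3)
print(roc_auc_score(y_bin, scores, average='weighted'))
```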
2. A GridSearch estimator is sometimes nested, for example in AdaBoost ensemble learning, and then GridSearch needs to support nested parameters. A double underscore __ marks a parameter as nested, i.e. a parameter of the inner estimator. (I have not tried this myself; I have only seen others say so...) Of course, GridSearch also has APIs aimed specifically at ensemble learning.
This blog post has an example of nested parameters:
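As a rough sketch of the double-underscore idea (not from that post; note that sklearn renamed AdaBoost's inner-estimator argument from base_estimator to estimator in version 1.2, which also changes the prefix in the parameter name):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 'estimator__max_depth' reaches through AdaBoost into the inner
# decision tree: <outer argument name>__<inner parameter name>.
# On sklearn < 1.2 the key would be 'base_estimator__max_depth'.
param_grid = {
    'n_estimators': [50, 100],          # AdaBoost's own parameter
    'estimator__max_depth': [1, 2],     # nested: the tree's parameter
}
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier())
search = GridSearchCV(ada, param_grid, cv=3).fit(X, y)
print(search.best_params_)
```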
———2017.4.18