K-Fold Cross-Validation: Differences and Connections Between StratifiedKFold and KFold

A key step when training a neural network is evaluating the model's ability to generalize. If a model performs poorly, it is either too complex and overfits (high variance) or too simple and underfits (high bias). To diagnose this, we need two cross-validation techniques for estimating generalization performance: holdout cross-validation and k-fold cross-validation.

In k-fold cross-validation, the scores from the K folds are averaged to evaluate the model, so using k-fold cross-validation to search for the best hyperparameters is more stable than the holdout method. Once the best hyperparameters are found, the model is retrained on the full original dataset with those parameters to produce the final model.
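A minimal sketch of this workflow, assuming scikit-learn's cross_val_score with a logistic-regression classifier on the iris data purely for illustration (none of these choices come from the example later in this post):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is held out once, the rest trains the model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())

The mean of the per-fold scores is the quantity used to compare hyperparameter settings.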

1 KFold Explained

class sklearn.model_selection.KFold(n_splits=5, shuffle=False, random_state=None)

Splits the dataset into K consecutive folds (without shuffling by default) and returns the index arrays of each split.

Each fold is then used once as the validation set while the remaining k-1 folds form the training set; a short usage sketch follows at the end of this section.

Parameters:

n_splits: int, default 5 (3 before v0.22); must be at least 2.

shuffle: optional, default False, i.e. the data are not shuffled before splitting.

random_state: optional, default None; if an int, it is the seed used by the random number generator (only relevant when shuffle=True).

Link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn-model-selection-kfold

An important method of this class:

split(self, X, y=None, groups=None)

 

Generate indices to split the data into training and test sets.

Parameters:
X  array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

y  array-like, shape (n_samples,)

The target variable for supervised learning problems.

groups  object, optional

Always ignored, exists for compatibility; KFold does not use group labels.

Yields:
train  ndarray

The training set indices for that split.

test  ndarray

The testing set indices for that split.
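A minimal usage sketch with made-up toy data (not taken from the scikit-learn docs): split only yields index arrays, and the caller indexes into the data itself.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 toy samples with 2 features each

kf = KFold(n_splits=3)
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    # split() yields indices, not data; use them to slice X
    print("fold", fold, "train indices:", train_index, "test indices:", test_index)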

2 StratifiedKFold Explained

class sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=False, random_state=None)

A stratified K-fold cross-validator, i.e. it performs stratified sampling when splitting.

It splits the dataset into K consecutive folds (without shuffling by default) and returns the index arrays of each split.

This cross-validator is a variant of KFold that returns stratified folds: each fold is made by preserving the percentage of samples of each class, which guarantees that the class proportions in every subset match those of the original dataset. A short sketch at the end of this section illustrates this.

Parameters:

n_splits: int, default 5 (3 before v0.22); must be at least 2.

shuffle: optional, default False, i.e. the data are not shuffled before splitting.

random_state: optional, default None; if an int, it is the seed used by the random number generator (only relevant when shuffle=True).

In fact, the constructor parameters are identical to those of KFold; what differs is how the two classes use the arguments of their split methods.

split(self, X, y, groups=None)

Generate indices to split the data into training and test sets.

Parameters:

X: array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples,)

The target variable for supervised learning problems; stratification is done based on the y labels.

groups: object

Always ignored, exists for compatibility.

Yields:

train: ndarray, the training set indices for that split.

test: ndarray, the test set indices for that split.
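To see the stratification at work, here is a small sketch with deliberately imbalanced toy labels (made up for illustration):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 1))               # dummy features; only the labels matter here
y = np.array([0] * 8 + [1] * 4)     # 8 samples of class 0, 4 of class 1 (ratio 2:1)

skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    # each test fold keeps the 2:1 class ratio of the full label set
    print("test labels:", y[test_index])

Every printed test fold contains two 0s and one 1, matching the overall 2:1 ratio of the labels.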

3 GroupKFold

sklearn.model_selection.GroupKFold

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold
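GroupKFold is the K-fold variant with non-overlapping groups: samples sharing a group label never appear in both the training set and the test set of the same fold. A minimal sketch with made-up group labels:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])   # e.g. samples recorded from the same subject

gkf = GroupKFold(n_splits=3)
for train_index, test_index in gkf.split(X, y, groups=groups):
    # no group label appears on both sides of the same split
    print("train groups:", groups[train_index], "test groups:", groups[test_index])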

4 Worked Example

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Build a toy training set
train_num = np.array(range(10)).reshape(5, 2)
# Build the label array
label_str = np.array(["a", "b", "a", "a", "b"])
# print("train_num:\n", train_num)
# print("label_str:\n", label_str)
# Instantiate the stratified splitter
sfk = StratifiedKFold(n_splits=3)
for train_index, test_index in sfk.split(train_num, label_str):
    new_train_num, new_test_num = train_num[train_index], train_num[test_index]
    new_train_label, new_test_label = label_str[train_index], label_str[test_index]

    print('train set:{} \n'.format(new_train_num))
    print('train label set:{} \n'.format(new_train_label))
    print('test set:{} \n'.format(new_test_num))
    print('test label set:{} \n'.format(new_test_label))

Running this gives:

train set:[[4 5]
 [6 7]
 [8 9]] 
train label set:['a' 'a' 'b'] 
test set:[[0 1]
 [2 3]] 
test label set:['a' 'b'] 
train set:[[0 1]
 [2 3]
 [6 7]] 
train label set:['a' 'b' 'a'] 
test set:[[4 5]
 [8 9]] 
test label set:['a' 'b'] 
train set:[[0 1]
 [2 3]
 [4 5]
 [8 9]] 
train label set:['a' 'b' 'a' 'b'] 
test set:[[6 7]] 
test label set:['a'] 

Under these conditions the code runs (scikit-learn may warn that the least populated class has fewer members than n_splits, but it still produces the folds above).

Replacing StratifiedKFold with KFold in this code gives a similar result, but in KFold's split method y defaults to None and is ignored, so when you want stratified sampling it is better to use the StratifiedKFold class.

With the n_splits=3 used in this code, StratifiedKFold and KFold behave similarly; with n_splits=4, however, KFold still runs and produces output, while StratifiedKFold raises an error:

ValueError: n_splits=4 cannot be greater than the number of members in each class.

The error arises because n_splits=4 exceeds the number of samples in every class (three "a" and two "b"), so stratified folds cannot be formed. Seen from this angle, when you want stratified sampling for cross-validation it is best to simply use the StratifiedKFold class, which makes such problems explicit instead of silently ignoring the labels as KFold does.
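A quick way to reproduce this difference (a small sketch reusing the toy data from the example above):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

train_num = np.array(range(10)).reshape(5, 2)
label_str = np.array(["a", "b", "a", "a", "b"])   # class "a": 3 samples, class "b": 2 samples

# KFold ignores the labels and happily produces 4 folds
for train_index, test_index in KFold(n_splits=4).split(train_num):
    print("KFold test indices:", test_index)

# StratifiedKFold refuses: n_splits=4 is larger than the size of every class
try:
    list(StratifiedKFold(n_splits=4).split(train_num, label_str))
except ValueError as err:
    print("StratifiedKFold error:", err)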

Typical use of the KFold class looks like this:

import numpy as np
from sklearn.model_selection import KFold

# Build a toy training set
train_num = np.array(range(10)).reshape(5, 2)
# print("train_num:\n", train_num)

kf = KFold(n_splits=4)
for train_index, test_index in kf.split(train_num):
    new_train_num, new_test_num = train_num[train_index], train_num[test_index]
    print('train set:{} \ntest set:{} \n'.format(new_train_num, new_test_num))

A snippet from my own application:

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy']
              )

# Split a validation set off from the training data:
# instantiate a 5-fold stratified cross-validator
# (random_state is omitted because it has no effect when shuffle=False)
sfd = StratifiedKFold(n_splits=5, shuffle=False)
for train_index, val_index in sfd.split(train_images, train_labels):
    print("train_images.shape:", train_images.shape)
    new_train_images, new_val_images = train_images[train_index], train_images[val_index]
    new_train_labels, new_val_labels = train_labels[train_index], train_labels[val_index]
    print("new_train_images.shape:", new_train_images.shape)

    # Train the model on this fold
    history = model.fit(new_train_images, new_train_labels,
                        epochs=100,
                        # batch_size=512,
                        validation_data=(new_val_images, new_val_labels),
                        callbacks=[visualization.model_point, visualization.model_stop]
                        )
# Plot the training loss and accuracy curves
visualization.plot_history(history)

 

 

5 train_test_split

If you do not need K-fold cross-validation, a single split with the train_test_split function is enough.

from sklearn.model_selection import train_test_split

# val_train_ratio is the fraction of samples held out for validation (e.g. 0.2)
train_images, val_images, train_labels, val_labels = \
    train_test_split(train_images, train_labels,
                     test_size=val_train_ratio,
                     )
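Note that train_test_split also supports a stratified single split through its stratify parameter, which mirrors for one split what StratifiedKFold does for K folds. A minimal sketch with toy data:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 4)          # 6 samples of class 0, 4 of class 1

# stratify=y keeps the 6:4 class ratio in both the training and validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
print("train labels:", y_train)
print("val labels:", y_val)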

 

References

sklearn.model_selection.StratifiedKFold
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

sklearn.model_selection.KFold
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold

sklearn.model_selection.GroupKFold
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold

"Sklearn" data-splitting methods
http://www.javashuo.com/article/p-wdurtplz-de.html

Dataset splitting in sklearn
http://www.javashuo.com/article/p-xqibacgy-dn.html

On the KFold and StratifiedKFold splitting commonly used in binary-classification competitions
http://www.luyixian.cn/news_show_24013.aspx

f1_score and StratifiedKFold in sklearn
https://www.jianshu.com/p/4b9f359b4898

The difference between StratifiedKFold and KFold
https://blog.csdn.net/zhangbaoanhadoop/article/details/79559011

A comparison of StratifiedKFold and KFold
https://www.jianshu.com/p/c84818b56fa0

TensorFlow series (2): Machine learning basics
http://blog.itpub.net/31555081/viewspace-2218763/

Evaluating model performance with K-fold cross-validation
https://ljalphabeta.gitbooks.io/python-/content/kfold.html
