[Machine Learning] Regression: Decision Tree Regression

Date: 2019-11-08

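A CART decision tree is short for Classification And Regression Tree. When the dependent variable is continuous, the algorithm builds a regression tree and uses the mean of the observations in a leaf as the prediction. When the dependent variable is discrete, it builds a classification tree, which handles classification problems well. Note that CART always produces a binary tree: each non-leaf node has exactly two branches. As a result, when a splitting variable is discrete with more than two levels, that variable may be used for splitting several times along a path.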
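
To make the "mean of the leaf" idea concrete, here is a minimal sketch of how a single regression split could be chosen: try every threshold on one feature and keep the one that minimizes the total squared error around each side's mean. The helper name `best_split` and the toy data are assumptions for illustration, not sklearn's implementation.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) of the binary split on x that minimizes
    the summed squared error around the mean of each side."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # equal values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # place the threshold halfway between the two neighbouring values
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Toy data (made up for illustration): two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

threshold, sse = best_split(x, y)
left_mean = y[x <= threshold].mean()   # prediction for the left leaf
right_mean = y[x > threshold].mean()   # prediction for the right leaf
print(threshold, left_mean, right_mean)
```

A full tree simply repeats this search inside each resulting half until a stopping rule applies.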

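In sklearn, the main hyperparameters we can use to improve a decision tree's generalization are:
- max_depth: the maximum depth of the tree. Once a branch reaches max_depth it is not split further, however many candidate splits remain.
- min_samples_split: the minimum number of samples a node must contain before it can be split. A node with fewer samples becomes a leaf. A leaf predicts the majority class in a classification tree and the mean of its targets in a regression tree.
- min_samples_leaf: the minimum number of samples required in a leaf. A candidate split is rejected if it would leave either child with fewer samples than this, and the node then stays a leaf.
- min_weight_fraction_leaf: the minimum fraction of the total sample weight that a leaf must hold. When no sample_weight is given, all samples carry equal weight.
- max_leaf_nodes: the maximum number of leaves. None means no limit; with an integer, the tree is grown best-first until it has that many leaves. Older scikit-learn documentation stated that max_depth is ignored in this case; recent versions appear to apply both limits.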
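
The effect of these hyperparameters can be inspected directly on a fitted tree. The sketch below is only an illustration: the noisy sine data and the specific parameter values are assumptions, and `get_depth()` / `get_n_leaves()` require a reasonably recent scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.3 * rng.randn(200)

# One constraint at a time, compared with an unconstrained tree
settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=40": {"min_samples_split": 40},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=6": {"max_leaf_nodes": 6},
}

results = {}
for name, params in settings.items():
    tree = DecisionTreeRegressor(random_state=0, **params).fit(X, y)
    results[name] = (tree.get_depth(), tree.get_n_leaves())
    print(f"{name:24s} depth={results[name][0]:2d} leaves={results[name][1]:3d}")
```

Each constraint shrinks the tree in a different way, which is why they are usually tuned together rather than in isolation.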

The data used here is the salary associated with each promotion level inside a company.

Let's look at how to implement this in Python.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
# Note: 1:2 selects only the column at index 1; unlike a plain 1, it returns a 2-D matrix rather than a 1-D vector.
y = dataset.iloc[:, 2].values
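
The difference between `1:2` and `1` is easy to check on a small stand-in frame. The column names and values below are assumptions, since the contents of Position_Salaries.csv are not shown here.

```python
import pandas as pd

# Stand-in for the CSV (column names and values are assumptions)
dataset = pd.DataFrame({
    "Position": ["Analyst", "Manager", "CEO"],
    "Level": [1, 5, 10],
    "Salary": [45000, 110000, 1000000],
})

X_2d = dataset.iloc[:, 1:2].values  # a slice keeps the column axis
X_1d = dataset.iloc[:, 1].values    # an integer index drops it
print(X_2d.shape)
print(X_1d.shape)
```

sklearn estimators expect the feature matrix to be 2-D, with shape (n_samples, n_features), which is why the slice form is used for X.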

Now for the main topic, fitting the Decision Tree Regression model:

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)

y_pred = regressor.predict([[6.5]])  # predict expects a 2-D array of shape (n_samples, n_features)

# Plot the fit on a fine grid
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (Decision Tree Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

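The following code explores the relationship between a decision tree's maximum depth and overfitting, and shows how the maximum depth affects the fit.

As with classification trees, increasing the maximum depth improves the fit to the training set, but it may also reduce the tree's ability to generalize.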

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(10 * rng.rand(160, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 2 * (0.5 - rng.rand(32)) # add noise to every fifth point

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=4)
regr_3 = DecisionTreeRegressor(max_depth=8)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

# Predict
X_test = np.arange(0.0, 10.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=4", linewidth=2)
plt.plot(X_test, y_3, color="r", label="max_depth=8", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

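The test above shows that the tree's ability to fit the training data keeps rising as the maximum depth grows.
This example has 160 samples. At a maximum depth of 8 (greater than log2(160) ≈ 7.3), the tree can have up to 2^8 = 256 leaves, more than the number of samples, so it fits not only the true signal but also the noise we added, which reduces its ability to generalize.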
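
A quick way to see this is to count the leaves at each depth and compare them with the number of samples. The sketch below reuses the data-generating recipe from the example above; the fixed `random_state=0` on the trees is an assumption added only to make repeated runs identical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same data-generating recipe as the example above
rng = np.random.RandomState(1)
X = np.sort(10 * rng.rand(160, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 2 * (0.5 - rng.rand(32))

summary = {}
for depth in (2, 4, 8):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    # number of leaves and R^2 on the training data itself
    summary[depth] = (tree.get_n_leaves(), tree.score(X, y))
    print(f"max_depth={depth}: leaves={summary[depth][0]}, "
          f"training R^2={summary[depth][1]:.3f}")
```

The closer the leaf count gets to the sample count, the more each leaf describes individual points, noise included, rather than the underlying curve.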

Maximum depth versus training and test error

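Below we plot the training and test scores (R², so a higher score means a lower error) of decision trees with different maximum depths.
You can of course run the same kind of test on the other hyperparameters that control how the tree is grown.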

from sklearn import model_selection
def creat_data(n):
    np.random.seed(0)
    X = 5 * np.random.rand(n, 1)
    y = np.sin(X).ravel()
    noise_num = len(y[::5])  # number of samples that receive noise
    y[::5] += 3 * (0.5 - np.random.rand(noise_num)) # add noise to every 5th sample
    return model_selection.train_test_split(X, y,test_size=0.25,random_state=1)
def test_DecisionTreeRegressor_depth(*data,maxdepth):
    X_train,X_test,y_train,y_test=data
    depths=np.arange(1,maxdepth)
    training_scores=[]
    testing_scores=[]
    for depth in depths:
        regr = DecisionTreeRegressor(max_depth=depth)
        regr.fit(X_train, y_train)
        training_scores.append(regr.score(X_train,y_train))
        testing_scores.append(regr.score(X_test,y_test))

    ## Plot
    fig=plt.figure()
    ax=fig.add_subplot(1,1,1)
    ax.plot(depths,training_scores,label="training score")
    ax.plot(depths,testing_scores,label="testing score")
    ax.set_xlabel("maxdepth")
    ax.set_ylabel("score")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()

X_train,X_test,y_train,y_test=creat_data(200)    
test_DecisionTreeRegressor_depth(X_train,X_test,y_train,y_test,maxdepth=12)

From the plot above we can see that, when the dataset is split with train_test_split, a maximum depth of 2 is the best value for this hyperparameter.

In the same way, you can test the other hyperparameters, switch to cross-validation (cv), or use tools such as hyperopt or auto-sklearn.
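
As one possible way to do the cross-validated version, here is a minimal sketch using sklearn's `GridSearchCV`. The parameter grid and the 5-fold setting are assumptions chosen for illustration; the data follows the same recipe as `creat_data(200)` but is searched without a separate hold-out split.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Same recipe as creat_data(200), without the train/test split
np.random.seed(0)
X = 5 * np.random.rand(200, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - np.random.rand(40))  # add noise to every 5th sample

# Hypothetical grid: vary depth and leaf size together
param_grid = {
    "max_depth": list(range(1, 12)),
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # combination with the best mean validation R^2
print(search.best_score_)   # that mean validation R^2
```

Averaging over several folds makes the chosen depth less dependent on one particular train/test split than the single-split plot above.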