简单易懂的GBDT

时间 2019-11-18

标签简单易懂 gbdt 繁體版

原文原文链接

转https://www.cnblogs.com/liuyu124/p/7333080.htmlhtml

梯度提高决策树（Gradient Boosting Decision Tree，GBDT）算法是近年来被说起比较多的一个算法，这主要得益于其算法的性能，以及该算法在各种数据挖掘以及机器学习比赛中的卓越表现，有不少人对GBDT算法进行了开源代码的开发，比较火的是陈天奇的XGBoost和微软的LightGBM。node

1、监督学习

一、监督学习的主要任务

监督学习是机器学习算法中重要的一种，对于监督学习，假设有git

其中，github

为了可以对映射算法

对于一个具体的问题，如线性回归问题，其映射函数的形式为：框架

此时对于最优映射函数dom

二、梯度降低法

梯度降低法（Gradient Descent，GD）算法是求解最优化问题最简单、最直接的方法。梯度降低法是一种迭代的优化算法，对于优化问题：机器学习

其基本步骤为：函数

随机选择一个初始点
重复如下过程：
- 决定降低的方向：
- 选择步长
- 更新：
直到知足终止条件

梯度降低法的具体过程以下图所示：性能

由以上的过程，咱们能够看出，对于最终的最优解

三、在函数空间的优化

以上是在指定的函数空间中对最优函数进行搜索，那么，可否直接在函数空间（function space）中查找到最优的函数呢？根据上述的梯度降低法的思路，对于模型的损失函数

以函数

其中，

由上述的过程能够获得函数

与上面相似，函数

2、Boosting

一、集成方法之Boosting

Boosting方法是集成学习中重要的一种方法，在集成学习方法中最主要的两种方法为Bagging和Boosting，在Bagging中，经过对训练样本从新采样的方法获得不一样的训练样本集，在这些新的训练样本集上分别训练学习器，最终合并每个学习器的结果，做为最终的学习结果，Bagging方法的具体过程以下图所示：

在Bagging方法中，最重要的算法为随机森林Random Forest算法。由以上的图中能够看出，在Bagging方法中，

在Boosting方法中，最重要的方法包括：AdaBoost和GBDT。

二、Gradient Boosting

由上图所示的Boosting方法中，最终的预测结果为

这与上述的在函数空间中的优化相似：

根据如上的函数空间中的优化可知，每次对每个样本的训练的值为：

上创建模型，因为上述是一个求解梯度的过程，所以也称为基于梯度的Boost方法，其具体过程以下所示：

3、Gradient Boosting Decision Tree

在上面简单介绍了Gradient Boost框架，梯度提高决策树Gradient Boosting Decision Tree是Gradient Boost框架下使用较多的一种模型，在梯度提高决策树中，其基学习器是分类回归树CART，使用的是CART树中的回归树。

一、分类回归树CART

分类回归树CART算法是一种基于二叉树的机器学习算法，其既能处理回归问题，又能处理分类为题，在梯度提高决策树GBDT算法中，使用到的是CART回归树算法，对于CART树算法的更多信息，能够参考简单易学的机器学习算法——分类回归树CART。

对于一个包含了

其中，

开始时，CART树中只包含了根结点，全部样本都被划分在根结点上：

此时，计算该节点上的样本的方差（此处要乘以

其中，

那么，什么样本的划分才是当前的最好划分呢？此时计算左右子树的方差之和：

其中，

注意：对于上述最优划分标准的选择，以上的计算过程能够进一步优化。

首先，对于

而对于

经过以上的过程，咱们发现，划分前，记录节点的值为：

当划分后，两个节点的值的和为：

最好的划分，对应着两个节点的值的和的最大值。

二、GBDT——二分类

在梯度提高决策树GBDT中，经过定义不一样的损失函数，能够完成不一样的学习任务，二分类是机器学习中一类比较重要的分类算法，在二分类中，其损失函数为：

套用上面介绍的GB框架，获得下述的二分类GBDT的算法：

在构建每一棵CART回归树的过程当中，对一个样本的预测值应与

在

由Newton-Raphson迭代公式可得：

以参考文献3 Idiots’ Approach for Display Advertising Challenge中提供的代码为例：

GBDT训练的主要代码为：

void GBDT::fit(Problem const &Tr, Problem const &Va) { bias = calc_bias(Tr.Y); //用于初始化的F std::vector<float> F_Tr(Tr.nr_instance, bias), F_Va(Va.nr_instance, bias); Timer timer; printf("iter time tr_loss va_loss\n"); // 开始训练每一棵CART树 for(uint32_t t = 0; t < trees.size(); ++t) { timer.tic(); std::vector<float> const &Y = Tr.Y; std::vector<float> R(Tr.nr_instance), F1(Tr.nr_instance); // 记录残差和F #pragma omp parallel for schedule(static) for(uint32_t i = 0; i < Tr.nr_instance; ++i) R[i] = static_cast<float>(Y[i]/(1+exp(Y[i]*F_Tr[i]))); //计算残差，或者称为梯度降低的方向 // 利用上面的残差值，在此函数中构造一棵树 trees[t].fit(Tr, R, F1); // 分类树的生成 double Tr_loss = 0; // 用上面训练的结果更新F_Tr，并计算log_loss #pragma omp parallel for schedule(static) reduction(+: Tr_loss) for(uint32_t i = 0; i < Tr.nr_instance; ++i) { F_Tr[i] += F1[i]; Tr_loss += log(1+exp(-Y[i]*F_Tr[i])); } Tr_loss /= static_cast<double>(Tr.nr_instance); // 用上面训练的结果预测测试集，打印log_loss #pragma omp parallel for schedule(static) for(uint32_t i = 0; i < Va.nr_instance; ++i) { std::vector<float> x = construct_instance(Va, i); F_Va[i] += trees[t].predict(x.data()).second; } double Va_loss = 0; #pragma omp parallel for schedule(static) reduction(+: Va_loss) for(uint32_t i = 0; i < Va.nr_instance; ++i) Va_loss += log(1+exp(-Va.Y[i]*F_Va[i])); Va_loss /= static_cast<double>(Va.nr_instance); printf("%4d %8.1f %10.5f %10.5f\n", t, timer.toc(), Tr_loss, Va_loss); fflush(stdout); } }

CART回归树的训练代码为：

void CART::fit(Problem const &prob, std::vector<float> const &R, std::vector<float> &F1){ uint32_t const nr_field = prob.nr_field; // 特征的个数 uint32_t const nr_sparse_field = prob.nr_sparse_field; uint32_t const nr_instance = prob.nr_instance; // 样本的个数 std::vector<Location> locations(nr_instance); // 样本信息 #pragma omp parallel for schedule(static) for(uint32_t i = 0; i < nr_instance; ++i) locations[i].r = R[i]; // 记录每个样本的残差 for(uint32_t d = 0, offset = 1; d < max_depth; ++d, offset *= 2){// d:深度 uint32_t const nr_leaf = static_cast<uint32_t>(pow(2, d)); // 叶子节点的个数 std::vector<Meta> metas0(nr_leaf); // 叶子节点的信息 for(uint32_t i = 0; i < nr_instance; ++i){ Location &location = locations[i]; //第i个样本的信息 if(location.shrinked) continue; Meta &meta = metas0[location.tnode_idx-offset]; //找到对应的叶子节点 meta.s += location.r; //残差之和 ++meta.n; } std::vector<Defender> defenders(nr_leaf*nr_field); //记录每个叶节点的每一维特征 std::vector<Defender> defenders_sparse(nr_leaf*nr_sparse_field); // 针对每个叶节点 for(uint32_t f = 0; f < nr_leaf; ++f){ Meta const &meta = metas0[f]; // 叶子节点 double const ese = meta.s*meta.s/static_cast<double>(meta.n); //该叶子节点的ese for(uint32_t j = 0; j < nr_field; ++j) defenders[f*nr_field+j].ese = ese; for(uint32_t j = 0; j < nr_sparse_field; ++j) defenders_sparse[f*nr_sparse_field+j].ese = ese; } std::vector<Defender> defenders_inv = defenders; std::thread thread_f(scan, std::ref(prob), std::ref(locations), std::ref(metas0), std::ref(defenders), offset, true); std::thread thread_b(scan, std::ref(prob), std::ref(locations), std::ref(metas0), std::ref(defenders_inv), offset, false); scan_sparse(prob, locations, metas0, defenders_sparse, offset, true); thread_f.join(); thread_b.join(); // 找出最佳的ese，scan里是每一个字段的最佳ese，这里是全部字段的最佳ese，赋值给相应的tnode for(uint32_t f = 0; f < nr_leaf; ++f){ // 对于每个叶节点都找到最好的划分 Meta const &meta = metas0[f]; double best_ese = meta.s*meta.s/static_cast<double>(meta.n); TreeNode &tnode = tnodes[f+offset]; for(uint32_t j = 0; j < nr_field; ++j){ Defender defender = defenders[f*nr_field+j];//每个叶节点都对应着全部的特征 if(defender.ese > best_ese) { best_ese = defender.ese; tnode.feature = j; tnode.threshold = defender.threshold; } defender = defenders_inv[f*nr_field+j]; if(defender.ese > best_ese) { best_ese = defender.ese; tnode.feature = j; tnode.threshold = defender.threshold; } } for(uint32_t j = 0; j < nr_sparse_field; ++j) { Defender defender = defenders_sparse[f*nr_sparse_field+j]; if(defender.ese > best_ese) { best_ese = defender.ese; tnode.feature = nr_field + j; tnode.threshold = defender.threshold; } } } // 把每一个instance都分配给树里的一个叶节点下 #pragma omp parallel for schedule(static) for(uint32_t i = 0; i < nr_instance; ++i){ Location &location = locations[i]; if(location.shrinked) continue; uint32_t &tnode_idx = location.tnode_idx; TreeNode &tnode = tnodes[tnode_idx]; if(tnode.feature == -1){ location.shrinked = true; }else if(static_cast<uint32_t>(tnode.feature) < nr_field){ if(prob.Z[tnode.feature][i].v < tnode.threshold) tnode_idx = 2*tnode_idx; else tnode_idx = 2*tnode_idx+1; }else{ uint32_t const target_feature = static_cast<uint32_t>(tnode.feature-nr_field); bool is_one = false; for(uint64_t p = prob.SJP[i]; p < prob.SJP[i+1]; ++p) { if(prob.SJ[p] == target_feature) { is_one = true; break; } } if(!is_one) tnode_idx = 2*tnode_idx; else tnode_idx = 2*tnode_idx+1; } } } // 用于计算gamma std::vector<std::pair<double, double>> tmp(max_tnodes, std::make_pair(0, 0)); for(uint32_t i = 0; i < nr_instance; ++i) { float const r = locations[i].r; uint32_t const tnode_idx = locations[i].tnode_idx; tmp[tnode_idx].first += r; tmp[tnode_idx].second += fabs(r)*(1-fabs(r)); } for(uint32_t tnode_idx = 1; tnode_idx <= max_tnodes; ++tnode_idx) { double a, b; std::tie(a, b) = tmp[tnode_idx]; tnodes[tnode_idx].gamma = (b <= 1e-12)? 0 : static_cast<float>(a/b); } #pragma omp parallel for schedule(static) for(uint32_t i = 0; i < nr_instance; ++i) F1[i] = tnodes[locations[i].tnode_idx].gamma;// 从新更新F1的值 }

在参考文献A simple GBDT in Python中提供了Python实现的GBDT的版本。