In quite a few Kaggle competitions we can see that many winners like to use xgboost and get very good results, so today let's look at what xgboost actually is and how to apply it.
XGBoost :eXtreme Gradient Boosting
Project repository: https://github.com/dmlc/xgboost
It was originally developed by Tianqi Chen (http://homes.cs.washington.edu/~tqchen/) as a scalable, portable, distributed implementation of the gradient boosting (GBDT, GBRT or GBM) algorithm. It can be downloaded and used from C++, Python, R, Julia, Java, Scala and Hadoop, and it is now developed and maintained by many contributors.
The algorithm XGBoost implements is gradient boosting decision trees, which can be used for both classification and regression problems.
So what is Gradient Boosting?
Gradient boosting is one kind of boosting method.
Boosting is a way of combining weak learners f_i(x) into a strong classifier F(x).
So Boosting has three elements:
A loss function to be optimized:
For example, cross entropy for classification problems and mean squared error for regression problems.
A weak learner to make predictions:
For example, decision trees.
An additive model:
Multiple weak learners are added together to form a strong learner, so that the objective loss function is minimized.
Gradient boosting works by adding new weak learners that try to correct the residual errors of all the previous weak learners; the learners are then summed to make the final prediction, which is more accurate than any single learner alone. It is called "gradient" boosting because gradient descent is used to minimize the loss when each new model is added.
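To make this concrete, here is a minimal sketch (plain scikit-learn, not XGBoost itself) of gradient boosting for regression with squared-error loss, where each new tree is fit to the residuals of the current model; the names fit_gbm and predict_gbm are just illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    # start from a constant prediction (the mean of the targets)
    base = y.mean()
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                              # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)    # additive update, shrunk by the learning rate
        trees.append(tree)
    return base, trees

def predict_gbm(base, trees, X, learning_rate=0.1):
    # final prediction is the base value plus the sum of all shrunken tree predictions
    return base + learning_rate * sum(t.predict(X) for t in trees)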
As mentioned above, XGBoost is an implementation of gradient boosting decision trees, but gradient boosting implementations are usually slow, because each tree has to be constructed in turn and added to the model sequence.
XGBoost's distinguishing features are fast computation and strong model performance, which are exactly the two goals of the project.
Its speed comes from the way the system is designed. The figure below compares XGBoost with other implementations of gradient boosting and bagged decision trees; it is faster than the benchmark configurations in R, Python, Spark and H2O.
The other advantage is that its models perform very well on prediction problems; post-competition interviews with several Kaggle winners show how effective XGBoost is in practice.
Let's first use XGBoost for a simple binary classification problem: predicting whether a patient will develop diabetes within 5 years, using the dataset below. The first 8 columns are the input variables and the last column is the label, 0 or 1.
Data description:
https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Download the dataset and save it as a file named "pima-indians-diabetes.csv":
https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data
Import xgboost and the other packages:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Separate the input variables from the label:
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
Split the data into a training set and a test set; the model is learned on the training set and evaluated by predicting on the test set:
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
xgboost ships with ready-made classifier and regressor wrappers, so a model can be built directly with XGBClassifier.
Here is the XGBClassifier documentation:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
model = XGBClassifier()
model.fit(X_train, y_train)
Depending on the xgboost version, the predictions may come back as the probability that each sample belongs to the positive class, so round is used to convert them to 0/1 values (with recent versions of the sklearn wrapper, predict already returns class labels and the rounding is a no-op):
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
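If the class-1 probabilities themselves are needed rather than hard labels, the sklearn wrapper also provides predict_proba; a short sketch:

# probability that each test sample belongs to class 1
proba = model.predict_proba(X_test)[:, 1]
# thresholding at 0.5 gives the same 0/1 labels as the rounding above
predictions = (proba > 0.5).astype(int)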
Then compute and print the accuracy on the test set:
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
This gives Accuracy: 77.95%.
xgboost can also evaluate the model on the test set while training and report the score at each step.
We only need to change
model = XGBClassifier()
model.fit(X_train, y_train)
to:
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
Then it prints the logloss after each tree is added:
[31] validation_0-logloss:0.487867
[32] validation_0-logloss:0.487297
[33] validation_0-logloss:0.487562
and prints where early stopping kicked in:
Stopping. Best iteration:
[32] validation_0-logloss:0.487297
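The recorded metric values can also be retrieved from the fitted model and plotted, which makes it easier to see where the validation loss flattens out; a short sketch, assuming the installed xgboost version keeps the evaluation history on the model:

from matplotlib import pyplot

# evals_result() holds the metric recorded for each eval_set entry at every boosting round
results = model.evals_result()
logloss = results['validation_0']['logloss']
pyplot.plot(range(len(logloss)), logloss, label='validation logloss')
pyplot.xlabel('boosting round')
pyplot.ylabel('logloss')
pyplot.legend()
pyplot.show()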
Another advantage of gradient boosting is that it can report the feature importances of a trained model, which tells us which variables are worth keeping and which can be dropped.
We need to import these two:
from xgboost import plot_importance
from matplotlib import pyplot
Compared with the earlier code, we just add two lines after fit to plot the feature importances:
model.fit(X, y)
plot_importance(model)
pyplot.show()
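Besides the plot, the sklearn wrapper exposes the same scores as an array through feature_importances_; a minimal sketch that prints them from most to least important (features are referred to by column index here):

import numpy as np

# one importance score per input column, in the original column order
scores = model.feature_importances_
for i in np.argsort(scores)[::-1]:
    print("feature %d: %.4f" % (i, scores[i]))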
How do we tune the hyperparameters? Common practice is to start the key hyperparameters at typical values, plot learning curves, and then adjust the parameters to find the best model.
It is more convenient to tune with GridSearchCV, which we do next.
Hyperparameter combinations that can be tuned include:
Number and size of the trees (n_estimators and max_depth)
Learning rate together with the number of trees (learning_rate and n_estimators)
Row and column subsampling rates (subsample, colsample_bytree and colsample_bylevel)
Below we take the learning rate as an example.
First import these two classes:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
Set the learning rates to search over: learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
Compared with the original code, we just add these grid-search lines after defining the model:
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
At the end it reports the best learning rate, which is 0.1:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
Best: -0.483013 using {'learning_rate': 0.1}
We can also print the score for each learning rate with the code below:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

-0.689650 (0.000242) with: {'learning_rate': 0.0001}
-0.661274 (0.001954) with: {'learning_rate': 0.001}
-0.530747 (0.022961) with: {'learning_rate': 0.01}
-0.483013 (0.060755) with: {'learning_rate': 0.1}
-0.515440 (0.068974) with: {'learning_rate': 0.2}
-0.557315 (0.081738) with: {'learning_rate': 0.3}
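The same pattern extends to the other combinations listed earlier; here is a sketch for searching over the number and depth of the trees together (the specific value ranges are only illustrative, not recommendations from the original post):

# illustrative grid over tree count and depth, reusing the kfold defined above
param_grid = dict(n_estimators=[50, 100, 200], max_depth=[2, 4, 6])
grid_search = GridSearchCV(XGBClassifier(), param_grid,
                           scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))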
Finally, the complete code:
# coding=utf-8
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import plot_importance
from matplotlib import pyplot
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

# load the data and separate the input variables from the label
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:, 0:8]
Y = dataset[:, 8]

# split into training and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# grid search over the learning rate with 10-fold stratified cross-validation
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)

# train with early stopping on the held-out test set
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)

# plot_importance(model)
# pyplot.show()

# evaluate accuracy on the test set
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# best learning rate
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# print the score for each learning rate
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
Reposted from: https://www.jianshu.com/p/7e0e2d66b3d4