A Hands-On Guide to Writing a Practical XGBoost Program

Date: 2019-11-24

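Brief introduction:

This is a real competition. The task comes from Tianchi Big Data: "Precisely locating the shop a user is in inside a mall". The original dataset has 1.14 million rows, which makes it slow to compute on. To give beginners a smoother and more basic learning experience, I shrank the dataset and put it here (password: ndfd) for you to download.

In my version of the data, the columns look like this. train.csv: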

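Field	Meaning	Field	Meaning
user_id	User ID	time_stamp	Timestamp
latitude	Latitude	wifi_strong 1-10	Signal strength of the ten WiFi access points
longitude	Longitude	wifi_id 1-10	IDs of the ten WiFi access points
shop_id	Shop ID	con_sta 1-10	Connection status of the ten WiFi access points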

test.csv:

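Field	Meaning	Field	Meaning
user_id	User ID	time_stamp	Timestamp
latitude	Latitude	wifi_id 1-10	IDs of the ten WiFi access points
longitude	Longitude	con_sta 1-10	Connection status of the ten WiFi access points
row_id	Row label	wifi_strong 1-10	Signal strength of the ten WiFi access points
shop_id	Shop ID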

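The point of the task: inside a mall, because of the different floors and the limited precision of GPS, we cannot tell exactly which shop a user is in from latitude and longitude alone. Instead, we use the connection information between the phone and the 10 nearby WiFi access points to determine which shop the user is in, so that the company can push ads for the right shops based on the user's location.

Getting started

Before starting, you should of course have a basic understanding of XGBoost as a whole. If you are not familiar with the model, I suggest reading my earlier article "XGBoost".

The usual workflow is to first preprocess the data into a form the model can handle, including missing-value handling, splitting fields apart, type conversion and so on. Then we feed it to the model, and finally adjust the parameters according to the accuracy of the results, tuning repeatedly until we reach the best result.

When doing machine learning in practice, we must break one habit of thought: the belief that a program can only run once we have thought everything through ourselves. It is an interesting habit. How to explain it? Take this task. I also come from a communications background, so when I saw ten WiFi signal strengths I wanted to find the relationship between them and then write code to solve for the person's exact position. In essence, that is still thinking at the level of explicit programming, believing that a program only reaches its goal if every step is spelled out. But that is not how large-scale data processing works. Whatever data a decision tree meets, whether time or geographic position, it grows the tree by the same rules, and new data then walks down the tree to get a prediction. In other words, we do not have to spend much effort on the physical meaning of each field; we just put the fields into the model. (Tuning does require some simple thought about physical meaning in order to weight the fields, but more on that later.)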

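Analyzing the data

The meaning of each field is in the tables above: we have the user ID, latitude and longitude, timestamp, shop ID and WiFi information. A little thought tells us:

user_id has no real meaning; it is just an identifier
shop_id is what we predict. The task asks us to predict the user's shop_id from the other information, so shop_id is our training target
Latitude and longitude relate to our position, so they are useful
wifi_id tells us which router it is, and different routers sit in different places, so it is useful
wifi_strong is the signal strength, which relates to our distance from the router, so it is useful
con_sta is the connection status, i.e. whether the phone is connected. In the data almost nothing is connected, so at first I thought it was useless. Then someone more experienced pointed out that if a person's phone connects automatically to a shop's WiFi, that suggests they come often, so it helps a little in identifying the customer.
test.csv is largely the same, with an extra row_id; we just need to make sure our output lines up with it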

Preparing the Python libraries

import pandas as pd
import xgboost as xgb
from sklearn import preprocessing

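Our XGBoost program is fairly simple, so we only use the three essential libraries: pandas for data handling, xgboost, and the preprocessing module from the well-known machine learning library sklearn. pandas wraps many functions for basic data processing and is pleasant to use. For examples, follow this link to my write-up on basic ways of taking data apart with pandas.DataFrame.

Preprocessing the data first

We first need to load the data: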

train = pd.read_csv(r'D:\XGBoost_learn\mall_location\train2.csv')
tests = pd.read_csv(r'D:\XGBoost_learn\mall_location\test_pre.csv')

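We use pandas' read_csv function to read the csv files directly. csv stands for Comma-Separated Values: each value is separated from the next by a comma. The format is simple and is the usual one in data competitions. Pay attention to the path: the separator is \ on Windows and / on Linux. Backslashes in an ordinary Python string can also be read as escape sequences (\t is a tab, \n is a newline), which silently corrupts the path or raises an error. Prefixing the string with r avoids this: r means raw, so the backslashes are kept literally.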
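
To make the raw-string point concrete, here is a small sketch in plain Python. The path below is made up for illustration and is not one of the files used in this article.

```python
# Hypothetical Windows path, used only to show how escapes behave.
plain = 'D:\table\new.csv'    # \t and \n are interpreted as tab and newline
raw = r'D:\table\new.csv'     # raw string: every backslash is kept literally

print(len(plain))   # two characters shorter than the raw version
print(len(raw))
print('\t' in plain, '\t' in raw)

# Forward slashes are another way to avoid the problem on Windows.
forward = 'D:/table/new.csv'
```

The plain string is shorter because each two-character escape collapses into a single control character, which is why such a path no longer points at the intended file.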

Our time_stamp starts out as str data. The computer has no idea what it represents; it is just a string of characters. So we convert it to datetime:

train['time_stamp'] = pd.to_datetime(pd.Series(train['time_stamp']))
tests['time_stamp'] = pd.to_datetime(pd.Series(tests['time_stamp']))

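Both train and tests need this, and it shows how capable pandas is. Now look at a time_stamp value: 2017/8/6 21:20. From the dataset we can see that it has a granularity of ten minutes. That packs a lot of information into one field, and keeping it together is wasteful (really, it invites overfitting, because a node would get split very finely), so let's take it apart: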

train['Year'] = train['time_stamp'].apply(lambda x: x.year)
train['Month'] = train['time_stamp'].apply(lambda x: x.month)
train['weekday'] = train['time_stamp'].dt.dayofweek
train['time'] = train['time_stamp'].dt.time
tests['Year'] = tests['time_stamp'].apply(lambda x: x.year)
tests['Month'] = tests['time_stamp'].apply(lambda x: x.month)
tests['weekday'] = tests['time_stamp'].dt.dayofweek
tests['time'] = tests['time_stamp'].dt.time

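Careful readers may notice two styles here. One is .apply(lambda x: x.year). What does that mean? It uses an anonymous function: we want a small function but don't want to bother thinking up a name for it. apply calls the function on every element of the column and returns the results as a new Series, which we then store as a new column; here each result is x.year. If that feels unintuitive, we can also write: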

def YearApply(x):
    return x.year
   
train['Year'] = train['time_stamp'].apply(YearApply)

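The two forms are equivalent. For weekday and time we use the .dt accessor that pandas provides on datetime columns, as the code shows. weekday could also be written as train['weekday'] = train['time_stamp'].apply(lambda x: x.weekday()). Note the extra parentheses: weekday is a method that has to be called to compute the value, whereas year is an attribute. Why weekday? Because for shopping, the day of the week is more telling than the day of the month. Next we drop time_stamp, since we already have year, month and the rest: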

train = train.drop('time_stamp', axis=1)
tests = tests.drop('time_stamp', axis=1)
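
As a quick check that the apply form and the .dt form of the date features agree, here is a sketch on a toy column. The two-row DataFrame is invented for illustration; only the sample value 2017/8/6 21:20 comes from the article.

```python
import pandas as pd

# Toy data; the second timestamp is invented for illustration.
df = pd.DataFrame({'time_stamp': ['2017/8/6 21:20', '2017/8/7 09:10']})
df['time_stamp'] = pd.to_datetime(df['time_stamp'])

# Style 1: apply an anonymous function to every element.
year_apply = df['time_stamp'].apply(lambda x: x.year)
weekday_apply = df['time_stamp'].apply(lambda x: x.weekday())

# Style 2: the vectorized .dt accessor.
year_dt = df['time_stamp'].dt.year
weekday_dt = df['time_stamp'].dt.dayofweek

print((year_apply == year_dt).all())        # both styles agree
print((weekday_apply == weekday_dt).all())
print(weekday_dt.tolist())                  # Monday is 0, Sunday is 6
```

On large frames the .dt form is usually the faster of the two, because it avoids calling a Python function once per row.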

Then drop the missing values, or fill them in.

train = train.dropna(axis=0)
tests = tests.fillna(method='pad')

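You can see I handled the training and test sets differently. The training set is fairly large and the share of missing values is small, so rows with missing values are simply dropped with dropna. tests is the test set and must not lose a single row, however few values are missing, so we fill them in by some method. Here each gap is filled with the previous non-NaN value (method='pad'). There are other options, such as filling with the most frequent value of the column: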

import numpy as np
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):
    def fit(self, X, y=None):
        # One fill value per column: the most frequent value for object
        # columns, the median for numeric columns
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].median() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        # Fill each column's gaps with that column's fill value
        return X.fillna(self.fill)
       
train = DataFrameImputer().fit_transform(train)

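This code is a little convoluted. For each column c in X: if the dtype of X[c] is object ('O' stands for object), the fill value is X[c].value_counts().index[0], the most frequent value in the column; otherwise it is X[c].median(), the median of the column. Missing values are then filled with these values.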
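
The three strategies mentioned so far (dropping rows, padding from the previous row, and filling with the most frequent value or the median) can be compared side by side on a tiny frame. This is only a sketch: the column names mirror the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one string column (values invented).
df = pd.DataFrame({
    'wifi_strong1': [-60.0, np.nan, -75.0, -80.0],
    'wifi_id1': ['b_1', 'b_2', None, 'b_2'],
})

print(df.isna().sum())            # missing values per column

dropped = df.dropna(axis=0)       # keep only complete rows
padded = df.ffill()               # same idea as fillna(method='pad')

# Per-column fill: most frequent value for strings, median for numbers.
fill = pd.Series({
    'wifi_strong1': df['wifi_strong1'].median(),
    'wifi_id1': df['wifi_id1'].value_counts().index[0],
})
imputed = df.fillna(fill)
print(imputed)
```

Note that padding depends on row order, while the median and most-frequent fills do not, which may matter if the rows are shuffled before filling.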

At this point we can print the data to see what it looks like.

print(train.info())

<class 'pandas.core.frame.DataFrame' at 0x0000024527C50D08>
Int64Index: 467 entries, 0 to 499
Data columns (total 38 columns):
user_id          467 non-null object
shop_id          467 non-null object
longitude        467 non-null float64
latitude         467 non-null float64
wifi_id1         467 non-null object
wifi_strong1     467 non-null int64
con_sta1         467 non-null bool
wifi_id2         467 non-null object
wifi_strong2     467 non-null int64
con_sta2         467 non-null object
wifi_id3         467 non-null object
wifi_strong3     467 non-null float64
con_sta3         467 non-null object
wifi_id4         467 non-null object
wifi_strong4     467 non-null float64
con_sta4         467 non-null object
wifi_id5         467 non-null object
wifi_strong5     467 non-null float64
con_sta5         467 non-null object
wifi_id6         467 non-null object
wifi_strong6     467 non-null float64
con_sta6         467 non-null object
wifi_id7         467 non-null object
wifi_strong7     467 non-null float64
con_sta7         467 non-null object
wifi_id8         467 non-null object
wifi_strong8     467 non-null float64
con_sta8         467 non-null object
wifi_id9         467 non-null object
wifi_strong9     467 non-null float64
con_sta9         467 non-null object
wifi_id10        467 non-null object
wifi_strong10    467 non-null float64
con_sta10        467 non-null object
Year             467 non-null int64
Month            467 non-null int64
weekday          467 non-null int64
time             467 non-null object
dtypes: bool(1), float64(10), int64(5), object(22)
memory usage: 139.1+ KB
None

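This shows the structure of our data clearly: how many columns there are, how many values each column holds, and so on. Whether there are nulls can be judged from the counts. If we put this print(train.info()) before the missing-value handling, we get: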

<class 'pandas.core.frame.DataFrame' at 0x000001ECFA6D6718>
RangeIndex: 500 entries, 0 to 499

There are 500 rows here, and only 467 remain after processing, so quite a few were dropped. We can print the same information for tests:

<class 'pandas.core.frame.DataFrame' at 0x0000019E13A96F48>
RangeIndex: 500 entries, 0 to 499

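All 500 rows are still there; the gaps were filled. I only pasted the header of the output here, because the full output is long. Notice that the data has four types: bool, float, int and object. XGBoost builds regression trees and can only handle numeric data, so we have to convert. How do we deal with the string columns? We use LabelEncoder: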

for f in train.columns:
    if train[f].dtype=='object':
        if f != 'shop_id':
            print(f)
            lbl = preprocessing.LabelEncoder()
            train[f] = lbl.fit_transform(list(train[f].values))
for f in tests.columns:
    if tests[f].dtype == 'object':
        print(f)
        lbl = preprocessing.LabelEncoder()
        tests[f] = lbl.fit_transform(list(tests[f].values))

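This code calls LabelEncoder from sklearn's preprocessing module to label-encode the data, that is, to map each distinct value to an integer so that the column becomes numeric. (Other tools in preprocessing normalize data to speed up training and so on; LabelEncoder only does the mapping.) Look at the code: lbl = preprocessing.LabelEncoder() creates an encoder object and gives it the short name lbl, which keeps the next line short; nothing is encoded yet. The second statement, lbl.fit_transform(list(train[f].values)), learns the distinct values of the column and encodes every value in it. Printing train[f].values before and after shows the effect: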

print(train[f].values)
train[f] = lbl.fit_transform(list(train[f].values))
print(train[f].values)

In my own run I also printed a line of 0s and slashes between the outputs as a separator. We get:

user_id
['u_376' 'u_376' 'u_1041' 'u_1158' 'u_1654' 'u_2733' 'u_2848' 'u_3063'
 'u_3063' 'u_3063' 'u_3604' 'u_4250' 'u_4508' 'u_5026' 'u_5488' 'u_5488'
 'u_5602' 'u_5602' 'u_5602' 'u_5870' 'u_6429' 'u_6429' 'u_6870' 'u_6910'
 'u_7037' 'u_7079' 'u_7869' 'u_8045' 'u_8209']
[ 7  7  0  1  2  3  4  5  5  5  6  8  9 10 11 11 12 12 12 13 14 14 15 16 17
 18 19 20 21]

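We can see that LabelEncoder converted our str data into numbers according to its own rule: the distinct values are sorted as strings and numbered from 0, which is why 'u_376' becomes 7 here. For the train data, you can see that I deliberately skipped shop_id. The reason is that shop_id is what we submit, so it must not be encoded and has to stay a str.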
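
One thing to watch with the loops above: train and tests each get their own freshly fitted LabelEncoder, so the same string can end up with different integers in the two sets. A minimal sketch of one possible remedy, fitting a single encoder on the values of both sets, is shown here. The toy IDs are invented, and this is a suggestion rather than what the original code does.

```python
from sklearn import preprocessing

train_ids = ['u_376', 'u_1041', 'u_376']     # invented toy values
test_ids = ['u_1041', 'u_9000']

# Separate encoders: two different users receive the same code.
enc_train = preprocessing.LabelEncoder().fit(train_ids)
enc_test = preprocessing.LabelEncoder().fit(test_ids)
print(enc_train.transform(['u_376']), enc_test.transform(['u_9000']))

# Shared encoder: fit once on the union, then transform each set.
shared = preprocessing.LabelEncoder().fit(train_ids + test_ids)
print(list(shared.classes_))       # classes are stored in sorted order
print(shared.transform(train_ids), shared.transform(test_ids))
```

With the shared encoder a given wifi_id or user_id means the same number at training and prediction time, which is what the trees implicitly assume.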

Next we convert train and tests to matrix form, which is convenient for XGBoost:

feature_columns_to_use = ['Year', 'Month', 'weekday',
'time', 'longitude', 'latitude',
'wifi_id1', 'wifi_strong1', 'con_sta1',
 'wifi_id2', 'wifi_strong2', 'con_sta2',
'wifi_id3', 'wifi_strong3', 'con_sta3',
'wifi_id4', 'wifi_strong4', 'con_sta4',
'wifi_id5', 'wifi_strong5', 'con_sta5',
'wifi_id6', 'wifi_strong6', 'con_sta6',
'wifi_id7', 'wifi_strong7', 'con_sta7',
'wifi_id8', 'wifi_strong8', 'con_sta8',
'wifi_id9', 'wifi_strong9', 'con_sta9',
'wifi_id10', 'wifi_strong10', 'con_sta10',]
train_for_matrix = train[feature_columns_to_use]
test_for_matrix = tests[feature_columns_to_use]
train_X = train_for_matrix.to_numpy()  # as_matrix() was removed in pandas 1.0
test_X = test_for_matrix.to_numpy()
train_y = train['shop_id']

The target to train on is our shop_id, so train_y is shop_id.
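
The long feature list above follows a regular pattern, so it can also be generated instead of typed out. A small sketch in plain Python; the variable names base, wifi and feature_columns are my own and do not appear in the original code.

```python
# Rebuild the 36 feature names instead of typing them out by hand.
base = ['Year', 'Month', 'weekday', 'time', 'longitude', 'latitude']
wifi = [f'{prefix}{i}'
        for i in range(1, 11)
        for prefix in ('wifi_id', 'wifi_strong', 'con_sta')]
feature_columns = base + wifi

print(len(feature_columns))
print(feature_columns[6:9])
```

The resulting list has the same order as the hand-written feature_columns_to_use, so it could be dropped in as a replacement.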

Feeding the model and growing the decision trees

gbm = xgb.XGBClassifier(silent=1, max_depth=10, n_estimators=1000, learning_rate=0.05)
gbm.fit(train_X, train_y)

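These two lines could be merged into one. All we do is set the parameters in XGBClassifier. I list all its parameters and their default values here; the content comes from the XGBoost source code:

max_depth=3: the maximum depth of a tree; the default is three levels. The larger max_depth is, the more specific and local the patterns the model learns.

learning_rate=0.1: the learning rate, the coefficient each tree's contribution is multiplied by in gradient boosting. The smaller it is, the slower the loss goes down, but the more precise the descent.

n_estimators=100: the maximum number of boosting iterations, i.e. the maximum number of weak learners. If n_estimators is too small the model tends to underfit; if it is too large the computation becomes heavy, and past a certain point increasing it brings very little improvement, so a moderate value is usually chosen. The default is 100.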
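
The interplay between learning_rate and n_estimators can be written down compactly. The notation below is the standard gradient-boosting form and is my own choice of symbols, not taken from the XGBoost source: $F_m$ is the model after $m$ trees, $f_m$ the $m$-th tree, $\eta$ the learning rate and $M$ the number of trees.

```latex
F_m(x) = F_{m-1}(x) + \eta \, f_m(x), \qquad m = 1, \dots, M
```

With a smaller $\eta$ each tree moves the prediction less, so more trees (a larger n_estimators) are usually needed to reach the same fit. That is why the call above pairs learning_rate=0.05 with n_estimators=1000.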

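silent=True: whether to suppress the messages printed while the xgboost trees are trained. True means run silently, without printing the tree-building information. (Newer XGBoost versions replace this parameter with verbosity.)

objective="binary:logistic": defines the loss function to be minimized. The most common values are:

binary:logistic: logistic regression for binary classification; returns the predicted probability (not the class).

multi:softmax: multiclass classification with softmax; returns the predicted class (not the probability). In this case you also need to set one more parameter: num_class (the number of classes).

multi:softprob: same as multi:softmax, but returns the probability of each sample belonging to each class.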
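
The difference between multi:softprob and multi:softmax is easy to see with a hand-rolled softmax. This is a conceptual sketch in plain Python, not XGBoost's internal code; the scores and shop names are invented.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw scores for three hypothetical shops.
shops = ['s_1', 's_2', 's_3']
probs = softmax([2.0, 0.5, -1.0])

print(probs)                                  # what multi:softprob reports
print(shops[probs.index(max(probs))])         # what multi:softmax reports
```

In other words, softmax-style output is just the most probable class picked from the softprob-style probabilities.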

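nthread=-1: thread control. Set it to the number of threads you want according to your CPU cores. If you want to use all cores, leave it unset and the algorithm detects them automatically.

gamma=0: when a node is split, the split is only made if it lowers the loss function. gamma specifies the minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm. The right value depends heavily on the loss function, so it needs tuning.

min_child_weight=1: the minimum sum of sample weights required in a leaf node. It is similar to the minimum-samples-per-leaf parameter in GBM, but not identical: XGBoost's parameter is a minimum sum of sample weights, whereas GBM's is a minimum number of samples. It is used to avoid overfitting: a larger value keeps the model from learning patterns specific to a few local samples, but too high a value leads to underfitting. Tune it with CV.

max_delta_step=0: limits the maximum step by which each tree's weight estimate may change. 0 means no limit; a positive value makes the updates more conservative. It is usually not needed, but it can help logistic regression when the classes are extremely imbalanced.

subsample=1: same as the subsample parameter in GBM. It controls the fraction of rows randomly sampled for each tree. Lowering it makes the algorithm more conservative and helps avoid overfitting, but too small a value may cause underfitting. Typical values: 0.5-1

colsample_bytree=1: controls the fraction of columns randomly sampled for each tree (each column is a feature). Typical values: 0.5-1

colsample_bylevel=1: controls the fraction of columns sampled for each split at each level of the tree. In practice subsample and colsample_bytree can do a similar job.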
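
To picture what subsample and colsample_bytree do, here is a conceptual illustration using only the standard library. It is not XGBoost's actual sampling code, and the row count and column subset are invented.

```python
import random

random.seed(0)   # fixed seed so the sketch is repeatable

n_rows, columns = 10, ['longitude', 'latitude', 'wifi_strong1', 'wifi_strong2']
subsample, colsample_bytree = 0.5, 0.5

for tree in range(3):
    # Each tree sees a random half of the rows and a random half of the columns.
    rows = sorted(random.sample(range(n_rows), int(n_rows * subsample)))
    cols = random.sample(columns, int(len(columns) * colsample_bytree))
    print(tree, rows, cols)
```

Because every tree is grown on a different slice of the data, no single tree can memorize the whole training set, which is where the protection against overfitting comes from.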

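reg_alpha=0: L1 regularization term on the weights (similar to Lasso regression). It can be used with very high-dimensional data to make the algorithm faster.

reg_lambda=1: L2 regularization term on the weights. It controls the regularization part of XGBoost; the larger it is, the more heavily complex trees are penalized.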
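
For readers who want to see where gamma and reg_lambda enter, the split gain from the XGBoost paper (Chen and Guestrin, 2016) can be written as below. Here $G_L, G_R$ are the sums of first-order gradients of the loss over the samples in the left and right child, and $H_L, H_R$ the corresponding sums of second-order gradients; these symbols are not used elsewhere in this article.

```latex
\text{Gain} = \frac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```

A split is kept only when the gain is positive, so a larger $\gamma$ (gamma) demands a bigger loss reduction, and a larger $\lambda$ (reg_lambda) shrinks every term in the bracket. reg_alpha, the L1 term, is left out of this simplified form.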

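scale_pos_weight=1: when the classes are highly imbalanced, setting this to a positive value can help the algorithm converge faster.

base_score=0.5: the initial prediction score of all instances, a global bias. Given enough iterations, changing this value has little effect.

seed=0: the random seed. Setting it makes results that involve randomness reproducible; it can also be used when tuning parameters.

Running data through the trees to get predictions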

predictions = gbm.predict(test_X)

We pass the data in tests through the trained model to get the predictions.

submission = pd.DataFrame({'row_id': tests['row_id'],
                            'shop_id': predictions})
print(submission)
submission.to_csv("submission.csv", index=False)

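This writes the predictions to a csv file. Note the output format: row_id first, shop_id second. index=False means the row labels are not written; change it to True and the row label of every row is written as well.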

Appendix

References

Machine Learning Series (12): Complete Guide to XGBoost Parameter Tuning (with Python code): http://blog.csdn.net/han_xiaoyang/article/details/52665396
Kaggle competition: Titanic: https://www.kaggle.com/c/titanic

Full code

import pandas as pd
import xgboost as xgb
from sklearn import preprocessing


train = pd.read_csv(r'D:\mall_location\train.csv')
tests = pd.read_csv(r'D:\mall_location\test.csv')

train['time_stamp'] = pd.to_datetime(pd.Series(train['time_stamp']))
tests['time_stamp'] = pd.to_datetime(pd.Series(tests['time_stamp']))

print(train.info())

train['Year'] = train['time_stamp'].apply(lambda x:x.year)
train['Month'] = train['time_stamp'].apply(lambda x: x.month)
train['weekday'] = train['time_stamp'].apply(lambda x: x.weekday())
train['time'] = train['time_stamp'].dt.time
tests['Year'] = tests['time_stamp'].apply(lambda x: x.year)
tests['Month'] = tests['time_stamp'].apply(lambda x: x.month)
tests['weekday'] = tests['time_stamp'].dt.dayofweek
tests['time'] = tests['time_stamp'].dt.time
train = train.drop('time_stamp', axis=1)
train = train.dropna(axis=0)
tests = tests.drop('time_stamp', axis=1)
tests = tests.fillna(method='pad')
for f in train.columns:
    if train[f].dtype=='object':
        if f != 'shop_id':
            print(f)
            lbl = preprocessing.LabelEncoder()
            train[f] = lbl.fit_transform(list(train[f].values))
for f in tests.columns:
    if tests[f].dtype == 'object':
        print(f)
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(tests[f].values))
        tests[f] = lbl.transform(list(tests[f].values))


feature_columns_to_use = ['Year', 'Month', 'weekday',
'time', 'longitude', 'latitude',
'wifi_id1', 'wifi_strong1', 'con_sta1',
 'wifi_id2', 'wifi_strong2', 'con_sta2',
'wifi_id3', 'wifi_strong3', 'con_sta3',
'wifi_id4', 'wifi_strong4', 'con_sta4',
'wifi_id5', 'wifi_strong5', 'con_sta5',
'wifi_id6', 'wifi_strong6', 'con_sta6',
'wifi_id7', 'wifi_strong7', 'con_sta7',
'wifi_id8', 'wifi_strong8', 'con_sta8',
'wifi_id9', 'wifi_strong9', 'con_sta9',
'wifi_id10', 'wifi_strong10', 'con_sta10',]

big_train = train[feature_columns_to_use]
big_test = tests[feature_columns_to_use]
train_X = big_train.to_numpy()  # as_matrix() was removed in pandas 1.0
test_X = big_test.to_numpy()
train_y = train['shop_id']

gbm = xgb.XGBClassifier(silent=1, max_depth=10,
                    n_estimators=1000, learning_rate=0.05)
gbm.fit(train_X, train_y)
predictions = gbm.predict(test_X)

submission = pd.DataFrame({'row_id': tests['row_id'],
                            'shop_id': predictions})
print(submission)
submission.to_csv("submission.csv",index=False)