An introductory tutorial on h2o, a machine learning framework for R

Date: 2019-11-09



A tutorial on h2o, a high-performance machine learning framework

This article is based on material presented at H2O Open Chicago 2016.

Translator's note:

Before using H2O you need to:

Install a Java environment (download the 64-bit JDK; otherwise you cannot control memory from R through h2o.init())
install.packages("h2o")

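h2o is similar to sklearn in Python: it provides interfaces to a range of machine learning algorithms. The reasons we need a framework like this:

It provides a uniform interface, so code is cleaner and simpler
You don't need a different data format for every model
It is fast

In R, I recommend cleaning the data with the data.table package, converting it with as.h2o into the format the h2o package accepts, and then building models with h2o.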

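The above is just my own rambling, based on my personal experience with R.

Now for the tutorial:

First load the h2o package and start an H2O cluster on your local machine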

library(h2o)
h2o.init(nthreads = -1, # -1 uses all cores on your machine
         max_mem_size = "8G")  # maximum memory H2O is allowed to use

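Next we import a loan dataset that has already been cleaned. The goal is to predict whether a loan will be repaid on time (a binary classification problem). The response variable is bad_loan, where 1 means the loan was not repaid and 0 means it was repaid.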

loan_csv <- "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
data <- h2o.importFile(loan_csv)   # data can be imported directly from a URL
dim(data) # 163,987 rows x 15 columns

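Because this is a binary classification problem, the response variable must be a factor. If the response is coded 0/1, H2O treats it as numeric, which means it will train a regression model.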

data$bad_loan <- as.factor(data$bad_loan)  # encode as a factor
h2o.levels(data$bad_loan)  # inspect the factor levels
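
The numeric-versus-factor distinction can be seen in plain R without starting H2O. The sketch below uses a made-up 0/1 vector (not values from loan.csv) and base R's as.factor; it only illustrates the type change and says nothing about how H2O stores the column internally.

```r
# Base R sketch (no H2O needed): a 0/1 vector is numeric until converted.
bad_loan_num <- c(0, 1, 0, 0, 1)          # toy values, not taken from loan.csv
is.numeric(bad_loan_num)                  # numeric, so it would read as a regression target

bad_loan_fac <- as.factor(bad_loan_num)   # now a categorical response
is.factor(bad_loan_fac)
levels(bad_loan_fac)                      # the two class labels, stored as strings
table(bad_loan_fac)                       # class counts
```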

Next I split the data into training, validation and test sets.

splits <- h2o.splitFrame(data = data, 
                         ratios = c(0.7, 0.15),  # training, validation and test sets get 70%, 15%, 15%
                         seed = 1)  #setting a seed will guarantee reproducibility
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

Let's look at the size of each part. Note that, for efficiency, h2o.splitFrame uses an approximate split rather than an exact one, so the sizes are not exactly 70%, 15% and 15%.

nrow(train)  # 114908
nrow(valid) # 24498
nrow(test)  # 24581
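
One way to see why a fast split ends up approximate is to assign each row to a part independently with a random draw. The base R sketch below does that for the same ratios. It is an illustration of the general idea only and is not claimed to be how h2o.splitFrame is implemented; the variable names are made up for this example.

```r
set.seed(1)
n <- 163987                        # row count quoted earlier in the tutorial
ratios <- c(0.7, 0.15)             # same ratios as the h2o.splitFrame call
breaks <- c(0, cumsum(ratios), 1)  # cut points 0, 0.70, 0.85, 1

u <- runif(n)                      # one uniform draw per row
part <- cut(u, breaks = breaks,
            labels = c("train", "valid", "test"),
            include.lowest = TRUE)

sizes <- table(part)
sizes                              # close to, but not exactly, 70/15/15
round(sizes / n, 4)
```

Because every row is assigned independently, no pass over the data is needed to count rows first, at the cost of part sizes that vary slightly around the requested ratios.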

Specify the response variable and the predictors

y <- "bad_loan"
x <- setdiff(names(data), c(y, "int_rate"))  
print(x)
# [1] "loan_amnt"             "term"                 
# [3] "emp_length"            "home_ownership"       
# [5] "annual_inc"            "verification_status"  
# [7] "purpose"               "addr_state"           
# [9] "dti"                   "delinq_2yrs"          
# [11] "revol_util"            "total_acc"            
# [13] "longest_credit_length"

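Data preparation is now complete (translator's note: in practice a great deal of time goes into feature engineering; since this article is about how to build models, we use the raw data directly). Next we train several models using H2O's supervised algorithms:

Generalized linear model (GLM)
Random forest (RF)
GBM (also known as GBDT)
Deep learning (DL)
Naive Bayes (NB)

1. Generalized linear model (GLM)

Let's start with a basic binomial generalized linear model. By default, h2o.glm fits a regularized elastic net model.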

glm_fit1 <- h2o.glm(x = x, 
                    y = y, 
                    training_frame = train,
                    model_id = "glm_fit1",
                    family = "binomial")  # like `glm` in R, `h2o.glm` has a family argument
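
For readers who know base R's glm, the family argument plays the same role there. The sketch below fits a binomial model to simulated data (the variable names and coefficients are invented for illustration). Note one difference: base glm is unregularized, whereas h2o.glm applies elastic net regularization by default, so the two are not expected to give identical coefficients.

```r
set.seed(1)
n <- 500
income <- rnorm(n)                             # simulated predictor
p <- plogis(-1 + 1.5 * income)                 # assumed true probability of a bad loan
bad <- factor(rbinom(n, size = 1, prob = p))   # simulated 0/1 response as a factor

fit <- glm(bad ~ income, family = binomial())  # same role as family = "binomial" in h2o.glm
coef(fit)

pred <- predict(fit, type = "response")        # fitted probabilities
range(pred)
```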

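Next we use the validation set for some automatic tuning by setting lambda_search = TRUE. Because the GLM is regularized, we need a suitable regularization strength to prevent overfitting. The lambda parameter controls the amount of regularization in the GLM. With lambda_search = TRUE, H2O tries a sequence of lambda values and keeps the one that performs best on the validation set.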

glm_fit2 <- h2o.glm(x = x, 
                    y = y, 
                    training_frame = train,
                    model_id = "glm_fit2",
                    family = "binomial",
                    validation_frame = valid,
                    lambda_search = TRUE)
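
The idea of picking lambda on a validation set can be sketched in base R. The example below is a stand-in under clearly different assumptions: it uses ridge (L2-only) linear regression with a closed-form solution on simulated data, not H2O's elastic net with a binomial family, and ridge_fit is a helper written for this sketch.

```r
set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))        # only 3 predictors matter
y <- as.numeric(X %*% beta + rnorm(n, sd = 2))

train_idx <- 1:100
valid_idx <- 101:200

# Closed-form ridge estimate for a given penalty (hypothetical helper)
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- 10^seq(2, -2, length.out = 20)  # from strong to weak penalty
valid_mse <- sapply(lambdas, function(l) {
  b <- ridge_fit(X[train_idx, ], y[train_idx], l)
  mean((y[valid_idx] - X[valid_idx, ] %*% b)^2)
})

best_lambda <- lambdas[which.min(valid_mse)]  # penalty with the lowest validation error
best_lambda
```

The model is always fitted on the training rows; the validation rows are used only to compare penalties, which is why the test set remains untouched for the final comparison.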

Let's see how the two GLM models perform on the test set:

glm_perf1 <- h2o.performance(model = glm_fit1,
                             newdata = test)
glm_perf2 <- h2o.performance(model = glm_fit2,
                             newdata = test)

If you don't want to print the whole performance object, you can extract just the metric you want

h2o.auc(glm_perf1)  
h2o.auc(glm_perf2)
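
As a reminder of what the AUC measures, it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The base R sketch below computes it from ranks on four made-up predictions; auc_rank is a helper written for this example, and H2O's own computation may differ in detail.

```r
auc_rank <- function(score, label) {
  # label: 1 = positive class, 0 = negative class
  r <- rank(score)                  # average ranks handle tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.1, 0.4, 0.35, 0.8)     # toy predicted probabilities
label <- c(0, 0, 1, 1)
auc_rank(score, label)              # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```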

Compare the test-set AUC above with the AUC on the training and validation sets

h2o.auc(glm_fit2, train = TRUE)  
h2o.auc(glm_fit2, valid = TRUE)  
glm_fit2@model$validation_metrics

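2. Random forest (RF)

H2O's random forest is a distributed implementation of the standard random forest algorithm, with variable importance measures. We first train a basic random forest with default parameters. The random forest infers the response distribution from the encoding of the response variable.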

rf_fit1 <- h2o.randomForest(x = x,
                            y = y,
                            training_frame = train,
                            model_id = "rf_fit1",
                            seed = 1) # set a seed so the results are reproducible

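Next we increase the number of trees by setting ntrees = 100; the default in H2O is 50. More trees usually improve random forest performance. Compared with GBM, a random forest is generally less prone to overfitting. In the GBM example below you will see that we need to set up early stopping to prevent overfitting.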

rf_fit2 <- h2o.randomForest(x = x,
                            y = y,
                            training_frame = train,
                            model_id = "rf_fit2",
                            #validation_frame = valid,  #only used if stopping_rounds > 0
                            ntrees = 100,
                            seed = 1)

# Compare the performance of the two RF models
rf_perf1 <- h2o.performance(model = rf_fit1,
                            newdata = test)
rf_perf2 <- h2o.performance(model = rf_fit2,
                            newdata = test)
rf_perf1
rf_perf2

# Extract the test-set AUC
h2o.auc(rf_perf1)  
h2o.auc(rf_perf2)

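Cross-validation

Sometimes we skip the validation set and use cross-validation to assess the model instead. Here we use random forest as an example of cross-validation in H2O. You don't need custom code or loops; just set the nfolds argument to the desired number of folds. Note that k-fold cross-validation trains k cross-validation models in addition to the main model, so training takes roughly k times as long.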

rf_fit3 <- h2o.randomForest(x = x,
                            y = y,
                            training_frame = train,
                            model_id = "rf_fit3",
                            seed = 1,
                            nfolds = 5)

# Evaluate the cross-validated AUC
h2o.auc(rf_fit3, xval = TRUE)
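
To make concrete what nfolds automates, the base R sketch below builds the fold assignment and the loop by hand. The random fold assignment is an assumption made for illustration (H2O offers its own fold assignment schemes), and no model is actually trained inside the loop.

```r
set.seed(1)
n <- 23; k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label for each row
table(fold)                                      # fold sizes differ by at most one

for (i in seq_len(k)) {
  holdout  <- which(fold == i)                   # rows scored in this round
  fit_rows <- which(fold != i)                   # rows used for training in this round
  # a model would be trained on fit_rows and scored on holdout here
  cat("fold", i, ": train on", length(fit_rows),
      "rows, score on", length(holdout), "rows\n")
}
```

Each row is held out exactly once, so the combined holdout predictions cover the whole training frame; that is what the cross-validated AUC summarizes.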

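3. Gradient Boosting Machine (GBM/GBDT)

H2O's GBM is a stochastic GBM, which can give a small performance improvement over the original GBM. Let's train a basic GBM model.

Unless explicitly specified with the distribution argument, the GBM model infers the response distribution from the encoding of the response variable.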

gbm_fit1 <- h2o.gbm(x = x,
                    y = y,
                    training_frame = train,
                    model_id = "gbm_fit1",
                    seed = 1)  # set a seed so the results are reproducible

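Next we increase the number of trees in the GBM by setting ntrees = 500. The default in H2O is 50, so this run will take about 10 times as long as the default. More trees usually improve performance, but be careful: that many trees may lead to overfitting. You can use early stopping to find the optimal number of trees automatically. Early stopping is discussed in a later example.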

gbm_fit2 <- h2o.gbm(x = x,
                    y = y,
                    training_frame = train,
                    model_id = "gbm_fit2",
                    #validation_frame = valid,  #only used if stopping_rounds > 0
                    ntrees = 500,
                    seed = 1)

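We again set ntrees = 500, but this time we use early stopping to prevent overfitting. All H2O algorithms offer early stopping, but it is disabled by default (except in deep learning). Three parameters, common to all algorithms, control early stopping: stopping_rounds, stopping_metric and stopping_tolerance.

stopping_metric is the evaluation metric; here we use AUC.

score_tree_interval is specific to random forest and GBM. Setting score_tree_interval = 5 scores the model after every five trees.

The parameters below tell the model to stop training once the AUC has failed to improve by more than 0.0005 over three scoring rounds.

Because we supply a validation set, the stopping_tolerance is evaluated on the validation AUC rather than the training AUC.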

gbm_fit3 <- h2o.gbm(x = x,
                    y = y,
                    training_frame = train,
                    model_id = "gbm_fit3",
                    validation_frame = valid,  #only used if stopping_rounds > 0
                    ntrees = 500,
                    score_tree_interval = 5,      #used for early stopping
                    stopping_rounds = 3,          #used for early stopping
                    stopping_metric = "AUC",      #used for early stopping
                    stopping_tolerance = 0.0005,  #used for early stopping
                    seed = 1)
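
The interplay of stopping_rounds and stopping_tolerance can be sketched in base R. The rule below is deliberately simplified and is not H2O's exact stopping criterion: it stops when the best score in the last few scoring rounds does not beat the best earlier score by more than a relative tolerance. The function should_stop and the AUC values are made up for illustration.

```r
should_stop <- function(scores, rounds = 3, tolerance = 0.0005) {
  # scores: metric after each scoring event, larger is better (e.g. AUC)
  n <- length(scores)
  if (n <= rounds) return(FALSE)                    # not enough history yet
  recent_best  <- max(scores[(n - rounds + 1):n])   # best of the last `rounds` scores
  earlier_best <- max(scores[1:(n - rounds)])       # best score before that window
  (recent_best - earlier_best) / earlier_best <= tolerance
}

auc_history <- c(0.650, 0.670, 0.680, 0.6801, 0.6802, 0.6802)  # made-up validation AUCs
stop_at <- NA
for (i in seq_along(auc_history)) {
  if (should_stop(auc_history[1:i])) { stop_at <- i; break }
}
stop_at   # 6: the last three scores gained less than 0.05% over the earlier best
```

With scoring every five trees, each entry in such a history corresponds to five more trees, so a smaller score_tree_interval lets training stop sooner at the cost of more frequent scoring.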


# Compare GBM performance
gbm_perf1 <- h2o.performance(model = gbm_fit1,
                             newdata = test)
gbm_perf2 <- h2o.performance(model = gbm_fit2,
                             newdata = test)
gbm_perf3 <- h2o.performance(model = gbm_fit3,
                             newdata = test)
gbm_perf1
gbm_perf2
gbm_perf3

# Extract the test-set AUC
h2o.auc(gbm_perf1)  
h2o.auc(gbm_perf2)  
h2o.auc(gbm_perf3)

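To inspect the scoring history, call h2o.scoreHistory() on a trained model, as shown below. Unless specified otherwise, the model is scored at varying intervals. gbm_fit2 was trained with a training set only (no validation set), so its scoring history contains training metrics only.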

h2o.scoreHistory(gbm_fit2)

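With early stopping we used only 95 trees instead of the full 500. Because gbm_fit3 was given a validation set, the scoring history of both the training and validation sets was stored. Let's look at the validation AUC to confirm that the stopping tolerance was enforced.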

h2o.scoreHistory(gbm_fit3)
# Plot this model's scoring history
plot(gbm_fit3, 
     timestep = "number_of_trees", 
     metric = "AUC")
plot(gbm_fit3, 
     timestep = "number_of_trees", 
     metric = "logloss")

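4. Deep learning (DL)

H2O's deep learning algorithm is a multi-layer feedforward artificial neural network. It can also be used to train autoencoders. In this example we train a standard supervised prediction model.

First we train a basic deep learning model with default parameters. Unless explicitly specified with the distribution argument, the DL model infers the response distribution from the encoding of the response variable. When H2O's DL algorithm runs on multiple cores, results are not reproducible, so the performance metrics below may differ from what you see on your machine. Early stopping is enabled by default in H2O's DL, so the run below applies early stopping on the training set with the default parameters.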

dl_fit1 <- h2o.deeplearning(x = x,
                            y = y,
                            training_frame = train,
                            model_id = "dl_fit1",
                            seed = 1)

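Train a DL model with a new architecture and more epochs. Below we increase the number of epochs by setting epochs = 20; the default is 10.

More epochs usually improve a deep neural network's performance, but be careful not to overfit your data. You can use early stopping to find the optimal number of epochs automatically. Unlike the other H2O algorithms, deep learning uses early stopping by default, so for comparison we first disable it by setting stopping_rounds = 0.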

dl_fit2 <- h2o.deeplearning(x = x,
                            y = y,
                            training_frame = train,
                            model_id = "dl_fit2",
                            #validation_frame = valid,  #only used if stopping_rounds > 0
                            epochs = 20,
                            hidden= c(10,10),
                            stopping_rounds = 0,  # disable early stopping
                            seed = 1)
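
To picture what hidden = c(10,10) describes, the base R sketch below runs a forward pass through a feedforward network with two hidden layers of 10 units each. It rests on several simplifying assumptions: the weights are random and untrained, the 13 predictors are treated as numeric columns (categorical predictors would widen the input layer once encoded), and a single sigmoid output stands in for the network's actual output layer.

```r
set.seed(1)
relu    <- function(z) pmax(z, 0)           # rectifier activation
sigmoid <- function(z) 1 / (1 + exp(-z))

n_in <- 13                                  # the tutorial uses 13 predictors
W1 <- matrix(rnorm(n_in * 10, sd = 0.1), n_in, 10); b1 <- rep(0, 10)
W2 <- matrix(rnorm(10 * 10, sd = 0.1), 10, 10);     b2 <- rep(0, 10)
W3 <- matrix(rnorm(10, sd = 0.1), 10, 1);           b3 <- 0

forward <- function(X) {
  h1 <- relu(sweep(X %*% W1, 2, b1, "+"))   # first hidden layer, 10 units
  h2 <- relu(sweep(h1 %*% W2, 2, b2, "+"))  # second hidden layer, 10 units
  as.numeric(sigmoid(h2 %*% W3 + b3))       # one probability per input row
}

X <- matrix(rnorm(5 * n_in), 5, n_in)       # 5 toy rows of numeric inputs
forward(X)
```

Training adjusts W1, W2 and W3 over repeated passes through the data; the epochs argument controls how many such passes are made.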

Train a DL model with early stopping.

This time we use the same parameters as dl_fit2 and add early stopping on top, using the validation set.

dl_fit3 <- h2o.deeplearning(x = x,
                            y = y,
                            training_frame = train,
                            model_id = "dl_fit3",
                            validation_frame = valid,  #in DL, early stopping is on by default
                            epochs = 20,
                            hidden = c(10,10),
                            score_interval = 1,           #used for early stopping
                            stopping_rounds = 3,          #used for early stopping
                            stopping_metric = "AUC",      #used for early stopping
                            stopping_tolerance = 0.0005,  #used for early stopping
                            seed = 1)


# Compare the three models
dl_perf1 <- h2o.performance(model = dl_fit1,
                            newdata = test)
dl_perf2 <- h2o.performance(model = dl_fit2,
                            newdata = test)
dl_perf3 <- h2o.performance(model = dl_fit3,
                            newdata = test)
dl_perf1
dl_perf2
dl_perf3
# Extract the test-set AUC (the performance objects were computed on test)
h2o.auc(dl_perf1)  
h2o.auc(dl_perf2)  
h2o.auc(dl_perf3)  
# Retrieve the scoring history
h2o.scoreHistory(dl_fit3)
# Plot the scoring history of the third DL model
plot(dl_fit3, 
     timestep = "epochs", 
     metric = "AUC")

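5. Naive Bayes (NB)

Naive Bayes (NB) usually performs worse than RF and GBM, but it remains a popular algorithm, especially for text (for example, when the input text is encoded as a "bag of words"). Naive Bayes can only be used for binary and multiclass classification, not for regression, so the response variable must be a factor, not numeric.

First we train a basic NB model with default parameters.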

nb_fit1 <- h2o.naiveBayes(x = x,
                          y = y,
                          training_frame = train,
                          model_id = "nb_fit1")

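Next we train an NB model with Laplace smoothing. The amount of Laplace smoothing is one of the few tunable parameters of naive Bayes. By default no Laplace smoothing is used.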

nb_fit2 <- h2o.naiveBayes(x = x,
                          y = y,
                          training_frame = train,
                          model_id = "nb_fit2",
                          laplace = 6)
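
What the smoothing amount changes can be shown on a tiny made-up table. The base R sketch below applies textbook additive smoothing to the conditional probabilities of a categorical predictor. Whether H2O uses exactly this formula is an assumption; cond_prob and the toy data are written for this example only.

```r
cond_prob <- function(x, y, laplace = 0) {
  # Estimate P(x = level | y = class) with additive (Laplace) smoothing
  counts <- table(y, x)                      # rows: classes, columns: levels of x
  (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))
}

purpose <- factor(c("car", "car", "house", "car", "house", "medical"))  # toy data
bad     <- factor(c(0, 0, 0, 1, 1, 0))

cond_prob(purpose, bad)                # "medical" has probability 0 for class 1
cond_prob(purpose, bad, laplace = 1)   # every level now gets a non-zero probability
```

Without smoothing, a level never seen with a class drives the whole product of probabilities for that class to zero; a positive laplace value avoids that, and larger values pull the estimates further toward uniform.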

# Compare the two NB models
nb_perf1 <- h2o.performance(model = nb_fit1,
                            newdata = test)
nb_perf2 <- h2o.performance(model = nb_fit2,
                            newdata = test)
nb_perf1
nb_perf2
# Extract the test-set AUC
h2o.auc(nb_perf1)  
h2o.auc(nb_perf2)