Spark XGBoost & LightGBM parameter explanations

1. Spark XGBoost model

1 XGBoost default parameters:

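XGB parameter reference: https://blog.csdn.net/yyy430/article/details/85179638 . That link is a fairly complete summary, but it describes the Python version of XGBoost; the default parameters of the Spark version differ from it.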

1.1 The default parameters are as follows:

/* Default parameters
   eta -> 0.3,
   gamma -> 0,
   maxDepth -> 6,
   minChildWeight -> 1,
   maxDeltaStep -> 0,
   growPolicy -> "depthwise",
   maxBins -> 16, // the Python default is 256
   subsample -> 1,
   colsampleBytree -> 1,
   colsampleBylevel -> 1,
   lambda -> 1,
   alpha -> 0,
   treeMethod -> "auto",
   sketchEps -> 0.03,
   scalePosWeight -> 1.0,
   sampleType -> "uniform",
   normalizeType -> "tree",
   rateDrop -> 0.0,
   skipDrop -> 0.0,
   lambdaBias -> 0,
   treeLimit -> 0
*/
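Because the Spark defaults differ from the Python ones (maxBins is 16 rather than 256, for example), it can help to state the values you rely on explicitly instead of trusting the defaults. The following is a minimal sketch in plain Scala, not part of the XGBoost API: the object name `XgbDefaults` and the helper `withOverrides` are hypothetical, and the map holds only a subset of the defaults listed above.

```scala
object XgbDefaults {
  // A subset of the Spark-side defaults, copied from the list above.
  val sparkDefaults: Map[String, Any] = Map(
    "eta"        -> 0.3,
    "gamma"      -> 0.0,
    "maxDepth"   -> 6,
    "maxBins"    -> 16,      // the Python default is 256
    "subsample"  -> 1.0,
    "treeMethod" -> "auto"
  )

  // Hypothetical helper: explicit settings take precedence over the defaults.
  def withOverrides(overrides: Map[String, Any]): Map[String, Any] =
    sparkDefaults ++ overrides
}
```

For instance, `XgbDefaults.withOverrides(Map("maxBins" -> 256))` keeps every listed default except maxBins, which is raised to the Python value.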

2. Parameters of the Spark XGB model

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val paraMap = List(
                  // Parameter explanations: https://blog.csdn.net/Leo_Sheng/article/details/80852328
                  "eta" -> 0.3f // learning rate
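                  ,"gamma" -> 0.1 // minimum loss reduction required to split a node further (controls pruning); the larger the value, the more conservative the model. Typically around 0.1 or 0.2.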
                  ,"max_depth" -> max_depth
                  ,"num_round"->max_iter
                  ,"objective" -> "binary:logistic"
                  ,"eval_metric"->"auc"
                  ,"growPolicy"->"depthwise"
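                  /* Default: depthwise. Controls how new nodes are added to the tree.
                     Only supported when tree_method=hist. Options: depthwise, lossguide.
                     depthwise: split at the nodes closest to the root.
                     lossguide: split at the nodes with the largest change in loss.
                     (LightGBM's leaf-wise growth always splits the leaf that currently reduces the loss
                     the most, so the overall model loss drops faster; that is the same idea as lossguide.) */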
                  ,"maximize_evaluation_metrics"->true
                  ,"subsample"->1
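                  ,"sample_type"->"uniform" // Type of sampling algorithm (used by the dart booster). "uniform": dropped trees are selected uniformly. "weighted": dropped trees are selected in proportion to their weight.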
                  ,"alpha"->1
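                  ,"lambda"->1.5 // L2 regularization term on the weights, which controls model complexity; the larger the value, the less likely the model is to overfit.
                  ,"colsample_bytree" -> 0.8 // column subsampling ratio used when building each tree
                  //,"silent" -> 1 // suppress output; replaced by verbosity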
                  ,"verbosity" -> 3//0 (silent), 1 (warning), 2 (info), 3 (debug).
                  ,"nthread" -> 4 // number of threads XGBoost uses at run time. The default is 1
                  ,"max_bin"->max_bin // number of bins; must be used together with tree_method = 'hist', because max_bin only takes effect when tree_method is hist
                  ,"missing"->0.0f
                  //,"checkpoint_interval"->5
                  ,"num_early_stopping_rounds"->10
                  ,"num_workers" -> 8 // the default is 1
                  ,"eval_sets"->Map("valid_data"->dfValid)
            ).toMap
            //https://xgboost.readthedocs.io/en/latest/parameter.html
            val booster= new XGBoostClassifier(paraMap)
                .setFeaturesCol("features")
                .setLabelCol("label")
                .setSeed(42)
            val xgbModel = booster.fit(dfTrain)
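The comments above say that max_bin and growPolicy only take effect when tree_method is hist, yet the parameter list never sets tree_method. One possible way to guard against that is sketched below in plain Scala. The object `HistParams` and the helper `ensureHist` are hypothetical names, not part of XGBoost, and the set of hist-only keys simply follows the comments in this post.

```scala
object HistParams {
  // Keys that, according to the comments above, only matter with tree_method = "hist".
  private val histOnlyKeys = Set("max_bin", "growPolicy", "grow_policy")

  // Hypothetical helper: add tree_method -> "hist" when a hist-only key is present
  // and the caller has not chosen a tree_method explicitly.
  def ensureHist(params: Map[String, Any]): Map[String, Any] =
    if (params.keySet.exists(histOnlyKeys) && !params.contains("tree_method"))
      params + ("tree_method" -> "hist")
    else
      params
}
```

Under these assumptions the classifier could be built with `new XGBoostClassifier(HistParams.ensureHist(paraMap))`. An explicitly chosen tree_method is left untouched, so the helper never silently overrides a deliberate setting.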

LightGBM parameter explanations

val booster= new LightGBMClassifier()
                .setNumIterations(max_iter)
                .setMaxDepth(max_depth)
                .setLearningRate(0.05)
                .setBaggingFraction(0.7)
                .setFeatureFraction(0.8)
                .setMaxBin(max_bin)
                //.setNumLeaves(10)
                //.setValidationIndicatorCol() //Indicates whether the row is for training or validation
                .setObjective("binary")
                //.setIsUnbalance(true)
                //.setUseBarrierExecutionMode(true) //Use new barrier execution mode in Beta testing, off by default.
                // Parameter explanations: https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md
                //.setCategoricalSlotIndexes("1e3r")
                //.setCategoricalSlotNames()
                // .setEarlyStoppingRound(10)
                //.setGenerateMissingLabels(true)
                //.setIsProvideTrainingMetric(true)
                //.setThresholds(Array(threshold)) // set this parameter for multiclass classification
                .setVerbosity(3) //Verbosity where lt 0 is Fatal, eq 0 is Error, eq 1 is Info, gt 1 is Debug
                .setLambdaL2(1.5)
                .setLambdaL1(1.0)
                .setLabelCol("label")
                .setFeaturesCol("features")
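Since LightGBM grows trees leaf-wise, the leaf count (the commented-out setNumLeaves above) interacts with setMaxDepth: a binary tree of depth d has at most 2^d leaves, so a larger leaf count cannot be reached under that depth limit. The sketch below, in plain Scala, computes that bound. The object `LeafBudget` and both helpers are hypothetical names, not part of LightGBM or mmlspark, and the sketch assumes a positive depth limit (it does not model LightGBM's "no limit" setting).

```scala
object LeafBudget {
  // A binary tree of depth d has at most 2^d leaves.
  // Only positive depths are handled here; values <= 0 are rejected.
  def maxLeavesForDepth(maxDepth: Int): Int = {
    require(maxDepth > 0 && maxDepth < 31, "maxDepth must be in 1..30")
    1 << maxDepth
  }

  // Hypothetical helper: clamp a requested leaf count to what the depth allows.
  def clampNumLeaves(requested: Int, maxDepth: Int): Int =
    math.min(requested, maxLeavesForDepth(maxDepth))
}
```

With a depth limit of 3, for example, a requested leaf count of 10 would be clamped to 8, while the same request under a depth limit of 6 stays at 10.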