Introduction to Kaggle Competitions: Home Credit Default Risk (Part 5)



Selected from Kaggle

Author: Will Koehrsen

Compiled by: ronghuaiyang

Kaggle's credit default risk prediction competition is a very useful reference, especially for readers working in risk control or data mining. It is detailed, comprehensive, and well suited for beginners, covering everything from data processing to model building. Since the article is quite long, it is being published in several parts; this is the fifth part, covering the conclusions and an introduction to LightGBM.

Conclusions

In this article, we showed how to get started on a Kaggle machine learning competition. We first need to understand the data, our task, and the evaluation metric. We then perform EDA to look at relationships, trends, and outliers in the data that can guide our modeling. Along the way, we encode the categorical features, impute missing values, and scale the data to a common range. We also create new features from the existing data and check whether they help the model.
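As a quick reminder of what those preprocessing steps look like in code, here is a minimal sketch using scikit-learn. The toy dataframe below is a hypothetical stand-in, not the actual pipeline built in the earlier parts of this series:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame standing in for the application data
df = pd.DataFrame({'AMT_INCOME_TOTAL': [100000, None, 250000],
                   'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', 'Cash loans']})

# One-hot encode the categorical column
df = pd.get_dummies(df)

# Median-impute the missing values, then scale everything to [0, 1]
values = MinMaxScaler().fit_transform(
    SimpleImputer(strategy='median').fit_transform(df))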

Once the data was prepared and the feature engineering done, we implemented a baseline model, and then built more complex models to beat that baseline. We also ran experiments to measure the effect of the new variables we added.

We followed the general outline of a machine learning project:

  1. Understand the problem and the data

  2. Data cleaning and formatting

  3. Exploratory data analysis

  4. Baseline model

  5. Improved models

  6. Model interpretation

Machine learning competitions don't differ much from one problem to the next; we generally focus only on getting the best result and pay little attention to interpretability. Still, by understanding how our model makes its predictions, we can try to improve it by correcting the examples it gets wrong. Going forward, we will build more complex models, look at more of the data, and raise our score.
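A minimal sketch of that idea, using hypothetical stand-in arrays (inside the model function later in this article, the true labels and the cross-validated probabilities are the labels and out_of_fold variables):

import numpy as np

# Hypothetical stand-ins for true labels and out-of-fold probabilities
labels = np.array([0, 1, 0, 1, 0])
out_of_fold = np.array([0.1, 0.2, 0.8, 0.9, 0.3])

# Rank examples by how badly the model missed them; the worst ones are
# good candidates for closer inspection and new features
errors = np.abs(labels - out_of_fold)
worst_idx = np.argsort(errors)[::-1]
print(worst_idx[:2])  # here: examples 2 and 1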

What's Next

I will keep improving this project; here are some of the topics still to come:

  • Manual feature engineering, part one

  • Manual feature engineering, part two

  • An introduction to automated feature engineering

  • Advanced automated feature engineering

  • Feature selection

  • An introduction to model hyperparameter tuning: grid and random search

Feedback is welcome!

Just for Fun: Light Gradient Boosting Machine

Now we can try an actual machine learning model: a gradient boosting machine from LightGBM. This method is currently state of the art on structured data, especially on Kaggle. Although the code may look intimidating, it is just a sequence of steps for building the model. I added this code to show how much potential this project still has; with this method we can already get a somewhat better score. Later we will see more advanced models, more feature engineering, and feature selection.

In [55]:

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder  # used for the 'le' encoding below
import lightgbm as lgb
import gc

def model(features, test_features, encoding = 'ohe', n_folds = 5):
   
   """Train and test a light gradient boosting model using
   cross validation.
   
   Parameters
   --------
       features (pd.DataFrame):
           dataframe of training features to use
           for training a model. Must include the TARGET column.
       test_features (pd.DataFrame):
           dataframe of testing features to use
           for making predictions with the model.
       encoding (str, default = 'ohe'):
           method for encoding categorical variables. Either 'ohe' for
           one-hot encoding or 'le' for integer label encoding.
       n_folds (int, default = 5):
           number of folds to use for cross validation.
       
   Return
   --------
       submission (pd.DataFrame):
           dataframe with `SK_ID_CURR` and `TARGET` probabilities
           predicted by the model.
       feature_importances (pd.DataFrame):
           dataframe with the feature importances from the model.
       valid_metrics (pd.DataFrame):
           dataframe with training and validation metrics (ROC AUC) for each fold and overall.
       
   """
   
   # Extract the ids
   train_ids = features['SK_ID_CURR']
   test_ids = test_features['SK_ID_CURR']
   
   # Extract the labels for training
   labels = features['TARGET']
   
   # Remove the ids and target
   features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
   test_features = test_features.drop(columns = ['SK_ID_CURR'])
   
   
   # One Hot Encoding
   if encoding == 'ohe':
       features = pd.get_dummies(features)
       test_features = pd.get_dummies(test_features)
       
       # Align the dataframes by the columns
       features, test_features = features.align(test_features, join = 'inner', axis = 1)
       
       # No categorical indices to record
       cat_indices = 'auto'
   
   # Integer label encoding
   elif encoding == 'le':
       
       # Create a label encoder
       label_encoder = LabelEncoder()
       
       # List for storing categorical indices
       cat_indices = []
       
       # Iterate through each column
       for i, col in enumerate(features):
           if features[col].dtype == 'object':
               # Map the categorical features to integers
               features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
               test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

               # Record the categorical indices
               cat_indices.append(i)
   
   # Catch error if label encoding scheme is not valid
   else:
       raise ValueError("Encoding must be either 'ohe' or 'le'")
       
   print('Training Data Shape: ', features.shape)
   print('Testing Data Shape: ', test_features.shape)
   
   # Extract feature names
   feature_names = list(features.columns)
   
   # Convert to np arrays
   features = np.array(features)
   test_features = np.array(test_features)
   
   # Create the kfold object
   k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
   
   # Empty array for feature importances
   feature_importance_values = np.zeros(len(feature_names))
   
   # Empty array for test predictions
   test_predictions = np.zeros(test_features.shape[0])
   
   # Empty array for out of fold validation predictions
   out_of_fold = np.zeros(features.shape[0])
   
   # Lists for recording validation and training scores
   valid_scores = []
   train_scores = []
   
   # Iterate through each fold
   for train_indices, valid_indices in k_fold.split(features):
       
       # Training data for the fold
       train_features, train_labels = features[train_indices], labels[train_indices]
       # Validation data for the fold
       valid_features, valid_labels = features[valid_indices], labels[valid_indices]
       
       # Create the model
       model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
                                  class_weight = 'balanced', learning_rate = 0.05,
                                  reg_alpha = 0.1, reg_lambda = 0.1,
                                  subsample = 0.8, n_jobs = -1, random_state = 50)
       
       # Train the model
       model.fit(train_features, train_labels, eval_metric = 'auc',
                 eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                 eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                 early_stopping_rounds = 100, verbose = 200)
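        # Note: with LightGBM >= 4.0, early_stopping_rounds and verbose
        # moved out of fit() into callbacks, e.g.
        # callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)]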
       
       # Record the best iteration
       best_iteration = model.best_iteration_
       
       # Record the feature importances
       feature_importance_values += model.feature_importances_ / k_fold.n_splits
       
       # Make predictions
       test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
       
       # Record the out of fold predictions
       out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
       
       # Record the best score
       valid_score = model.best_score_['valid']['auc']
       train_score = model.best_score_['train']['auc']
       
       valid_scores.append(valid_score)
       train_scores.append(train_score)
       
       # Clean up memory
       gc.enable()
       del model, train_features, valid_features
       gc.collect()
       
   # Make the submission dataframe
   submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
   
   # Make the feature importance dataframe
   feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
   
   # Overall validation score
   valid_auc = roc_auc_score(labels, out_of_fold)
   
   # Add the overall scores to the metrics
   valid_scores.append(valid_auc)
   train_scores.append(np.mean(train_scores))
   
   # Needed for creating dataframe of validation scores
   fold_names = list(range(n_folds))
   fold_names.append('overall')
   
   # Dataframe of validation scores
   metrics = pd.DataFrame({'fold': fold_names,
                           'train': train_scores,
                           'valid': valid_scores})
   
   return submission, feature_importances, metrics

In [56]:

submission, fi, metrics = model(app_train, app_test)
print('Baseline metrics')
print(metrics)
Training Data Shape:  (307511, 239)
Testing Data Shape:  (48744, 239)
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.754949   train's auc: 0.79887
Early stopping, best iteration is:
[208]   valid's auc: 0.755109   train's auc: 0.80025
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.758539   train's auc: 0.798518
Early stopping, best iteration is:
[217]   valid's auc: 0.758619   train's auc: 0.801374
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.762652   train's auc: 0.79774
[400]   valid's auc: 0.762202   train's auc: 0.827288
Early stopping, best iteration is:
[320]   valid's auc: 0.763103   train's auc: 0.81638
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.757496   train's auc: 0.799107
Early stopping, best iteration is:
[183]   valid's auc: 0.75759    train's auc: 0.796125
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.758099   train's auc: 0.798268
Early stopping, best iteration is:
[227]   valid's auc: 0.758251   train's auc: 0.802746
Baseline metrics
     fold     train     valid
0        0  0.800250  0.755109
1        1  0.801374  0.758619
2        2  0.816380  0.763103
3        3  0.796125  0.757590
4        4  0.802746  0.758251
5  overall  0.803375  0.758537

In [57]:

fi_sorted = plot_feature_importances(fi)
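The plot_feature_importances helper was defined in an earlier part of this series. For readers starting here, a minimal sketch of such a helper, assuming the importances dataframe returned by model above (the original's normalization and styling may differ):

import matplotlib.pyplot as plt

def plot_feature_importances(df, n=15):
    """Plot the n most important features and return the sorted dataframe."""
    # Sort by importance and normalize so the values sum to one
    df = df.sort_values('importance', ascending=False).reset_index(drop=True)
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Horizontal bar chart, most important feature at the top
    df.head(n).plot(x='feature', y='importance_normalized',
                    kind='barh', legend=False, figsize=(8, 6))
    plt.gca().invert_yaxis()
    plt.xlabel('Normalized Importance')
    plt.title('Feature Importances')
    plt.show()
    return df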

In [58]:

submission.to_csv('baseline_lgb.csv', index = False)

This submission should score about 0.735 on the leaderboard; we will achieve higher scores later.

In [59]:

# app_train_domain / app_test_domain carry the hand-engineered domain
# features built earlier in this series
app_train_domain['TARGET'] = train_labels

# Test the domain knowledge features
submission_domain, fi_domain, metrics_domain = model(app_train_domain, app_test_domain)
print('Baseline with domain knowledge features metrics')
print(metrics_domain)
Training Data Shape:  (307511, 243)
Testing Data Shape:  (48744, 243)
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.762577   train's auc: 0.804531
Early stopping, best iteration is:
[237]   valid's auc: 0.762858   train's auc: 0.810671
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.765594   train's auc: 0.804304
Early stopping, best iteration is:
[227]   valid's auc: 0.765861   train's auc: 0.808665
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.770139   train's auc: 0.803753
[400]   valid's auc: 0.770328   train's auc: 0.834338
Early stopping, best iteration is:
[302]   valid's auc: 0.770629   train's auc: 0.820401
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.765653   train's auc: 0.804487
Early stopping, best iteration is:
[262]   valid's auc: 0.766318   train's auc: 0.815066
Training until validation scores don't improve for 100 rounds.
[200]   valid's auc: 0.764456   train's auc: 0.804527
Early stopping, best iteration is:
[235]   valid's auc: 0.764517   train's auc: 0.810422
Baseline with domain knowledge features metrics
     fold     train     valid
0        0  0.810671  0.762858
1        1  0.808665  0.765861
2        2  0.820401  0.770629
3        3  0.815066  0.766318
4        4  0.810422  0.764517
5  overall  0.813045  0.766050

In [60]:

fi_sorted = plot_feature_importances(fi_domain)

Once again we see the importance of the features we picked out earlier. Seeing this, we may start to wonder whether the domain features can also be useful with this method.

In [61]:

submission_domain.to_csv('baseline_lgb_domain_features.csv', index = False)

This time our model scores 0.754, which shows that the domain features do improve the model. Feature engineering really is a vital part of the process. (That holds for every machine learning problem!)
