Competition page: https://www.kaggle.com/c/titanic
Titanic is probably the most popular competition on Kaggle, with more than 7,000 teams taking part, and countless write-ups about it have appeared over the years. It is thanks to these generous predecessors that I could put this post together painlessly.
In fact, many of the kernels on Kaggle focus on one specific aspect (extracting some obscure feature, applying a very sophisticated algorithm, doing pure EDA and plotting, and so on). Since most of those authors are experts, they naturally enjoy digging into clever tricks. At my current stage I care more about the overall workflow, so this post offers an end-to-end solution.
For context, here is the official competition description:
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
The task is to build a model from the passenger data and survival outcomes in the training set, and then use that model to predict whether each passenger in the test set survived. There are 11 passenger features in total, listed below. You can of course also create new features of your own, which is what feature engineering is all about.
Overall, the Titanic data is quite small compared with other competitions: the training and test sets together contain 891 + 418 = 1309 passengers. With so little data it is easy to overfit, so for algorithms such as gradient boosting trees the number of trees cannot be too large, and some care is needed when tuning parameters.
Below I first list the outline and then go through a few key points:
full.isnull().sum()
First, the missing data. The output above shows that Age, Cabin, Embarked, and Fare contain missing values (Survived is the target to predict). Embarked and Fare have only a few missing values, so they can be imputed directly with the mode and the median, respectively.
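A minimal sketch of that imputation, assuming the combined train/test DataFrame is called full as in the rest of the post:

# fill the few missing Embarked values with the most common port,
# and the missing Fare with the median fare
full['Embarked'] = full['Embarked'].fillna(full['Embarked'].mode()[0])
full['Fare'] = full['Fare'].fillna(full['Fare'].median())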
Cabin has many more missing values. One option is to compare the survival of passengers who have Cabin data against those who do not.
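For that comparison, Cabin presumably first needs to be recoded as a has/does-not-have flag. A minimal sketch (the exact encoding in the original notebook may differ):

# recode Cabin as a yes/no flag for whether any cabin information exists
full['Cabin'] = full['Cabin'].notnull().map({True: 'yes', False: 'no'})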
pd.pivot_table(full, index=['Cabin'], values=['Survived']).plot.bar(figsize=(8, 5))
plt.title('Survival Rate')
The figure above shows that passengers with Cabin data survived at a much higher rate than those without, so we can use the presence or absence of Cabin data as a feature.
Age has 263 missing values. Some people online suggest predicting the missing ages with a regression model built on the other variables. I split the training set in two to test this and the results were not good, probably because Age is not strongly correlated with the other variables, which the correlation plot below also shows.
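That plot is not reproduced here; a sketch of how a similar correlation heatmap could be drawn with seaborn (my own sketch, not the notebook's code):

import seaborn as sns
import matplotlib.pyplot as plt

# heatmap of pairwise correlations between the numeric columns
corr = full.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation matrix')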
So the approach taken here is to first extract 'Title' from 'Name', and then impute 'Age' with the median age of each 'Title':
full['Title'] = full['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
full.Title.value_counts()
Among the titles, 'Master' essentially means little boy, but there is no title for little girl. Since children tend to have a higher survival rate, it is worth identifying the little girls before filling in the missing ages.
Assume first that little girls are all unmarried (which generally holds), so every little girl must be among the 'Miss' passengers. The oldest little boy (Master) is 14, so we can correspondingly treat any Miss aged 14 or younger as a little girl. For a Miss with a missing age, we can use (Parch != 0) to decide whether she is a little girl, because a little girl usually travels with her parents rather than alone.
The code below creates the 'Girl' title and fills the missing ages with the median age of each title.
def girl(aa):
    # missing ages were previously filled with the sentinel value 999
    if (aa.Age != 999) & (aa.Title == 'Miss') & (aa.Age <= 14):
        return 'Girl'
    elif (aa.Age == 999) & (aa.Title == 'Miss') & (aa.Parch != 0):
        return 'Girl'
    else:
        return aa.Title

full['Title'] = full.apply(girl, axis=1)

# fill the missing ages with the median known age of each title
# (excluding the 999 placeholders from the median computation)
Tit = ['Mr', 'Miss', 'Mrs', 'Master', 'Girl', 'Rareman', 'Rarewoman']
for i in Tit:
    full.loc[(full.Age == 999) & (full.Title == i), 'Age'] = \
        full.loc[(full.Age != 999) & (full.Title == i), 'Age'].median()
At this point there are no missing values left in the data.
It is well known that women on the Titanic survived at a far higher rate than men, as the figure below shows:
pd.crosstab(full.Sex, full.Survived).plot.bar(stacked=True, figsize=(8, 5), color=['#4169E1', '#FF00FF'])
plt.xticks(rotation=0, size='large')
plt.legend(bbox_to_anchor=(0.55, 0.9))
The next figure shows the relationship between age and the number of survivors; children under 5 clearly had a very high survival rate.
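No code is shown for that figure; a rough sketch of a related view, the survival rate per 5-year age bin (the bin edges are my own choice, not the notebook's):

# survival rate per 5-year age group, using only rows where Survived is known
known = full[full.Survived.notnull()]
age_bins = pd.cut(known.Age, bins=range(0, 85, 5))
known.groupby(age_bins)['Survived'].mean().plot.bar(figsize=(10, 5))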
Cabin class (Pclass) is naturally also strongly related to survival: the figure below shows that first class fared best and third class worst.
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
Sex1 = ['male', 'female']
for i, ax in zip(Sex1, axes):
    for j, pp in zip(range(1, 4), ax):
        PclassSex = full[(full.Sex == i) & (full.Pclass == j)]['Survived'].value_counts().sort_index(ascending=False)
        pp.bar(range(len(PclassSex)), PclassSex, label=(i, 'Class' + str(j)))
        pp.set_xticks((0, 1))
        pp.set_xticklabels(('Survived', 'Dead'))
        pp.legend(bbox_to_anchor=(0.6, 1.1))
I combined the three variables 'Title', 'Pclass', and 'Parch' into one figure, sorted by mean survival rate in descending order, and then used the 80% and 50% survival-rate lines to split the groups into levels (1, 2, 3), which yields the new 'MPPS' feature.
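The TPP object plotted below is not defined in the snippet; presumably it is something along the lines of the group-wise mean survival rate, e.g.:

# mean survival rate for each (Title, Pclass, Parch) combination on the training rows,
# sorted in descending order
TPP = (full[full.Survived.notnull()]
       .groupby(['Title', 'Pclass', 'Parch'])['Survived']
       .mean()
       .sort_values(ascending=False))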
TPP.plot(kind='bar', figsize=(16, 10))
plt.xticks(rotation=40)
plt.axhline(0.8, color='#BA55D3')
plt.axhline(0.5, color='#BA55D3')
plt.annotate('80% survival rate', xy=(30, 0.81), xytext=(32, 0.85), arrowprops=dict(facecolor='#BA55D3', shrink=0.05))
plt.annotate('50% survival rate', xy=(32, 0.51), xytext=(34, 0.54), arrowprops=dict(facecolor='#BA55D3', shrink=0.05))
Seven algorithms were selected (K-nearest neighbors, logistic regression, naive Bayes, decision tree, random forest, gradient boosting tree, and support vector machine), and each was evaluated with cross-validation.
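The models list used in the loop further below is not shown in the snippet; a plausible definition matching the seven names (the default hyperparameters here are placeholders, not the tuned values):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# KNN, LR, NB, Tree, RF, GDBT, SVM, in the same order as the names list below
models = [KNeighborsClassifier(),
          LogisticRegression(),
          GaussianNB(),
          DecisionTreeClassifier(),
          RandomForestClassifier(n_estimators=100),
          GradientBoostingClassifier(),
          SVC()]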
Because K-nearest neighbors and support vector machines are sensitive to the scale of the data, standard scaling is applied first:
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training features, then apply the same scaling to train and test
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
test_X_scaled = scaler.transform(test_X)
The final evaluation results are shown below: logistic regression, the gradient boosting tree, and the support vector machine perform relatively well.
# used scaled data
names = ['KNN', 'LR', 'NB', 'Tree', 'RF', 'GDBT', 'SVM']
for name, model in zip(names, models):
    score = cross_val_score(model, X_scaled, y, cv=5)
    print("{}:{},{}".format(name, score.mean(), score))
Next we can pick one model for error analysis: extract the observations it misclassifies, look for patterns in them, and derive new features from those patterns in the hope of improving the overall accuracy.
Use KFold from sklearn to split the training set into 10 folds, collect the indices of the misclassified observations in each fold, and finally merge them all together.
from sklearn.model_selection import KFold

# 10-fold split as described above; `model` is the classifier chosen for error analysis
kf = KFold(n_splits=10)

# extract the indices of misclassified observations
rr = []
for train_index, val_index in kf.split(X):
    pred = model.fit(X.iloc[train_index], y.iloc[train_index]).predict(X.iloc[val_index])
    rr.append(y.iloc[val_index][pred != y.iloc[val_index]].index.values)

# combine all the indices
whole_index = np.concatenate(rr)
len(whole_index)
First, an overall look at the misclassified observations:
Grouped analysis below reveals that among the misclassified observations, the survival rate of men is as high as 83%, while the survival rate of women is under 50% in every group. This contradicts the earlier finding that women survived at a far higher rate than men, so there are clearly exceptions among both men and women that the model has not learned from the existing features.
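A sketch of that grouped check on the misclassified rows (whole_index comes from the snippet above; this assumes the row index of full is aligned with X and y, and the notebook's actual grouping may be finer-grained):

# survival rate and count of the misclassified passengers, grouped by sex
full.loc[whole_index].groupby('Sex')['Survived'].agg(['mean', 'count'])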
After further analysis I ended up adding a new feature called "MPPS".
# MPPS levels for the Title/Pclass/Parch/SibSp groups found in the error analysis;
# every remaining passenger falls into level 4
full.loc[(full.Title == 'Mr') & (full.Pclass == 1) & (full.Parch == 0) & ((full.SibSp == 0) | (full.SibSp == 1)), 'MPPS'] = 1
full.loc[(full.Title == 'Mr') & (full.Pclass != 1) & (full.Parch == 0) & (full.SibSp == 0), 'MPPS'] = 2
full.loc[(full.Title == 'Miss') & (full.Pclass == 3) & (full.Parch == 0) & (full.SibSp == 0), 'MPPS'] = 3
full.MPPS.fillna(4, inplace=True)
There is not much to say about this part: pick a few parameters and grind away at them with grid search~
param_grid = {'n_estimators': [100, 120, 140, 160], 'learning_rate': [0.05, 0.08, 0.1, 0.12], 'max_depth': [3, 4]}
grid_search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)

grid_search.fit(X_scaled, y)

grid_search.best_params_, grid_search.best_score_
({'learning_rate': 0.12, 'max_depth': 4, 'n_estimators': 100}, 0.85072951739618408)
With this tuning, the Gradient Boosting Decision Tree reaches a cross-validation accuracy of 85%, the best so far.
I used three ensembling methods: Bagging, VotingClassifier, and Stacking.
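The voting and bagging classifiers compared below are not defined in the snippet; a rough sketch of how they could be set up. The member estimators, weights, and base-estimator choice here are illustrative assumptions (the hyperparameters are borrowed from the stacking snippet further below):

from sklearn.ensemble import VotingClassifier, BaggingClassifier

# a few of the tuned base models as voting members (illustrative choice)
estimators = [('lr', LogisticRegression(C=0.06)),
              ('svm', SVC(C=4, gamma=0.015, probability=True)),
              ('gbdt', GradientBoostingClassifier(n_estimators=120, learning_rate=0.12, max_depth=4))]

vc_hard = VotingClassifier(estimators, voting='hard')
vc_soft = VotingClassifier(estimators, voting='soft')
# weighted variants (VCW_*): the same members with hand-picked weights
vcw_hard = VotingClassifier(estimators, voting='hard', weights=[1, 1, 2])
vcw_soft = VotingClassifier(estimators, voting='soft', weights=[1, 1, 2])

# bagging with a support vector machine as the base estimator (illustrative)
bagging = BaggingClassifier(SVC(C=4, gamma=0.015), n_estimators=100)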
An overall comparison of the tuned individual algorithms against Bagging and the VotingClassifiers is shown below:
names = ['KNN', 'LR', 'NB', 'CART', 'RF', 'GBT', 'SVM', 'VC_hard', 'VC_soft', 'VCW_hard', 'VCW_soft', 'Bagging']
for name, model in zip(names, models):
    score = cross_val_score(model, X_scaled, y, cv=5)
    print("{}: {},{}".format(name, score.mean(), score))
At the time of writing, scikit-learn had no built-in implementation of stacking, so I followed the implementations described in these two articles:
https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/
https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
I used logistic regression, K-nearest neighbors, a support vector machine, and a gradient boosting tree as the first-level models, and a random forest as the second-level model.
from sklearn.model_selection import StratifiedKFold

n_train = train.shape[0]
n_test = test.shape[0]
kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)

def get_oof(clf, X, y, test_X):
    # out-of-fold predictions on the training set, and test-set predictions averaged over the folds
    oof_train = np.zeros((n_train,))
    oof_test_mean = np.zeros((n_test,))
    oof_test_single = np.empty((5, n_test))
    for i, (train_index, val_index) in enumerate(kf.split(X, y)):
        kf_X_train = X[train_index]
        kf_y_train = y[train_index]
        kf_X_val = X[val_index]

        clf.fit(kf_X_train, kf_y_train)

        oof_train[val_index] = clf.predict(kf_X_val)
        oof_test_single[i, :] = clf.predict(test_X)
    oof_test_mean = oof_test_single.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test_mean.reshape(-1, 1)

# first-level models: logistic regression, KNN, SVM, gradient boosting tree
LR_train, LR_test = get_oof(LogisticRegression(C=0.06), X_scaled, y, test_X_scaled)
KNN_train, KNN_test = get_oof(KNeighborsClassifier(n_neighbors=8), X_scaled, y, test_X_scaled)
SVM_train, SVM_test = get_oof(SVC(C=4, gamma=0.015), X_scaled, y, test_X_scaled)
GBDT_train, GBDT_test = get_oof(GradientBoostingClassifier(n_estimators=120, learning_rate=0.12, max_depth=4), X_scaled, y, test_X_scaled)

# cross-validation score of stacking (random forest as the second-level model)
stack_score = cross_val_score(RandomForestClassifier(n_estimators=1000), X_stack, y_stack, cv=5)
stack_score.mean(), stack_score
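X_stack and y_stack used in the last lines (and X_test_stack used in the submission code) are not constructed in the snippet; presumably the out-of-fold predictions of the four first-level models are concatenated column-wise, roughly like this:

import numpy as np

# second-level training data: one column of out-of-fold predictions per first-level model
X_stack = np.concatenate([LR_train, KNN_train, SVM_train, GBDT_train], axis=1)
y_stack = y
# second-level test data: the averaged test-set predictions of the first-level models
X_test_stack = np.concatenate([LR_test, KNN_test, SVM_test, GBDT_test], axis=1)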
The final result of stacking:
(0.84069254167070062, array([0.84916201, 0.79888268, 0.85393258, 0.83707865, 0.86440678]))
Overall, judging from the cross-validation results, the ensembles do not improve much over the individual algorithms. Possible reasons:
Finally, the submission:
pred = RandomForestClassifier(n_estimators=500).fit(X_stack, y_stack).predict(X_test_stack)
tt = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': pred})
tt.to_csv('submission.csv', index=False)