This post uses the Kaggle Titanic competition as a starting point to introduce several feature-engineering techniques, then trains three models (RF, GBDT, SVM) and combines them with an ensemble method (a Voting Classifier) for prediction.
The complete code and data are available at ReMachineLearning(titanic) - Github.
Below is Kaggle's introduction to the competition.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
In short: given a passenger's attributes, predict whether that passenger survived. This is a binary classification problem.
This post first applies some feature-engineering techniques, then trains Random Forests, GBDT (Gradient Boosting Decision Tree), and SVM (Support Vector Machine) models, and finally ensembles them with a Voting Classifier.
It is based on Python 3, sklearn, and Pandas (installing Anaconda is strongly recommended). The data comes from Kaggle, and both code and data are available on Github.
First, a quick look at the structure of the training CSV:
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
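For completeness, here is a sketch of loading such a CSV with pandas. The inline sample stands in for the real training file so the snippet is self-contained; the filename `train.csv` used on Kaggle is an assumption of mine, not shown in the post.

```python
import io

import pandas as pd

# In practice this would be: df = pd.read_csv('train.csv')
# Here a two-row inline sample stands in for the real file.
csv_text = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C'''

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 12)
```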
We can see that Sex is an enumerated string, male or female. To make this column easier for the computer to process, we convert it to numbers:
# API docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
# Numericize the column with map and cast its dtype to int
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)
df.info() gives a quick view of which columns are incomplete:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null int64
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(6), object(4)
memory usage: 83.6+ KB
We can see that Age, Cabin, and Embarked are incomplete, so we need to fill them in. For the theory behind handling missing data in machine learning, see the linked material.
My approach in this example is:
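The filling itself can be sketched as below: median for Age, most frequent value for Embarked. This is a common baseline of my choosing, not necessarily the exact strategy in the original code; the small frame stands in for the training DataFrame.

```python
import pandas as pd

# Stand-in data with the same missing pattern as Age and Embarked.
df = pd.DataFrame({
    'Age': [22.0, None, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Fill missing Age with the median age; fill missing Embarked with the most common port.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

print(df.isnull().sum().sum())  # 0 -- no missing values remain
```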
SibSp is the number of siblings or spouses aboard with the passenger; Parch is the number of parents or children aboard.
Add a FamilySize column as the sum of SibSp and Parch, representing the total number of family members aboard. Mining hidden attributes or attribute combinations out of the raw data like this is called deriving attributes. But these mined features raise some questions: honestly, I don't quite understand why summing SibSp and Parch into FamilySize is reasonable, and I can't justify it.
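The derived column itself is a one-liner (the small frame is a stand-in for the training data; some Kaggle kernels also add 1 to count the passenger themselves, which is not done here):

```python
import pandas as pd

df = pd.DataFrame({'SibSp': [1, 0, 1], 'Parch': [0, 0, 2]})

# Derived attribute: total number of accompanying family members.
df['FamilySize'] = df['SibSp'] + df['Parch']
print(df['FamilySize'].tolist())  # [1, 0, 3]
```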
Fare is the ticket price.
After getting the data, the first thing to do is analyze it and its features.
pandas can compute correlation coefficients via the DataFrame corr method (three methods are available: pearson, kendall, and spearman). I won't expand on them here; the point is that corr quantifies how correlated the columns are.
# API docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html
import matplotlib.pyplot as plt
import numpy as np

def plot_corr(df, size=10):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot
    '''
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    # Annotate each cell with its correlation coefficient
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:.2f}'.format(z), ha='center', va='center')
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

# Feature correlation plot
plot_corr(df)
As mentioned earlier, we use three models plus one ensemble method; there isn't much to elaborate on here.
Random Forests, GBDT (Gradient Boosting Decision Tree), and SVM (Support Vector Machine) serve as the training models, and a Voting Classifier ensembles them.
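A minimal sketch of that setup with sklearn is below. The toy data and hyperparameters are placeholders of mine, not the settings behind the reported scores; hard voting is used, so SVC does not need probability=True.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import SVC

# Toy binary-classification data standing in for the engineered Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
        ('gbdt', GradientBoostingClassifier(random_state=0)),
        ('svm', SVC(random_state=0)),
    ],
    voting='hard',  # majority vote over the three models' predicted labels
)
clf.fit(X, y)
print(clf.score(X, y))
```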
The scores of the Voting ensemble and of each individual model are as follows:
I plan to write a separate post on each algorithm involved here (consider the holes dug). This post took about two months from start to finish, with one long pause in the middle.