I was fortunate to take part in the zero-to-hero financial risk control (loan default prediction) training camp hosted by Alibaba Cloud, and I gained a lot from it. I keep a daily record of my own knowledge blind spots, which need frequent review.

The third learning task is feature engineering. In data science there is a well-known saying: "feature engineering determines the upper bound of a model," which shows how important it is.
First, inspect the missing values in the raw data:
```python
# Check missing values per column
train.isnull().sum()
```

Output:

```
id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           1
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  1
regionCode                0
dti                     239
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies      405
revolBal                  0
revolUtil               531
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     1
policyCode                0
n0                    40270
n1                    40270
n2                    40270
n4                    33239
n5                    40270
n6                    40270
n7                    40270
n8                    40271
n9                    40270
n10                   33239
n11                   69752
n12                   40270
n13                   40270
n14                   40270
dtype: int64
```
For numerical features, a common choice is to fill missing values with the median (more robust to outliers than the mean, and what the code below actually uses); for categorical features, fill with the mode:
```python
# Fill numerical features with the training-set median
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
testA[numerical_fea] = testA[numerical_fea].fillna(train[numerical_fea].median())

# Fill categorical features with the training-set mode
# Note: DataFrame.mode() returns a DataFrame (there may be ties),
# so take the first row with .iloc[0] to get one value per column
train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])
testA[category_fea] = testA[category_fea].fillna(train[category_fea].mode().iloc[0])
```
Note: both the training set and the test set should be filled with statistics (median or mode) computed on the training set, so that missing values are treated consistently. If the training set were filled with its own statistics and the test set with its own, the two sets' distributions would diverge.
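The fit-on-train, fill-both pattern can be sketched on a tiny example (the data and column names here are purely illustrative, not from the competition dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with missing values
train_df = pd.DataFrame({"income": [10.0, 20.0, np.nan, 40.0],
                         "grade":  ["A", "B", "B", None]})
test_df  = pd.DataFrame({"income": [np.nan, 5.0],
                         "grade":  [None, "C"]})

# Statistics come from the TRAINING frame only
income_median = train_df["income"].median()    # median of [10, 20, 40] -> 20.0
grade_mode = train_df["grade"].mode().iloc[0]  # most frequent grade -> "B"

# Both frames are filled with the same training-set statistics
for df in (train_df, test_df):
    df["income"] = df["income"].fillna(income_median)
    df["grade"] = df["grade"].fillna(grade_mode)

print(test_df["income"].tolist())  # [20.0, 5.0] -- train's median, not test's
print(test_df["grade"].tolist())   # ['B', 'C']
```

The key point is that the test set's own median (5.0) is never used; both sets see the same fill values.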
Now check the missing-value counts again:
```python
train.isnull().sum()
```

Output:

```
id                        0
loanAmnt                  0
term                      0
interestRate              0
installment               0
grade                     0
subGrade                  0
employmentTitle           0
employmentLength      46799
homeOwnership             0
annualIncome              0
verificationStatus        0
issueDate                 0
isDefault                 0
purpose                   0
postCode                  0
regionCode                0
dti                       0
delinquency_2years        0
ficoRangeLow              0
ficoRangeHigh             0
openAcc                   0
pubRec                    0
pubRecBankruptcies        0
revolBal                  0
revolUtil                 0
totalAcc                  0
initialListStatus         0
applicationType           0
earliesCreditLine         0
title                     0
policyCode                0
n0                        0
n1                        0
n2                        0
n4                        0
n5                        0
n6                        0
n7                        0
n8                        0
n9                        0
n10                       0
n11                       0
n12                       0
n13                       0
n14                       0
dtype: int64
```
As you can see, only the employmentLength column remains unprocessed. It is messier than the others: it mixes values such as "5 years", "10+ years", "< 1 year", and NaN, so it takes more work to handle.
Before that, let's handle the issueDate column using pandas' built-in datetime tools.
```python
import datetime

import pandas as pd

# Convert issueDate to datetime and build a time-delta feature
for data in [train, testA]:
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # issueDateDT: days elapsed since the reference date 2007-06-01
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x - startdate).dt.days
```
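A quick sanity check of this days-since-reference feature on two hypothetical dates (the vectorized subtraction shown here is equivalent to the `apply` version above):

```python
import datetime

import pandas as pd

# Toy frame with two issue dates (illustrative, not competition data)
df = pd.DataFrame({'issueDate': ['2007-06-01', '2008-06-01']})
df['issueDate'] = pd.to_datetime(df['issueDate'], format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')

# Subtracting a datetime from a datetime column yields timedeltas;
# .dt.days extracts the whole-day count
df['issueDateDT'] = (df['issueDate'] - startdate).dt.days

print(df['issueDateDT'].tolist())  # [0, 366] -- the span contains 2008-02-29
```

The reference date itself maps to 0, and later dates to a positive integer that a tree or linear model can consume directly.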
Now back to employmentLength. First look at its raw distribution:
```python
train['employmentLength'].value_counts(dropna=False).sort_index()
```

Output:

```
1 year        52489
10+ years    262753
2 years       72358
3 years       64152
4 years       47985
5 years       50102
6 years       37254
7 years       35407
8 years       36192
9 years       30272
< 1 year      64237
NaN           46799
Name: employmentLength, dtype: int64
```
Map "10+ years" to 10 and "< 1 year" to 0, then parse the leading number into an integer:

```python
import numpy as np
import pandas as pd

def employmentLength_to_int(s):
    # Keep NaN as-is; otherwise take the leading number, e.g. "5 years" -> 5
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

for data in [train, testA]:
    data['employmentLength'].replace('10+ years', '10 years', inplace=True)
    data['employmentLength'].replace('< 1 year', '0 years', inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
```
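On a small hypothetical Series, the replace-then-parse steps behave like this (a self-contained sketch of the same logic):

```python
import numpy as np
import pandas as pd

def employmentLength_to_int(s):
    # NaN stays NaN; "5 years" -> 5
    if pd.isnull(s):
        return s
    return np.int8(s.split()[0])

# Illustrative values covering all three special cases plus NaN
s = pd.Series(['10+ years', '< 1 year', '5 years', np.nan])
s = s.replace('10+ years', '10 years').replace('< 1 year', '0 years')
s = s.apply(employmentLength_to_int)

print(s.tolist())
```

"10+ years" becomes 10, "< 1 year" becomes 0, "5 years" becomes 5, and NaN passes through untouched (which is why the NaN count survives in the output below).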
Now check the result after processing:
```python
# data here is testA, the last DataFrame in the loop above
data['employmentLength'].value_counts(dropna=False).sort_index()
```

Output:

```
0.0      15989
1.0      13182
2.0      18207
3.0      16011
4.0      11833
5.0      12543
6.0       9328
7.0       8823
8.0       8976
9.0       7594
10.0     65772
NaN      11742
Name: employmentLength, dtype: int64
```
Much cleaner now. Nice!