print("Shape of Train_set:",train_data.shape) print("Shape of Test_set:",test_data.shape)
Shape of Train_set: (800000, 47) Shape of Test_set: (200000, 46)
这里发现训练集比测试集多了一行python
diff = train_col.difference(test_col) print(diff)
Index(['isDefault'], dtype='object')
发现多出的一行正是咱们本次的labelweb
查看数据类型app
train_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 800000 entries, 0 to 799999 Data columns (total 47 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 800000 non-null int64 1 loanAmnt 800000 non-null float64 2 term 800000 non-null int64 3 interestRate 800000 non-null float64 4 installment 800000 non-null float64 5 grade 800000 non-null object 6 subGrade 800000 non-null object 7 employmentTitle 799999 non-null float64 8 employmentLength 753201 non-null object 9 homeOwnership 800000 non-null int64 10 annualIncome 800000 non-null float64 11 verificationStatus 800000 non-null int64 12 issueDate 800000 non-null object 13 isDefault 800000 non-null int64 14 purpose 800000 non-null int64 15 postCode 799999 non-null float64 16 regionCode 800000 non-null int64 17 dti 799761 non-null float64 18 delinquency_2years 800000 non-null float64 19 ficoRangeLow 800000 non-null float64 20 ficoRangeHigh 800000 non-null float64 21 openAcc 800000 non-null float64 22 pubRec 800000 non-null float64 23 pubRecBankruptcies 799595 non-null float64 24 revolBal 800000 non-null float64 25 revolUtil 799469 non-null float64 26 totalAcc 800000 non-null float64 27 initialListStatus 800000 non-null int64 28 applicationType 800000 non-null int64 29 earliesCreditLine 800000 non-null object 30 title 799999 non-null float64 31 policyCode 800000 non-null float64 32 n0 759730 non-null float64 33 n1 759730 non-null float64 34 n2 759730 non-null float64 35 n3 759730 non-null float64 36 n4 766761 non-null float64 37 n5 759730 non-null float64 38 n6 759730 non-null float64 39 n7 759730 non-null float64 40 n8 759729 non-null float64 41 n9 759730 non-null float64 42 n10 766761 non-null float64 43 n11 730248 non-null float64 44 n12 759730 non-null float64 45 n13 759730 non-null float64 46 n14 759730 non-null float64 dtypes: float64(33), int64(9), object(5) memory usage: 286.9+ MB
由结果咱们看得出,大部分的features是numeral类型,只有5个是object类型,咱们要重点来思考下如何去量化咱们的object类型数据。svg
看下每一个缺失值状况post
train_data.isnull().sum()
id 0 loanAmnt 0 term 0 interestRate 0 installment 0 grade 0 subGrade 0 employmentTitle 1 employmentLength 46799 homeOwnership 0 annualIncome 0 verificationStatus 0 issueDate 0 isDefault 0 purpose 0 postCode 1 regionCode 0 dti 239 delinquency_2years 0 ficoRangeLow 0 ficoRangeHigh 0 openAcc 0 pubRec 0 pubRecBankruptcies 405 revolBal 0 revolUtil 531 totalAcc 0 initialListStatus 0 applicationType 0 earliesCreditLine 0 title 1 policyCode 0 n0 40270 n1 40270 n2 40270 n3 40270 n4 33239 n5 40270 n6 40270 n7 40270 n8 40271 n9 40270 n10 33239 n11 69752 n12 40270 n13 40270 n14 40270 dtype: int64
不难发现缺失值大部分集中在n系列匿名特征和employmentLength就业年限(年)中。学习
接下来要把数据拆开分析
测试