This project is based on a competition Facebook hosted on Kaggle (see the data source for the link); the full code is on my GitHub, feel free to take a look.
Data exploration — Data_Exploration.ipynb
Data preprocessing & feature engineering — Feature_Engineering.ipynb & Feature_Engineering2.ipynb
Model design and evaluation — Model_Design.ipynb
kaggle
numpy
pandas
matplotlib
sklearn
xgboost
mlxtend: provides the Stacking ensemble method
The whole project takes roughly 60 minutes to run on Ubuntu with 8 GB of RAM; the results are in the submitted Jupyter notebook files.
Since the write-up is long, it is split into two posts covering four parts in total:
Data exploration
Data preprocessing and feature engineering
Model design
Evaluation and summary
import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display
df_bids = pd.read_csv('bids.csv', low_memory=False)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_bids.head()
| | bid_id | bidder_id | auction | merchandise | device | time | country | ip | url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
| 1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
| 2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
| 3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
| 4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
df_train.head() # df_train.dtypes
| | bidder_id | payment_account | address | outcome |
| --- | --- | --- | --- | --- |
| 0 | 91a3c57b13234af24875c56fb7e2b2f4rb56a | a3d2de7675556553a5f08e4c88d2c228754av | a3d2de7675556553a5f08e4c88d2c228vt0u4 | 0.0 |
| 1 | 624f258b49e77713fc34034560f93fb3hu3jo | a3d2de7675556553a5f08e4c88d2c228v1sga | ae87054e5a97a8f840a3991d12611fdcrfbq3 | 0.0 |
| 2 | 1c5f4fc669099bfbfac515cd26997bd12ruaj | a3d2de7675556553a5f08e4c88d2c2280cybl | 92520288b50f03907041887884ba49c0cl0pd | 0.0 |
| 3 | 4bee9aba2abda51bf43d639013d6efe12iycd | 51d80e233f7b6a7dfdee484a3c120f3b2ita8 | 4cb9717c8ad7e88a9a284989dd79b98dbevyi | 0.0 |
| 4 | 4ab12bc61c82ddd9c2d65e60555808acqgos1 | a3d2de7675556553a5f08e4c88d2c22857ddh | 2a96c3ce94b3be921e0296097b88b56a7x1ji | 0.0 |
# Check whether any table contains missing values
print('Is there any missing value in bids?', df_bids.isnull().any().any())
print('Is there any missing value in train?', df_train.isnull().any().any())
print('Is there any missing value in test?', df_test.isnull().any().any())
Is there any missing value in bids? True
Is there any missing value in train? False
Is there any missing value in test? False
Checking all three datasets for missing values shows that the train and test user tables are complete, while the bids table does contain missing values. Next we look into where the missing values in bids are.
# nan_rows = df_bids[df_bids.isnull().T.any().T]
# print(nan_rows)
pd.isnull(df_bids).any()
bid_id         False
bidder_id      False
auction        False
merchandise    False
device         False
time           False
country         True
ip             False
url            False
dtype: bool
missing_country = df_bids['country'].isnull().sum()
print('No. of missing country: ', missing_country)
normal_country = df_bids['country'].notnull().sum()
print('No. of normal country: ', normal_country)
No. of missing country:  8859
No. of normal country:  7647475
import matplotlib.pyplot as plt

labels = ['unknown', 'normal']
sizes = [missing_country, normal_country]
explode = (0.1, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')
plt.title('Distribution of missing countries vs. normal countries')
plt.show()
The analysis above shows that a small fraction of bid records are missing the country attribute. These missing values can be filled in during preprocessing, with two possible approaches:
Group the raw bid records by bidder, then fill each bidder's missing values with the country that bidder most often bids from (their "home" country).
Group the raw bid records by bidder, sort each group by time, then forward- or backward-fill the missing values from the neighbouring country records.
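Both ideas can be sketched on a tiny hypothetical frame (`demo_bids` and its values are made up; the real data uses the same column names):

```python
import pandas as pd

# Hypothetical mini bid log with two missing countries
demo_bids = pd.DataFrame({
    'bidder_id': ['a', 'a', 'a', 'b', 'b'],
    'time':      [1, 2, 3, 1, 2],
    'country':   ['us', None, 'us', None, 'in'],
})

# Idea 1: fill with each bidder's most frequent ("home") country
mode_fill = demo_bids.groupby('bidder_id')['country'].transform(
    lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s)

# Idea 2: order each bidder's bids by time, then forward/backward fill
ordered = demo_bids.sort_values(['bidder_id', 'time'])
ts_fill = ordered.groupby('bidder_id')['country'].ffill()
ts_fill = ts_fill.groupby(ordered['bidder_id']).bfill()
```

A bidder whose country never appears (all NaN) stays unfilled under both strategies, which is exactly the residual case handled later.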
# Number of records in each dataset
# Check whether each id column is a unique identifier
print(df_bids.shape[0])
print(len(df_bids['bid_id'].unique()))
print(df_train.shape[0])
print(len(df_train['bidder_id'].unique()))
print(df_test.shape[0])
print(len(df_test['bidder_id'].unique()))
7656334
7656334
2013
2013
4700
4700
# Count the distinct values of each basic (categorical) feature, excluding time
print('total bidder in bids: ', len(df_bids['bidder_id'].unique()))
print('total auction in bids: ', len(df_bids['auction'].unique()))
print('total merchandise in bids: ', len(df_bids['merchandise'].unique()))
print('total device in bids: ', len(df_bids['device'].unique()))
print('total country in bids: ', len(df_bids['country'].unique()))
print('total ip in bids: ', len(df_bids['ip'].unique()))
print('total url in bids: ', len(df_bids['url'].unique()))
total bidder in bids:  6614
total auction in bids:  15051
total merchandise in bids:  10
total device in bids:  7351
total country in bids:  200
total ip in bids:  2303991
total url in bids:  1786351
The basic counts above show that:
The number of bidders in bids is smaller than in train + test, i.e. they are not in one-to-one correspondence; next we verify whether every bidder in bids comes from the train or test set.
Merchandise and country have relatively few distinct values, so they are natural categorical features to extract; the remaining features are better summarised by counting.
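Both the distinct-value counting used above and the per-bidder counting planned for the remaining features reduce to `nunique`; a sketch on made-up data:

```python
import pandas as pd

# Hypothetical mini bid log
demo_bids = pd.DataFrame({
    'bidder_id': ['a', 'a', 'a', 'b'],
    'device':    ['phone0', 'phone1', 'phone0', 'phone2'],
    'country':   ['us', 'us', 'in', 'py'],
})

# Distinct values per column, the same numbers len(df[col].unique()) gives above
overall = demo_bids.nunique()

# Per-bidder distinct counts -- the counting statistics built later for each user
per_bidder = demo_bids.groupby('bidder_id')[['device', 'country']].nunique()
```

Note `nunique` ignores NaN, while `len(df[col].unique())` counts it, which is why the country count above includes the missing-value slot.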
lst_all_users = (df_train['bidder_id'].unique()).tolist() + (df_test['bidder_id'].unique()).tolist()
print('total bidders of train and test set', len(lst_all_users))
lst_bidder = (df_bids['bidder_id'].unique()).tolist()
print('total bidders in bids set', len(lst_bidder))
print('Is bidders in bids are all from train+test set? ', set(lst_bidder).issubset(set(lst_all_users)))
total bidders of train and test set 6713
total bidders in bids set 6614
Is bidders in bids are all from train+test set?  True
lst_nobids = [i for i in lst_all_users if i not in lst_bidder]
print('No. of bidders never bid: ', len(lst_nobids))
lst_nobids_train = [i for i in lst_nobids if i in (df_train['bidder_id'].unique()).tolist()]
lst_nobids_test = [i for i in lst_nobids if i in (df_test['bidder_id'].unique()).tolist()]
print('No. of bidders never bid in train set: ', len(lst_nobids_train))
print('No. of bidders never bid in test set: ', len(lst_nobids_test))
No. of bidders never bid:  99
No. of bidders never bid in train set:  29
No. of bidders never bid in test set:  70
data_source = ['train', 'test']
y_pos = np.arange(len(data_source))
num_never_bids = [len(lst_nobids_train), len(lst_nobids_test)]
plt.bar(y_pos, num_never_bids, align='center', alpha=0.5)
plt.xticks(y_pos, data_source)
plt.ylabel('bidders no bids')
plt.title('Source of no bids bidders')
plt.show()
print(df_train[(df_train['bidder_id'].isin(lst_nobids_train)) & (df_train['outcome'] == 1.0)])
Empty DataFrame
Columns: [bidder_id, payment_account, address, outcome]
Index: []
The computation above shows that 99 bidders have no bid records at all: 29 from the train set and 70 from the test set. None of the 29 train-set bidders is labelled as a robot, so the 70 test-set bidders can later be labelled as human, or simply assigned the mean predicted value.
# check the proportion of bots in train
print((df_train[df_train['outcome'] == 1].shape[0] * 1.0) / df_train.shape[0] * 100, '%')
5.11674118231 %
About 5% of the users in the train set are labelled as robots.
df_train.groupby('outcome').size().plot(labels=['Human', 'Robot'], kind='pie', autopct='%.2f', figsize=(4, 4), title='Distribution of Human vs. Robots', legend=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f477135c5d0>
The class distribution above shows that the dataset is imbalanced, so we will use AUC, which is insensitive to the class ratio, as the evaluation metric, and lean towards Gradient Boosting family models for training.
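Why AUC ignores the class ratio is visible from its rank-statistic form: it is the probability that a randomly chosen positive is scored above a randomly chosen negative, and the class counts only appear as normalisers. A numpy-only sketch (simplified, no tie handling):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann-Whitney U statistic: P(score of a random positive
    > score of a random negative). Class imbalance only rescales the
    denominator n_pos * n_neg, so the metric itself is unaffected."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

`sklearn.metrics.roc_auc_score`, used later for the baseline, computes the same quantity with proper tie handling.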
import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
bids = pd.read_csv('bids.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
To address the missing country values found during data exploration, we group the raw bid records by bidder, sort each group by time, and forward-/backward-fill the missing values from the neighbouring country records.
display(bids.head())
| | bid_id | bidder_id | auction | merchandise | device | time | country | ip | url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
| 1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
| 2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
| 3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
| 4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
# pd.algos.is_monotonic_int64(bids.time.values, True)[0]
print('Is the time monotonically non-decreasing? ', pd.Index(bids['time']).is_monotonic_increasing)
Is the time monotonically non-decreasing? False
# bidder_group = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].ffill()
bids['country'] = bids.sort_values(['bidder_id', 'time']).groupby('bidder_id')['country'].bfill()
display(bids.head())
| | bid_id | bidder_id | auction | merchandise | device | time | country | ip | url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 | ewmzr | jewelry | phone0 | 9759243157894736 | us | 69.166.231.58 | vasstdc27m7nks3 |
| 1 | 1 | 668d393e858e8126275433046bbd35c6tywop | aeqok | furniture | phone1 | 9759243157894736 | in | 50.201.125.84 | jmqlhflrzwuay9c |
| 2 | 2 | aa5f360084278b35d746fa6af3a7a1a5ra3xe | wa00e | home goods | phone2 | 9759243157894736 | py | 112.54.208.157 | vasstdc27m7nks3 |
| 3 | 3 | 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi | jefix | jewelry | phone4 | 9759243157894736 | in | 18.99.175.133 | vasstdc27m7nks3 |
| 4 | 4 | 8393c48eaf4b8fa96886edc7cf27b372dsibi | jefix | jewelry | phone5 | 9759243157894736 | in | 145.138.5.37 | vasstdc27m7nks3 |
print('Is there any missing value in bids?', bids.isnull().any().any())
# pickle.dump(bids, open('bids.pkl', 'wb'))
Is there any missing value in bids? True
missing_country = bids['country'].isnull().sum()
print('No. of missing country: ', missing_country)
normal_country = bids['country'].notnull().sum()
print('No. of normal country: ', normal_country)
No. of missing country:  5
No. of normal country:  7656329
nan_rows = bids[bids.isnull().any(axis=1)]
print(nan_rows)
          bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN
2754184            mobile  phone150  9633363947368421     NaN
2836631           jewelry   phone72  9634034894736842     NaN
3125892   books and music  phone106  9635755105263157     NaN
5153748            mobile  phone267  9645270210526315     NaN

                      ip              url
1351177   80.211.119.111  g9pgdfci3yseml5
2754184     20.67.240.88  ctivbfq55rktail
2836631  149.210.107.205  vasstdc27m7nks3
3125892      26.23.62.59  ac9xlqtfg0cx5c5
5153748     145.7.194.40  0em0vg1f0zuxonw
# print(bids[bids['bid_id'] == 1351177])
nan_bidder = nan_rows['bidder_id'].values.tolist()
# print(nan_bidder)
print(bids[bids['bidder_id'].isin(nan_bidder)])
          bid_id                              bidder_id auction  \
1351177  1351177  f3ab8c9ecc0d021ebc81e89f20c8267bn812w   jefix
2754184  2754184  88ef9cfdbec4c9e33f6c2e0b512e7a01dp2p2   cc5fs
2836631  2836631  29b8af2fea3881ef61911612372dac41vczqv   jqx39
3125892  3125892  df20f216cbb0b0df5a7b2e94b16a7853iyw9g   jqx39
5153748  5153748  5e05ec450e2dd64d7996a08bbbca4f126nzzk   jqx39

              merchandise    device              time country  \
1351177  office equipment   phone84  9767200789473684     NaN
2754184            mobile  phone150  9633363947368421     NaN
2836631           jewelry   phone72  9634034894736842     NaN
3125892   books and music  phone106  9635755105263157     NaN
5153748            mobile  phone267  9645270210526315     NaN

                      ip              url
1351177   80.211.119.111  g9pgdfci3yseml5
2754184     20.67.240.88  ctivbfq55rktail
2836631  149.210.107.205  vasstdc27m7nks3
3125892      26.23.62.59  ac9xlqtfg0cx5c5
5153748     145.7.194.40  0em0vg1f0zuxonw
After the per-bidder forward/backward fill over time, 5 records still have a missing country. It turns out they come from 5 bidders who each placed only a single bid, so there was no neighbouring record to fill from. Let us see what else characterises these 5 users.
lst_nan_train = [i for i in nan_bidder if i in (train['bidder_id'].unique()).tolist()]
lst_nan_test = [i for i in nan_bidder if i in (test['bidder_id'].unique()).tolist()]
print('No. of bidders 1 bid in train set: ', len(lst_nan_train))
print('No. of bidders 1 bid in test set: ', len(lst_nan_test))
No. of bidders 1 bid in train set:  1
No. of bidders 1 bid in test set:  4
print(train[train['bidder_id'] == lst_nan_train[0]]['outcome'])
546    0.0
Name: outcome, dtype: float64
Each of these 5 bidders placed only one bid; 1 comes from the train set (labelled human) and 4 from the test set. Given how little behaviour there is to learn from, we drop their bid records; the 4 test-set bidders will later be handled like the no-bid bidders, i.e. filled with the final model's mean predicted value.
bid_to_drop = nan_rows.index.values.tolist()
# print(bid_to_drop)
bids.drop(bid_to_drop, inplace=True)  # drop by index label; labels equal positions here
print('Is there any missing value in bids?', bids.isnull().any().any())
pickle.dump(bids, open('bids.pkl', 'wb'))
Is there any missing value in bids? False
As found during exploration, the dataset consists mostly of categorical or discrete data, so we start by grouping the bid records by bidder and counting the distinct values of each attribute: devices used, countries bid from, distinct IPs, and so on.
# group by bidder to do some statistics
bidders = bids.groupby('bidder_id')
# pickle.dump(bids, open('bidders.pkl', 'wb'))
# print(bidders['device'].count())
def feature_count(group):
    dct_cnt = {}
    dct_cnt['devices_c'] = group['device'].unique().shape[0]
    dct_cnt['countries_c'] = group['country'].unique().shape[0]
    dct_cnt['ip_c'] = group['ip'].unique().shape[0]
    dct_cnt['url_c'] = group['url'].unique().shape[0]
    dct_cnt['auction_c'] = group['auction'].unique().shape[0]
    dct_cnt['auc_mean'] = np.mean(group['auction'].value_counts())  # bids_c / auction_c
    # dct_cnt['dev_mean'] = np.mean(group['device'].value_counts())  # bids_c / devices_c
    dct_cnt['merch_c'] = group['merchandise'].unique().shape[0]
    dct_cnt['bids_c'] = group.shape[0]
    return pd.Series(dct_cnt)
cnt_bidder = bidders.apply(feature_count)
display(cnt_bidder.describe())
# cnt_bidder.to_csv('cnt_bidder.csv')
# print(cnt_bidder[cnt_bidder['merch_c'] == 2])
| | auc_mean | auction_c | bids_c | countries_c | devices_c | ip_c | merch_c | url_c |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 | 6609.000000 |
| mean | 6.593493 | 57.850810 | 1158.470117 | 12.733848 | 73.492359 | 544.507187 | 1.000151 | 290.964140 |
| std | 30.009242 | 131.814053 | 9596.595169 | 22.556570 | 172.171106 | 3370.730666 | 0.012301 | 2225.912425 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 1.000000 | 2.000000 | 3.000000 | 1.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 |
| 50% | 1.677419 | 10.000000 | 18.000000 | 3.000000 | 8.000000 | 12.000000 | 1.000000 | 5.000000 |
| 75% | 4.142857 | 47.000000 | 187.000000 | 12.000000 | 57.000000 | 111.000000 | 1.000000 | 36.000000 |
| max | 1327.366667 | 1726.000000 | 515033.000000 | 178.000000 | 2618.000000 | 111918.000000 | 2.000000 | 81376.000000 |
With the bids grouped by bidder, we build a scatter matrix over each pair of engineered features to inspect their correlations.
# Build a scatter matrix over every pair of features in the data
pd.plotting.scatter_matrix(cnt_bidder, alpha=0.3, figsize=(16, 10), diagonal='kde');
From the per-bidder statistics above (timestamps not yet considered), a few basic conclusions emerge:
Comparing the max against the median and 75th percentile, every feature except the merchandise count shows some extreme values, which may be worth watching as anomalous behaviour.
All features are heavily skewed, so we take the logarithm of each and re-plot the scatter matrix to re-check correlations.
The merchandise count has tiny variance: judging by the median and even the 75th percentile, most bidders only ever bid within a single merchandise category. Since exploration showed merchandise is naturally categorical, we will count each category separately instead, and drop this column from the count features.
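Counting each merchandise category separately, as the last point proposes, can be sketched as (hypothetical mini log):

```python
import pandas as pd

# Hypothetical mini bid log
demo_bids = pd.DataFrame({
    'bidder_id':   ['a', 'a', 'b', 'b', 'b'],
    'merchandise': ['jewelry', 'jewelry', 'mobile', 'mobile', 'books and music'],
})

# One-hot encode merchandise, then sum per bidder: each of the ~10
# categories becomes its own count feature instead of one merch_c column
merch_counts = pd.get_dummies(demo_bids['merchandise']).groupby(demo_bids['bidder_id']).sum()
```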
cnt_bidder.drop('merch_c', axis=1, inplace=True)
cnt_bidder = np.log(cnt_bidder)
pd.plotting.scatter_matrix(cnt_bidder, alpha=0.3, figsize=(16, 10), diagonal='kde');
The scatter matrix shows no strong correlations between the behavioural features. The IP count does show slight positive correlation with the bid count and device count after the log transform, but since this only appears post-transform and remains weak, we keep all three features.
To examine the anomalies mentioned above, we pick a few robot and human samples from the original train set and trace their per-bidder statistics side by side.
cnt_bidder.to_csv('cnt_bidder.csv')
# trace samples: first 2 bots, last 2 humans
indices = ['9434778d2268f1fa2a8ede48c0cd05c097zey', 'aabc211b4cf4d29e4ac7e7e361371622pockb',
           'd878560888b11447e73324a6e263fbd5iydo1', '91a3c57b13234af24875c56fb7e2b2f4rb56a']
# build a DataFrame for the chosen indices
samples = pd.DataFrame(cnt_bidder.loc[indices], columns=cnt_bidder.keys()).reset_index(drop=True)
print("Chosen samples of training dataset: (first 2 bots, last 2 humans)")
display(samples)
Chosen samples of training dataset: (first 2 bots, last 2 humans)
| | auc_mean | auction_c | bids_c | countries_c | devices_c | ip_c | url_c |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3.190981 | 5.594711 | 8.785692 | 4.174387 | 6.011267 | 8.147578 | 7.557995 |
| 1 | 2.780432 | 4.844187 | 7.624619 | 2.639057 | 3.178054 | 5.880533 | 1.609438 |
| 2 | 0.287682 | 1.098612 | 1.386294 | 1.098612 | 1.386294 | 1.386294 | 0.000000 |
| 3 | 0.287682 | 2.890372 | 3.178054 | 1.791759 | 2.639057 | 2.995732 | 0.000000 |
Use seaborn to plot a heatmap for the four samples above and look at their percentile ranks.
import matplotlib.pyplot as plt
import seaborn as sns

# look at percentile ranks
pcts = 100. * cnt_bidder.rank(axis=0, pct=True).loc[indices].round(decimals=3)
print(pcts)
# visualize percentiles with a heatmap
sns.heatmap(pcts, yticklabels=['robot 1', 'robot 2', 'human 1', 'human 2'],
            annot=True, linewidth=.1, vmax=99, fmt='.1f', cmap='YlGnBu')
plt.title('Percentile ranks of\nsamples\' feature statistics')
plt.xticks(rotation=45, ha='center');
                                       auc_mean  auction_c  bids_c  \
bidder_id
9434778d2268f1fa2a8ede48c0cd05c097zey      94.9       94.6    97.0
aabc211b4cf4d29e4ac7e7e361371622pockb      92.4       87.2    92.3
d878560888b11447e73324a6e263fbd5iydo1      39.8       30.4    30.2
91a3c57b13234af24875c56fb7e2b2f4rb56a      39.8       60.2    53.0

                                       countries_c  devices_c  ip_c  url_c
bidder_id
9434778d2268f1fa2a8ede48c0cd05c097zey         95.4       95.6  96.7   97.4
aabc211b4cf4d29e4ac7e7e361371622pockb         77.3       63.8  84.8   50.3
d878560888b11447e73324a6e263fbd5iydo1         48.8       38.7  34.2   13.4
91a3c57b13234af24875c56fb7e2b2f4rb56a         63.7       56.8  56.2   13.4
The heatmap comparison shows that, except for the merchandise statistics, the robots rank higher than the human users on every metric, which suggests a baseline model built on simple rules over these statistics. The most discriminative feature looks to be auc_mean, a user's average number of bids per auction. We first hunt for the anomalies in these statistics using standard outlier handling.
Since the end goal is to find the robots among the bidders, and common sense says a robot bids far more frequently than a human, we can design a naive classifier from an outlier-detection angle: for each per-bidder count feature, list the bidders whose value looks anomalous, then merge the per-feature lists and treat bidders that are anomalous on enough features as robots.
# find the outliers for each feature
lst_outlier = []
for feature in cnt_bidder.keys():
    # 25th percentile
    Q1 = np.percentile(cnt_bidder[feature], 25)
    # 75th percentile
    Q3 = np.percentile(cnt_bidder[feature], 75)
    step = 1.5 * (Q3 - Q1)
    # show the outliers
    # print("Data points considered outliers for the feature '{}':".format(feature))
    outliers = cnt_bidder[~((cnt_bidder[feature] >= Q1 - step) & (cnt_bidder[feature] <= Q3 + step))]
    display(outliers)
    lst_outlier += outliers.index.values.tolist()
Having collected every bidder id that might be an 'outlier' on some feature, we can tally the lists and pick out bidders anomalous on at least some number of features. Since bots make up under 5% of the train set, after some testing we settle on flagging bidders anomalous on at least one feature; intersecting this flagged set with the train set gives a user subset that can serve as a naive classifier.
# print(len(lst_outlier))
from collections import Counter
freq_outlier = dict(Counter(lst_outlier))
perhaps_outlier = [i for i in freq_outlier if freq_outlier[i] >= 1]
print(len(perhaps_outlier))
214
# basic_pred = test[test['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
train_pred = train[train['bidder_id'].isin(perhaps_outlier)]['bidder_id'].tolist()
print(len(train_pred))
76
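End to end, the fence-and-vote rule just built behaves like this on made-up numbers (`cnt` is a hypothetical stand-in for cnt_bidder):

```python
import numpy as np
import pandas as pd
from collections import Counter

# Hypothetical per-bidder counts with one obviously heavy bidder
cnt = pd.DataFrame({'bids_c': [1, 2, 2, 3, 100],
                    'ip_c':   [1, 1, 2, 2, 80]},
                   index=['u1', 'u2', 'u3', 'u4', 'bot'])

flagged = []
for col in cnt.columns:
    q1, q3 = np.percentile(cnt[col], [25, 75])
    step = 1.5 * (q3 - q1)                        # Tukey's fences
    mask = (cnt[col] < q1 - step) | (cnt[col] > q3 + step)
    flagged += cnt.index[mask].tolist()

votes = Counter(flagged)
candidates = [u for u, c in votes.items() if c >= 1]  # anomalous on >= 1 feature
```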
Exploration showed the negative/positive ratio is about 19:1, which is fairly imbalanced, so we use AUC, unaffected by the class ratio, as the metric, and score the naive classifier on the original train set to get a baseline.
from sklearn.metrics import roc_auc_score

y_true = train['outcome']
naive_pred = pd.DataFrame(columns=['bidder_id', 'prediction'])
naive_pred['bidder_id'] = train['bidder_id']
naive_pred['prediction'] = np.where(naive_pred['bidder_id'].isin(train_pred), 1.0, 0.0)
basic_pred = naive_pred['prediction']
print(roc_auc_score(y_true, basic_pred))
0.54661464952
After the counting statistics above, the one non-categorical feature, the timestamp, is still untouched. Exploration also suggested expanding the merchandise and country categoricals into multiple per-category features, and the analysis above suggests further statistics grouped by bidder and auction:
Process the timestamps
Expand merchandise and country into multiple per-category statistics
Group by bidder and auction for further statistics
The focus is on the time intervals between bids: within each auction, measure the gap between consecutive bids placed by the various users in the bid log.
Then, for each user, compute over their gaps:
the mean interval
the maximum interval
the minimum interval
from collections import defaultdict

def generate_timediff():
    bds = defaultdict(list)
    for _, bids_auc in bids.groupby('auction'):
        # reset per auction so gaps never cross auction boundaries,
        # and walk each auction's bids in time order
        last_row = None
        for _, row in bids_auc.sort_values('time').iterrows():
            if last_row is not None:
                bds[row['bidder_id']].append(row['time'] - last_row['time'])
            last_row = row
    df = []
    for key in bds.keys():
        df.append({'bidder_id': key, 'mean': np.mean(bds[key]),
                   'min': np.min(bds[key]), 'max': np.max(bds[key])})
    pd.DataFrame(df).to_csv('tdiff.csv', index=False)
generate_timediff()
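The row-by-row loop takes a while on 7.6 million bids; under the same definition (gap to the previous bid in the same auction, assigned to the later bidder), an assumed-equivalent vectorized sketch on made-up data:

```python
import pandas as pd

# Hypothetical mini bid log
demo_bids = pd.DataFrame({
    'bidder_id': ['a', 'b', 'a', 'b'],
    'auction':   ['x', 'x', 'x', 'y'],
    'time':      [1, 4, 9, 7],
})

# Gap between consecutive bids within each auction (NaN for an auction's first bid)
ordered = demo_bids.sort_values(['auction', 'time'])
ordered['tdiff'] = ordered.groupby('auction')['time'].diff()

# Per-bidder mean/min/max of the gaps; agg skips the NaN first-bid rows
tdiff_stats = ordered.groupby('bidder_id')['tdiff'].agg(['mean', 'min', 'max'])
```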
Since this post has hit the length limit, the rest continues in part two: Using Machine Learning to Identify Cheating Robot Bidders in Auctions (2).