Machine Learning Project in Practice: Credit Card Fraud Detection (Part 1)

Date: 2019-11-08

Tags: machine learning, hands-on project, credit card fraud detection


1. Task Background

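The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds among 284,807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for only 0.172% of all transactions. For confidentiality reasons, the original features and further background information about the data are not provided. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. 'Time' holds the number of seconds elapsed between each transaction and the first transaction in the dataset. 'Class' is the response variable: it takes the value 1 for fraud and 0 otherwise.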

The goal is to classify the transactions in the dataset as normal or fraudulent and to make predictions on the test data.

Dataset link: https://pan.baidu.com/s/1GTeCYPhDEan_8c5t7Si_qw (extraction code: b93f)

First, import the libraries we need.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Read the dataset file and look at its first 5 rows.

data = pd.read_csv("creditcard.csv")
data.head()

In the output above, the Class column is the label: 0 denotes a normal transaction and 1 denotes a fraudulent one.

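This is a fraud detection task on credit card data. The data contains both normal and problematic records, and in general the problematic records make up only a very small fraction.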

The bar chart below makes the difference in counts between normal and fraudulent records visible at a glance.

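# Series.value_counts() replaces the deprecated top-level pd.value_counts()
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind='bar')  # pandas can draw simple plots directly
# Bar chart of the class counts
plt.title("Fraud class histogram")
plt.xlabel("Class")
# Number of transactions in each class
plt.ylabel("Frequency")
plt.show()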

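The output shows about 280,000 normal samples (class 0). The fraudulent samples (class 1) are so few that their bar is hard to see in the chart, but they are there: only a few hundred.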
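Because the fraud bar is nearly invisible, it helps to look at the numbers directly. The following is a small sketch that uses only the counts quoted in the dataset description (284,807 transactions, 492 frauds); on the real data, `data['Class'].value_counts()` should report the same counts.

```python
# Class counts as stated in the dataset description above
n_total = 284807
n_fraud = 492
n_normal = n_total - n_fraud

fraud_fraction = n_fraud / n_total
print(f"normal: {n_normal}, fraud: {n_fraud}")
print(f"fraud share: {fraud_fraction:.3%}")
print(f"normal transactions per fraud: {n_normal / n_fraud:.0f}")
```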

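The values in the Amount column vary over a very wide range. In machine learning we want features to be on comparable scales, so Amount needs to be preprocessed by standardizing it.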

The Time column is of little use here, and Amount is replaced by its standardized version, so both columns are dropped.

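# Preprocessing: standardize the data
from sklearn.preprocessing import StandardScaler
# reshape(-1, 1): -1 lets NumPy infer the number of rows; StandardScaler expects a 2-D array.
# Unlike the original source code, .values is needed here to get a NumPy array before reshaping.
# Add the standardized amount as a new feature column
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1)).ravel()
data = data.drop(['Time', 'Amount'], axis=1)
data.head()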
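To make the standardization step concrete, here is a minimal sketch on a few made-up amounts (the values are hypothetical, not taken from the dataset). It checks that StandardScaler gives the same result as subtracting the mean and dividing by the standard deviation by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical amounts with one large value, standing in for the Amount column
amounts = np.array([1.0, 5.0, 20.0, 100.0, 2500.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(amounts)

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0) of the column
manual = (amounts - amounts.mean(axis=0)) / amounts.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, so it no longer dominates features with smaller ranges.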

2. Handling the Imbalanced Sample Distribution

As noted above, the numbers of normal and fraudulent records differ enormously. There are generally two strategies for this kind of class imbalance:

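(1) Undersampling: the counts above show about 280,000 samples of class 0 but only a few hundred of class 1. We reduce class 0 to a few hundred samples as well, so that both classes are equally small.
(2) Oversampling: class 0 has many samples and class 1 has few, so we generate additional class 1 samples until there are as many of them as class 0 samples.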
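As a rough illustration of the oversampling idea, the sketch below duplicates minority rows by sampling with replacement on a hypothetical toy DataFrame. This is the simplest variant only; methods that synthesize new samples, such as SMOTE, are another option, and this sketch does not claim to be the method used later in this series.

```python
import pandas as pd

# Hypothetical toy data: 8 normal rows (Class 0) and 2 fraud rows (Class 1)
toy = pd.DataFrame({
    "feature": range(10),
    "Class": [0] * 8 + [1] * 2,
})

normal = toy[toy.Class == 0]
fraud = toy[toy.Class == 1]

# Draw fraud rows with replacement until they match the normal count
fraud_oversampled = fraud.sample(n=len(normal), replace=True, random_state=0)

over_sample_data = pd.concat([normal, fraud_oversampled], ignore_index=True)
print(over_sample_data.Class.value_counts())
```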

We start with undersampling.

# loc indexes by label; iloc indexes by integer position
# ix accepted both labels and positions, but it has been removed from pandas

# X = data.ix[:, data.columns != 'Class']
# # print(X)
# y = data.ix[:, data.columns == 'Class']

X = data.iloc[:, data.columns != 'Class']  # feature columns
# print(X)
y = data.iloc[:, data.columns == 'Class']  # label column

# Number of data points in the minority (fraud) class, and their index labels
number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Index labels of the normal class
normal_indices = data[data.Class == 0].index

# From the normal indices, randomly select number_records_fraud of them.
# replace=False samples without replacement, so no index is picked twice.
random_normal_indices = np.random.choice(normal_indices,
                                         number_records_fraud,
                                         replace=False)
random_normal_indices = np.array(random_normal_indices)

# Concatenate the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Undersampled dataset. The indices are index labels, so select with loc
# (iloc returns the same rows only while the index is the default 0..n-1 range).
under_sample_data = data.loc[under_sample_indices, :]

X_undersample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']

# Show the class ratios in the undersampled data
print(
    "Percentage of normal transactions:",
    len(under_sample_data[under_sample_data.Class == 0]) /
    len(under_sample_data))
print(
    "Percentage of fraud transactions:",
    len(under_sample_data[under_sample_data.Class == 1]) /
    len(under_sample_data))
print("Total number of transactions in resampled data:",
      len(under_sample_data))

Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
Total number of transactions in resampled data: 984

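After undersampling, normal and fraudulent records each make up 50% of the data, and the total sample count is only a small fraction of the original.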

Next, split both the original dataset and the undersampled dataset into training and test sets.

# With newer scikit-learn versions, the old import fails:
# from sklearn.cross_validation import train_test_split
# ModuleNotFoundError: No module named 'sklearn.cross_validation'
# That module was removed in newer versions; import from model_selection instead
from sklearn.model_selection import train_test_split

# Whole dataset. test_size is the fraction of the data held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=0)

print("Number transactions train dataset:", len(X_train))
print("Number transactions test dataset:", len(X_test))
print("Total number of transactions:", len(X_train) + len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample, y_undersample, test_size=0.3, random_state=0)

print("")
print("Number transactions train dataset:", len(X_train_undersample))
print("Number transactions test dataset:", len(X_test_undersample))
print("Total number of transactions:", len(X_train_undersample) + len(X_test_undersample))

Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807

Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984
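With so few fraud cases in the full dataset, a purely random split can leave the train and test sets with slightly different fraud ratios. One option, not used in the code above, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses hypothetical labels (`X_toy`, `y_toy`) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1000 samples, 2% positives
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the 2% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives:", y_te.sum(), "of", len(y_te))
```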

3. Model Evaluation

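Suppose we have data on 1,000 patients, of whom 990 do not have cancer and 10 do. The most common evaluation metric is accuracy: how often the predicted value agrees with the true value. Write the true values as y and the predicted values as y1. Take 10 samples, numbered 1 to 10, each with a true value and a predicted value that is either 0 or 1. For each sample we check whether the two agree: if the prediction for sample 1 equals its true value we count a match, otherwise a mismatch. If 8 of the 10 samples match, the accuracy is 8/10 = 80%.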

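Now take the 990 healthy patients and the 10 cancer patients and build a model that assigns every patient to the same class, predicting that nobody has cancer. Fed all 1,000 samples, its accuracy is 990/1000 = 99%. Yet it predicts every sample as the majority class and never identifies a single cancer case. A hospital wants to detect cancer, but this model finds 0 cases, so despite its 99% accuracy it is useless. Keep this in mind when modeling: building a model is easy; the hard part is deciding how to evaluate it.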
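The 99% figure is easy to reproduce. The sketch below builds the 990/10 labels from the example and a "model" that always predicts healthy, then computes accuracy and recall with scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 990 healthy patients (0) and 10 cancer patients (1), as in the example
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that predicts healthy for everyone
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print("accuracy:", accuracy)
print("recall:", recall)
```

Accuracy comes out at 0.99 while recall is 0, which is exactly the gap the paragraph describes.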

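So accuracy can be misleading, especially when the classes are imbalanced. This brings us to recall. Recall takes values between 0 and 1. Our goal is to find the 10 cancer patients, so the metric follows from the goal: of the 10 cancer patients, how many does the model detect? If it detects 0, recall is 0/10 = 0. If it detects 2, recall is 2/10 = 20%. Recall is a more meaningful measure of the model here. Building a model largely comes down to choosing parameters, and to define recall precisely we need four terms that come up often in statistics:

TP (true positive): a positive sample correctly predicted as positive
FP (false positive): a negative sample incorrectly predicted as positive
FN (false negative): a positive sample incorrectly predicted as negative
TN (true negative): a negative sample correctly predicted as negative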

# Recall = TP / (TP + FN)
from sklearn.linear_model import LogisticRegression  # logistic regression model
# from sklearn.cross_validation import KFold, cross_val_score  # no longer available in newer versions
from sklearn.model_selection import KFold, cross_val_score  # KFold splits the data into k folds for cross-validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report
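As a quick check of the formula Recall = TP / (TP + FN), the sketch below reads the four counts from a confusion matrix on a handful of hypothetical labels and compares the hand-computed recall with `recall_score`.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = fraud, 0 = normal
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("recall:", recall_manual)
```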

4. Regularization Penalty

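Suppose model A has weight parameters θ1, θ2, θ3, ..., θ10 and model B also has weight parameters θ1, θ2, θ3, ..., θ10, and both models reach a recall of 90%. If the recall is the same, can we simply pick either one?
Suppose, however, that model A's parameter values vary widely:

[Screenshot: parameter values of model A]

Model B's parameter values vary much less:

[Screenshot: parameter values of model B]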

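Although both models have a recall of 90%, model A's parameters range too widely. We want a more stable model, one that fits not only the training data but also, as far as possible, the test data. We therefore prefer smaller variation in the parameters, since this reduces the risk of overfitting.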

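Overfitting means that the model performs well on the training set but poorly on the test set. It is a very common phenomenon, and it is often associated with large swings in the weight parameters, so we would like to end up with model B. How do we get it? This is where regularization comes in: it penalizes the model parameters θ, whose values are sometimes spread widely and sometimes narrowly. We want to penalize model A heavily and model B lightly. Regularization quantifies this preference for a simpler model by changing the loss function to:

[Formula image: loss function with an added L1 penalty term]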

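Here C₀ is the loss function before the regularization penalty is added, C is the new loss function with the penalty, and w denotes the weight parameters. The formula above is L1 regularization. For model A, whose w values vary widely, the sum of |w| is large, so the value of the objective is larger too. The parameter λ controls how strongly the weights are penalized. A second variant, L2 regularization, penalizes the squared weights instead.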
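Written out with the symbols defined above, a common form of the two penalties is the following. The exact scaling used in the original figures is not known; some texts divide λ by the number of samples, or use λ/2 in the L2 case.

```latex
\begin{aligned}
\text{L1:}\quad C &= C_0 + \lambda \sum_{w} |w| \\
\text{L2:}\quad C &= C_0 + \lambda \sum_{w} w^{2}
\end{aligned}
```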

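The main question, then, is how strong the penalty should be. It could be set to 0.1 for a light penalty, or to 1, or to 10. Which strength works best? We cannot know in advance; we have to use cross-validation to see which value performs better. In the code below, c_param_range = [0.01, 0.1, 1, 10, 100] lists the candidate values. Note that scikit-learn's C parameter is the inverse of the regularization strength λ discussed above, so a smaller C means a stronger penalty. Each of these 5 values is tried in turn.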
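To see how C affects the weights, the sketch below fits L1-penalized logistic regression on synthetic data (generated with `make_classification`, not the credit card data) for each candidate C and reports the sum of absolute weights and the number of weights driven to zero. Smaller C, meaning a stronger penalty, should shrink the weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)

l1_norms = {}
for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c, penalty='l1', solver='liblinear',
                               max_iter=10000)
    model.fit(X_demo, y_demo)
    # Sum of absolute weights: the quantity the L1 penalty acts on
    l1_norms[c] = np.abs(model.coef_).sum()
    n_zero = int((model.coef_ == 0).sum())
    print(f"C={c}: sum|w| = {l1_norms[c]:.3f}, zero weights = {n_zero}")
```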

5. Cross-Validation

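Suppose we have a dataset called data. When building a machine learning model, we usually split the data first: for example, 80% as the training set and 20% as the test set. The 80% is used to build the model and the remaining 20% to test it. So the first step is to split the data into a training set and a test set; this step is mandatory. The second step is to split the training set into equal parts, say 3 parts: subsets 1, 2, and 3.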

Whatever model we build, it comes with parameters that have to be chosen. Is a larger value better, or a smaller one? Rules of thumb cannot answer this reliably. The way to settle it is cross-validation.

So what is cross-validation?

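Round 1: train the model on subsets 1 and 2, then use subset 3 to validate the model with the weights just learned. Subset 3 is the validation set, which is part of the training set; it tells us how good the model is.
Round 2: train the model on subsets 1 and 3, then validate it on subset 2.
Round 3: train the model on subsets 2 and 3, then validate it on subset 1.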

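Running only one round is risky. If we ran only round 1, subset 3 might happen to be easy, which would make the model look better than it is. Conversely, the data contains some erroneous values and outliers; if those end up in the validation set, the model will look worse than it is. We want neither an inflated nor a deflated estimate, so we run several rounds and average the results. Here subsets 1, 2, and 3 each serve as the validation set once and each yields a score; adding the three scores and dividing by 3 gives an overall estimate of the model's performance.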
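The three rounds can be made visible with `KFold` on six hypothetical samples. Note that `KFold` without shuffling holds out the first block first, so the order of the rounds differs from the description above, but each block still serves as the validation set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six hypothetical training samples, identified by their positions 0..5
samples = np.arange(6)

fold = KFold(n_splits=3, shuffle=False)
splits = list(fold.split(samples))
for round_no, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Round {round_no}: train on {train_idx}, validate on {val_idx}")
```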

def printing_Kfold_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)
    
    # Different C parameters
    c_param_range = [0.01,0.1,1,10,100]
    
    result_table = pd.DataFrame(index=range(len(c_param_range)),columns=['C_parameter','Mean recall score'])
    result_table['C_parameter'] = c_param_range
    
    # the k-fold will give 2 lists:train_indices=indices[0],test_indices = indices[1]
    j=0  # loop over the candidates to find the best C (penalty strength)
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter:',c_param)
        print('-------------------------------------------')
        print('')
        
        recall_accs = []
        for iteration,indices in enumerate(fold.split(x_train_data)):
            
            # Call the logistic regression model with a certain C parameter.
            # solver='liblinear' supports the L1 penalty and avoids the solver warning.
            # If the following warning appears, the model failed to converge:
            #  ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
            # In that case raise max_iter (the scikit-learn default is 100).
            lr = LogisticRegression(C = c_param, penalty='l1', solver='liblinear',max_iter=10000)
            # Use the training data to fit the model. In this case, we use the portion
            # of the fold to train the model with indices[0], We then predict on the
            # portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0],:],y_train_data.iloc[indices[0],:].values.ravel())
            
            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1],:])
            
            # Calculate the recall score and append it to a list for recall scores 
            # representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1],:].values,y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ',iteration,': recall score = ',recall_acc)
            
        # the mean value of those recall scores is the metric we want to save and get
        # hold of.
        result_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ',np.mean(recall_accs))
        print('')
        
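    # Note: the original source code lacks astype('float64') and raises an error here,
    # because the 'Mean recall score' column has object dtype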
    best_c = result_table.loc[result_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter',best_c)
    print('*********************************************************************************')
    
    return best_c

Call the function above on the undersampled dataset.

best_c = printing_Kfold_scores(X_train_undersample,y_train_undersample)

Output:

-------------------------------------------
C parameter: 0.01
-------------------------------------------

Iteration  0 : recall score =  0.958904109589041
Iteration  1 : recall score =  0.9178082191780822
Iteration  2 : recall score =  1.0
Iteration  3 : recall score =  0.9864864864864865
Iteration  4 : recall score =  0.9545454545454546

Mean recall score  0.9635488539598128

-------------------------------------------
C parameter: 0.1
-------------------------------------------

Iteration  0 : recall score =  0.8356164383561644
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9322033898305084
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.8939393939393939

Mean recall score  0.8941437733404299

-------------------------------------------
C parameter: 1
-------------------------------------------

Iteration  0 : recall score =  0.8493150684931506
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.9090909090909091

Mean recall score  0.9100832939235539

-------------------------------------------
C parameter: 10
-------------------------------------------

Iteration  0 : recall score =  0.863013698630137
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9324324324324325
Iteration  4 : recall score =  0.9242424242424242

Mean recall score  0.9131506202785514

-------------------------------------------
C parameter: 100
-------------------------------------------

Iteration  0 : recall score =  0.863013698630137
Iteration  1 : recall score =  0.863013698630137
Iteration  2 : recall score =  0.9830508474576272
Iteration  3 : recall score =  0.9459459459459459
Iteration  4 : recall score =  0.9242424242424242

Mean recall score  0.9158533229812542

*********************************************************************************
Best model to choose from cross validation is with C parameter 0.01
*********************************************************************************

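The results above show that recall is highest when C is 0.01, the smallest value tried, which in scikit-learn corresponds to the strongest regularization.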
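The same search can be written more compactly with `cross_val_score` and `scoring='recall'`. The sketch below is an alternative formulation under stated assumptions, not the code used for the results above: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `best_c_demo` are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, roughly balanced data standing in for the undersampled training set
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=0)

mean_recalls = {}
for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c, penalty='l1', solver='liblinear',
                               max_iter=10000)
    # scoring='recall' computes recall on each held-out fold
    scores = cross_val_score(model, X_demo, y_demo,
                             cv=KFold(5, shuffle=False), scoring='recall')
    mean_recalls[c] = scores.mean()
    print(f"C={c}: mean recall = {mean_recalls[c]:.4f}")

# Pick the C with the highest mean recall across the folds
best_c_demo = max(mean_recalls, key=mean_recalls.get)
print("Best C:", best_c_demo)
```

On the real data, passing `X_train_undersample` and `y_train_undersample.values.ravel()` in place of the synthetic arrays should give the same kind of per-C summary.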

To be continued...