This article builds a credit card anti-fraud prediction model by applying machine learning to historical credit card transaction data, in order to predict whether a customer's card has been used fraudulently.
Predicting credit card fraud is highly valuable for reducing losses for both customers and banks. The dataset for this project comes from Kaggle and contains transactions made with credit cards by European cardholders in September 2013. It covers transactions over two days, during which 492 of the 284,807 transactions are fraudulent. The dataset is highly imbalanced: the positive class (fraud) accounts for only 0.172% of all transactions. Deciding whether a cardholder's card has been compromised is a binary classification problem; candidate algorithms include logistic regression, SVM and random forests, as well as boosting-based ensembles such as XGBoost. This article uses logistic regression.
import numpy as np
import pandas as pd
import sklearn as skl
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import gridspec
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('../input/creditcard.csv')
data.info()
data.head()
The data has 31 columns and 284,807 rows: V1-V28 are anonymized numerical features, Class is the integer class label, and the Amount and Time columns are on a very different scale from the other features.
print('No Frauds', round(data['Class'].value_counts()[0]/len(data) * 100, 2), '% of the dataset')
print('Frauds', round(data['Class'].value_counts()[1]/len(data) * 100, 2), '% of the dataset')

colors = ["#0101DF", "#DF0101"]
sns.countplot(x='Class', data=data, palette=colors)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)
No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset
The data is clearly very imbalanced. Such imbalance is likely to make the model accurate when predicting '0' but inaccurate when predicting '1'.
As the plot above shows, Fraud and No Fraud are highly imbalanced. Modelling the data directly would cause the following problems:
1) Overfitting: the sample contains a large number of majority-class examples (No Fraud), so the model may learn mostly majority-class characteristics and misclassify the minority class.
2) Misleading feature associations: with imbalanced data it is easy to attribute importance to the wrong features.
Possible solutions:
1) Random undersampling
Select from the majority class (No Fraud) a number of samples equal to the number of Fraud samples, giving a new balanced sample in which Fraud and No Fraud each make up 50%.
2) Oversampling
Use the SMOTE algorithm to create synthetic points from the minority class (Fraud) until the minority and majority classes are balanced. Unlike random undersampling, no rows are removed; and precisely because no rows are removed, training takes longer (a rough sketch of the interpolation SMOTE performs follows below).
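A minimal sketch of that interpolation step (smote_point is a hypothetical helper, not the imblearn implementation): a minority-class sample is moved a random fraction of the way toward one of its minority-class nearest neighbours to create a new synthetic point.

import numpy as np

def smote_point(x_i, x_neighbor, rng=np.random):
    # lam in [0, 1] places the synthetic point somewhere on the segment
    # between a minority sample and one of its minority-class neighbours
    lam = rng.uniform(0, 1)
    return x_i + lam * (x_neighbor - x_i)

# Hypothetical 2-D minority-class points, purely for illustration
x_i = np.array([1.0, 2.0])
x_nn = np.array([2.0, 3.0])
print(smote_point(x_i, x_nn))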
Note that although we transform the data when applying random undersampling or oversampling, the model must still be tested on the original test set, not on the undersampled or oversampled data. Resampling only helps us fit a better model, which ultimately has to serve the original dataset. Below, the data is processed and modelled with undersampling and oversampling in turn, and the resulting models are then evaluated against the original data.
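To make this order of operations concrete, here is a minimal sketch (assuming imblearn is available; the variable names are illustrative): the split happens first, resampling touches only the training portion, and evaluation uses the untouched test split.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Split first, so the test set is never touched by resampling
X_all = data.drop('Class', axis=1)
y_all = data['Class']
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=0)

# Resample only the training portion
X_tr_res, y_tr_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Fit on the resampled training data, evaluate on the original test data
clf = LogisticRegression(max_iter=1000).fit(X_tr_res, y_tr_res)
print(clf.score(X_te, y_te))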
We also notice that the Amount column has much larger values than the other features, which violates the assumption that features are on comparable scales, so Amount needs to be rescaled (it is standardized below), otherwise it can distort the later correlation analysis.
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 4))
bins = 30

ax1.hist(data[data.Class == 1]['Amount'], bins=bins)
ax1.set_title('Fraud')

ax2.hist(data[data.Class == 0]['Amount'], bins=bins)
ax2.set_title('Normal')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()
Fraudulent transactions involve smaller amounts than those of normal card users, suggesting that fraudsters tend to choose small purchases to avoid drawing the cardholder's attention.
fig, ax = plt.subplots(1, 2, figsize=(18, 4))

amount_val = data['Amount'].values
time_val = data['Time'].values

sns.distplot(amount_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])

sns.distplot(time_val, ax=ax[1], color='b')
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])
plt.show()
plt.figure(figsize=(12, 28*4))
v_features = data.iloc[:, 1:29].columns
gs = gridspec.GridSpec(28, 1)
for i, cn in enumerate(data[v_features]):
    ax = plt.subplot(gs[i])
    sns.distplot(data[data.Class == 1][cn], bins=50)
    sns.distplot(data[data.Class == 0][cn], bins=50)
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(cn))
plt.show()
Standardize the Amount column so it is on a comparable scale with the other features:
from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
Next comes the resampling step, either random undersampling or oversampling; here we start with undersampling, which gives us a sample with balanced positive and negative classes.
X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']

number_records_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

normal_indices = data[data.Class == 0].index
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)

under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.iloc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(24, 20))

# Entire DataFrame
corr = data.corr()
sns.heatmap(corr, cmap='coolwarm_r', annot_kws={'size': 20}, ax=ax1)
ax1.set_title("Imbalanced Correlation Matrix \n (don't use for reference)", fontsize=14)

sub_sample_corr = under_sample_data.corr()
sns.heatmap(sub_sample_corr, cmap='coolwarm_r', annot_kws={'size': 20}, ax=ax2)
ax2.set_title('SubSample Correlation Matrix \n (use for reference)', fontsize=14)
plt.show()
As the two heatmaps show, failing to handle the imbalance first can lead to wrong conclusions about the correlations.
The heatmap of the balanced subsample shows that V14, V12 and V10 are negatively correlated with Class: the lower these values, the more likely the transaction is fraudulent. V4, V11 and V19 are positively correlated: the higher these values, the more likely the transaction is fraudulent.
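These observations can be verified directly from the correlation matrix of the balanced subsample, for example (a small check, not part of the original code):

# Correlation of each feature with Class on the balanced subsample,
# sorted from most negative to most positive
class_corr = under_sample_data.corr()['Class'].drop('Class').sort_values()
print(class_corr.head(5))   # strongest negative correlations
print(class_corr.tail(5))   # strongest positive correlations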
Boxplots of the positively and negatively correlated features listed above:
f, axes = plt.subplots(ncols=4, figsize=(20, 4))

# Positive correlations (The higher the feature the probability increases that it will be a fraud transaction)
sns.boxplot(x="Class", y="V11", data=under_sample_data, palette=colors, ax=axes[0])
axes[0].set_title('V11 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V4", data=under_sample_data, palette=colors, ax=axes[1])
axes[1].set_title('V4 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V2", data=under_sample_data, palette=colors, ax=axes[2])
axes[2].set_title('V2 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V19", data=under_sample_data, palette=colors, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')
plt.show()
f, axes = plt.subplots(ncols=3, figsize=(20, 4))

# Negative correlations (the lower the feature value, the more likely it is a fraud transaction)
sns.boxplot(x="Class", y="V14", data=under_sample_data, palette=colors, ax=axes[0])
axes[0].set_title('V14 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V12", data=under_sample_data, palette=colors, ax=axes[1])
axes[1].set_title('V12 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V10", data=under_sample_data, palette=colors, ax=axes[2])
axes[2].set_title('V10 vs Class Negative Correlation')
plt.show()
Before removing outliers we visualize the features we are going to use. V14, V12 and V10 are the negatively correlated features identified above; comparing their distributions, V14 is the only one with a roughly Gaussian shape. Thresholds are then determined from the quartiles, and observations outside the thresholds are removed.
from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

v14_fraud_dist = under_sample_data['V14'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v14_fraud_dist, ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)

v12_fraud_dist = under_sample_data['V12'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v12_fraud_dist, ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14)

v10_fraud_dist = under_sample_data['V10'].loc[under_sample_data['Class'] == 1].values
sns.distplot(v10_fraud_dist, ax=ax3, fit=norm, color='#C5B3F9')
ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14)
plt.show()
The distribution plots of V14, V12 and V10 confirm that V14 is the one closest to a normal distribution.
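The cutoff applied to each of these features below is the usual 1.5*IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers. As a minimal sketch (iqr_bounds is a hypothetical helper; the full per-feature code follows):

def iqr_bounds(values, k=1.5):
    # Quartiles and interquartile range of the fraud-class values for one feature
    q25, q75 = np.percentile(values, 25), np.percentile(values, 75)
    iqr = q75 - q25
    # Anything outside [q25 - k*iqr, q75 + k*iqr] is flagged as an outlier
    return q25 - k * iqr, q75 + k * iqr

lower, upper = iqr_bounds(under_sample_data['V14'].loc[under_sample_data['Class'] == 1].values)
print(lower, upper)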
# -----> V14 Removing Outliers (Highest Negative Correlated with Labels)
v14_fraud = under_sample_data['V14'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
print('Quartile 25: {} | Quartile 75: {}'.format(q25, q75))
v14_iqr = q75 - q25
print('iqr: {}'.format(v14_iqr))

v14_cut_off = v14_iqr * 1.5
v14_lower, v14_upper = q25 - v14_cut_off, q75 + v14_cut_off
print('Cut Off: {}'.format(v14_cut_off))
print('V14 Lower: {}'.format(v14_lower))
print('V14 Upper: {}'.format(v14_upper))

outliers = [x for x in v14_fraud if x < v14_lower or x > v14_upper]
print('Feature V14 Outliers for Fraud Cases: {}'.format(len(outliers)))
print('V14 outliers:{}'.format(outliers))

under_sample_data = under_sample_data.drop(under_sample_data[(under_sample_data['V14'] > v14_upper) | (under_sample_data['V14'] < v14_lower)].index)
print('----' * 44)

# -----> V12 removing outliers from fraud transactions
v12_fraud = under_sample_data['V12'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v12_fraud, 25), np.percentile(v12_fraud, 75)
v12_iqr = q75 - q25
v12_cut_off = v12_iqr * 1.5
v12_lower, v12_upper = q25 - v12_cut_off, q75 + v12_cut_off
print('V12 Lower: {}'.format(v12_lower))
print('V12 Upper: {}'.format(v12_upper))

outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print('V12 outliers: {}'.format(outliers))
print('Feature V12 Outliers for Fraud Cases: {}'.format(len(outliers)))

under_sample_data = under_sample_data.drop(under_sample_data[(under_sample_data['V12'] > v12_upper) | (under_sample_data['V12'] < v12_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(under_sample_data)))
print('----' * 44)

# Removing outliers V10 Feature
v10_fraud = under_sample_data['V10'].loc[under_sample_data['Class'] == 1].values
q25, q75 = np.percentile(v10_fraud, 25), np.percentile(v10_fraud, 75)
v10_iqr = q75 - q25
v10_cut_off = v10_iqr * 1.5
v10_lower, v10_upper = q25 - v10_cut_off, q75 + v10_cut_off
print('V10 Lower: {}'.format(v10_lower))
print('V10 Upper: {}'.format(v10_upper))

outliers = [x for x in v10_fraud if x < v10_lower or x > v10_upper]
print('V10 outliers: {}'.format(outliers))
print('Feature V10 Outliers for Fraud Cases: {}'.format(len(outliers)))

under_sample_data = under_sample_data.drop(under_sample_data[(under_sample_data['V10'] > v10_upper) | (under_sample_data['V10'] < v10_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(under_sample_data)))
After removing the outliers, the result is shown below:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 6))

colors = ['#B3F9C5', '#f9c5b3']

sns.boxplot(x="Class", y="V14", data=under_sample_data, ax=ax1, palette=colors)
ax1.set_title("V14 Feature \n Reduction of outliers", fontsize=14)
ax1.annotate('Fewer extreme \n outliers', xy=(0.98, -17.5), xytext=(0, -12),
             arrowprops=dict(facecolor='black'), fontsize=14)

sns.boxplot(x="Class", y="V12", data=under_sample_data, ax=ax2, palette=colors)
ax2.set_title("V12 Feature \n Reduction of outliers", fontsize=14)
ax2.annotate('Fewer extreme \n outliers', xy=(0.98, -17.3), xytext=(0, -12),
             arrowprops=dict(facecolor='black'), fontsize=14)

sns.boxplot(x="Class", y="V10", data=under_sample_data, ax=ax3, palette=colors)
ax3.set_title("V10 Feature \n Reduction of outliers", fontsize=14)
ax3.annotate('Fewer extreme \n outliers', xy=(0.95, -16.5), xytext=(0, -12),
             arrowprops=dict(facecolor='black'), fontsize=14)
plt.show()
Dimensionality reduction compresses a high-dimensional dataset into a small number of components while keeping the informative structure; here PCA and TruncatedSVD are used to project the balanced subsample into two dimensions.
import time
from sklearn.decomposition import PCA, TruncatedSVD

# X/y are taken from the random undersample data (fewer instances)
X = under_sample_data.drop('Class', axis=1)
y = under_sample_data['Class']

# PCA Implementation
t0 = time.time()
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)
t1 = time.time()
print("PCA took {:.2} s".format(t1 - t0))

# TruncatedSVD
t0 = time.time()
X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)
t1 = time.time()
print("Truncated SVD took {:.2} s".format(t1 - t0))
import matplotlib.patches as mpatches

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(24, 6))
f.suptitle('Clusters using Dimensionality Reduction', fontsize=14)

blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')

# PCA scatter plot
ax1.scatter(X_reduced_pca[:, 0], X_reduced_pca[:, 1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax1.scatter(X_reduced_pca[:, 0], X_reduced_pca[:, 1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax1.set_title('PCA', fontsize=14)
ax1.grid(True)
ax1.legend(handles=[blue_patch, red_patch])

# TruncatedSVD scatter plot
ax2.scatter(X_reduced_svd[:, 0], X_reduced_svd[:, 1], c=(y == 0), cmap='coolwarm', label='No Fraud', linewidths=2)
ax2.scatter(X_reduced_svd[:, 0], X_reduced_svd[:, 1], c=(y == 1), cmap='coolwarm', label='Fraud', linewidths=2)
ax2.set_title('Truncated SVD', fontsize=14)
ax2.grid(True)
ax2.legend(handles=[blue_patch, red_patch])
plt.show()
Use K-fold cross-validation to build the model on the undersampled training data, evaluating with the recall score:
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    j = 0
    for c_param in c_param_range:
        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
import itertools

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
Tune the decision threshold of the logistic regression model and pick the best value:
lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
plt.figure(figsize=(10, 10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold >= %s' % i)
Next, analyze the model's ROC curve: the model trained on the undersampled data is evaluated on the original test set (confusion matrix) and on the undersampled test set under_sample (ROC curve). The results are as follows.
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix on the original test set
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

# ROC curve on the undersampled test set
y_pred_undersample_score = lr.fit(X_train_undersample, y_train_undersample.values.ravel()).decision_function(X_test_undersample.values)
fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(), y_pred_undersample_score)
roc_auc = auc(fpr, tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.0])
plt.ylim([-0.1, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Now apply the same logistic regression procedure to the original (imbalanced) training data and evaluate it:
best_c = printing_Kfold_scores(X_train, y_train)

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)

cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
y_pred_score = lr.fit(X_train, y_train.values.ravel()).decision_function(X_test.values)
fpr, tpr, thresholds = roc_curve(y_test.values.ravel(), y_pred_score)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.0])
plt.ylim([-0.1, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Training directly on the imbalanced data does not achieve a good recall, i.e. the model does not identify fraudulent transactions well. Next we apply the oversampling approach described earlier.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('../input/creditcard.csv')
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Amount', 'Time'], axis=1)
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, recall_score, classification_report

def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    j = 0
    for c_param in c_param_range:
        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    return best_c

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
columns = data.columns
# Remove the label column 'Class' to obtain the feature columns
features_columns = columns.drop('Class')

features = data[features_columns]
labels = data['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.3, random_state=0)

oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()