[Machine Learning in Practice] The Breast-Cancer Dataset

Date: 2019-12-07


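In this post we take one dataset from the UCI repository, the Breast-Cancer dataset, and build a classifier for it using existing machine learning methods.

Code for this post: link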

Dataset overview

The dataset is available at: link

On that page, Data Set Description leads to the documentation for the data, and Data Folder leads to the download location.

The files we use here are:

breast-cancer-wisconsin.data
breast-cancer-wisconsin.names

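Of these two files, the first is our data file and the second is the documentation describing the data.

With a dataset like this, we should start by reading the documentation to get a basic understanding of the data.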

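Preprocessing the data

From the documentation we learn that the file has 11 columns: column 1 is the ID number, columns 2-10 are the features, and column 11 is the label (2 for benign, 4 for malignant). The documentation describes each feature, but we do not need to worry about their medical meaning here. The relevant part of the documentation reads: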

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

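The documentation also tells us a few other things:

The dataset contains 699 records
The dataset has 16 missing values, each marked with "?"
The dataset contains 458 benign records and 241 malignant records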
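
Before trusting these numbers, it can be worth checking them against the file itself. The following is a minimal sketch using only the standard library. The four rows in SAMPLE are made up for illustration (they only mimic the 11-column layout), and summarize is a hypothetical helper name, not something from the original post.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same 11-column layout as the data file (not real records).
SAMPLE = """1001,5,1,1,1,2,1,3,1,1,2
1002,8,4,5,1,2,?,7,3,1,4
1003,8,10,10,8,7,10,9,7,1,4
1004,1,1,1,1,2,?,2,1,1,2
"""

def summarize(lines):
    """Return (row count, number of '?' fields, class label counts)."""
    rows = list(csv.reader(lines))
    missing = sum(field == "?" for row in rows for field in row)
    classes = Counter(row[-1] for row in rows)  # the label is the last column
    return len(rows), missing, dict(classes)

n_rows, n_missing, class_counts = summarize(io.StringIO(SAMPLE))
print(n_rows, n_missing, class_counts)
```

To run the same check on the real file, pass an open file handle instead of the StringIO object. If the documentation is accurate, the result should be 699 rows, 16 missing fields, and 458 / 241 for classes 2 and 4.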

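Handling missing values and splitting the dataset

Because very little data is missing (16 rows), for now we simply discard the rows that contain "?". Together with reading the data and adding the column headers, the code is as follows: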

# import the packages
import numpy as np
import pandas as pd

DATA_PATH = "breast-cancer-wisconsin.data"

# create the column names
columnNames = [
    'Sample code number',
    'Clump Thickness',
    'Uniformity of Cell Size',
    'Uniformity of Cell Shape',
    'Marginal Adhesion',
    'Single Epithelial Cell Size',
    'Bare Nuclei',
    'Bland Chromatin',
    'Normal Nucleoli',
    'Mitoses',
    'Class'
]

data = pd.read_csv(DATA_PATH, names = columnNames)
# show the shape of data
print(data.shape)

# use standard missing value to replace "?"
data = data.replace(to_replace = "?", value = np.nan)
# then drop the missing value
data = data.dropna(how = 'any')

print(data.shape)

The output is:

(699, 11)
(683, 11)

As we can see, every row that contained a missing value has now been dropped.
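
One detail worth knowing: a column that contained "?" is read as strings, and replacing "?" with NaN afterwards does not by itself turn the remaining values into numbers. A possible alternative, sketched below, is to tell read_csv about the marker through its na_values parameter so the column is parsed as numeric from the start. The three rows are made up for illustration, and the variable names are our own.

```python
import io
import pandas as pd

column_names = [
    'Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
    'Uniformity of Cell Shape', 'Marginal Adhesion',
    'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
    'Normal Nucleoli', 'Mitoses', 'Class',
]

# Made-up rows in the file's layout; the second one has a missing value.
SAMPLE = """1001,5,1,1,1,2,1,3,1,1,2
1002,8,4,5,1,2,?,7,3,1,4
1003,8,10,10,8,7,10,9,7,1,4
"""

# na_values="?" makes pandas treat "?" as missing while parsing.
data = pd.read_csv(io.StringIO(SAMPLE), names=column_names, na_values="?")
n_missing = int(data['Bare Nuclei'].isna().sum())

# Drop incomplete rows, then restore an integer dtype for the affected column.
clean = data.dropna(how='any').astype({'Bare Nuclei': int})
print(n_missing, clean.shape, clean['Bare Nuclei'].dtype)
```

With the real file, the same call with DATA_PATH in place of the StringIO object should again leave 683 rows.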

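We can access a specific column with an expression such as data['Class'].

Next we divide the data into two parts, a training set and a test set, using train_test_split, which handles the random split for us (see the function's documentation).

The split looks like this: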

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[ columnNames[1:10] ], # features
    data[ columnNames[10]   ], # labels
    test_size = 0.25,
    random_state = 33
)

The resulting variables are:

X_train: features of the training set
X_test: features of the test set
y_train: labels of the training set
y_test: labels of the test set
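
To make the idea of a random hold-out split concrete, here is a minimal standard-library sketch. It is not scikit-learn's implementation, and holdout_split is a hypothetical helper: it shuffles the row indices with a fixed seed and sets aside a fraction for testing, rounding the test size up. The shuffled order will differ from what train_test_split produces with random_state = 33; only the sizes are comparable.

```python
import math
import random

def holdout_split(n_samples, test_size, seed):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = math.ceil(n_samples * test_size)  # round the test size up
    return indices[n_test:], indices[:n_test]

# 683 rows remain after dropping the incomplete ones.
train_idx, test_idx = holdout_split(683, 0.25, seed=33)
print(len(train_idx), len(test_idx))
```

With 683 rows and a test fraction of 0.25 this gives 512 training rows and 171 test rows, and no row appears in both parts.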

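Because this is supervised learning, every sample has a label, and we assume the labels are 100% accurate.

Applying machine learning models

Before applying a model, each feature should be rescaled to zero mean and unit variance, so that the trained model is not dominated by features with large values.

Here we use StandardScaler from scikit-learn (see its documentation).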

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train = ss.fit_transform(X_train) # fit_transform for train data
X_test = ss.transform(X_test)
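
The reason for calling fit_transform on the training data but only transform on the test data is that the mean and standard deviation must come from the training set alone. The sketch below spells out that arithmetic for a single feature using the standard library. The numbers are made up, fit_scaler and transform are hypothetical helper names, and we use the population standard deviation, which to our knowledge is what StandardScaler does as well.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column) or 1.0  # avoid dividing by zero
    return mean, std

def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]

train = [1, 3, 5, 10, 1]  # made-up training values of one feature
test = [4, 8]             # made-up test values of the same feature

mean, std = fit_scaler(train)            # statistics from training data only
train_z = transform(train, mean, std)
test_z = transform(test, mean, std)      # reuse the training statistics
print(mean, round(std, 4), test_z)
```

After the transformation the training column has mean 0 and standard deviation 1. The test column generally does not, which is expected.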

Next we build the machine learning models. Here we use logistic regression and a linear SVM:

# use logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_y = lr.predict(X_test)

# use svm
from sklearn.svm import LinearSVC
lsvc = LinearSVC()
lsvc.fit(X_train, y_train)
svm_y = lsvc.predict(X_test)
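
Both models are linear: each computes a weighted sum of the standardized features and makes its decision from that score. As a rough illustration of the logistic regression side, the sketch below turns such a score into a probability with the sigmoid function and then into a label. The weights, bias, and two-feature inputs are hypothetical values chosen for the example, not parameters learned from this dataset, and the helper names are our own.

```python
import math

def predict_proba_malignant(x, weights, bias):
    """Sigmoid of the linear score w.x + b, read as P(malignant)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def predict_label(x, weights, bias, threshold=0.5):
    """Return 4 (malignant) at or above the threshold, else 2 (benign)."""
    return 4 if predict_proba_malignant(x, weights, bias) >= threshold else 2

# Hypothetical parameters for two standardized features.
weights = [1.2, 0.8]
bias = -0.5

high = [1.5, 2.0]    # well above the training mean on both features
low = [-1.0, -0.5]   # below the training mean on both features
print(predict_label(high, weights, bias), predict_label(low, weights, bias))
```

In the fitted lr object from the post, the learned counterparts of these values are available as lr.coef_ and lr.intercept_.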

Evaluating the classifiers

First we print the accuracy using each classifier's built-in .score method:

# now we will check the performance of the classifiers
from sklearn.metrics import classification_report
# `.score` reports the accuracy on the given data;
# classification_report gives a more detailed breakdown
print('Accuracy of the LogisticRegression: ', lr.score(X_test, y_test))
# print('Accuracy on the train dataset: ', lr.score(X_train, y_train))
# print('Accuracy on the predict result (should be 1.0): ', lr.score(X_test, lr_y))
print('Accuracy of the SVM: ', lsvc.score(X_test, y_test))

The output is:

Accuracy of the LogisticRegression:  0.953216374269
Accuracy of the SVM:  0.959064327485
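
Accuracy is simply the fraction of test samples predicted correctly. With 171 test samples, the two scores above correspond to 163 and 164 correct predictions respectively. This is our own arithmetic from the printed numbers, not something the post reports directly. A minimal sketch, where accuracy is a hypothetical helper:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Tiny made-up example: 3 of 4 predictions are right.
demo = accuracy([2, 4, 4, 2], [2, 4, 2, 2])

# The reported scores expressed as fractions of the 171 test samples.
lr_acc = 163 / 171
svm_acc = 164 / 171
print(demo, lr_acc, svm_acc)
```

The difference between the two models is therefore a single test sample, which is too small a gap to conclude that one is better than the other.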

In addition, we can use classification_report to see more detailed performance results for a classifier:

print(classification_report(y_test, svm_y, target_names = ['Benign', 'Malignant']))

The result is:

             precision    recall  f1-score   support

     Benign       0.96      0.98      0.97       111
  Malignant       0.96      0.92      0.94        60

avg / total       0.96      0.96      0.96       171
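
To see where these numbers come from, the sketch below recomputes precision, recall, and F1 from raw counts. The counts are an assumption: the post does not print a confusion matrix, so we inferred them as the only whole numbers consistent with the rounded report (109 of 111 benign and 55 of 60 malignant samples classified correctly). prf is a hypothetical helper name.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = tp / (tp + fp)  # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)     # of everything truly in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts (assumption): 2 benign samples were predicted malignant,
# and 5 malignant samples were predicted benign.
benign = prf(tp=109, fp=5, fn=2)
malignant = prf(tp=55, fp=2, fn=5)

for name, scores in (("Benign", benign), ("Malignant", malignant)):
    print(name, [round(s, 2) for s in scores])
```

Under these counts, the recall of 0.92 for the malignant class means that 5 of the 60 malignant test samples were missed. For a screening task, that is the figure to watch most closely.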