Python Machine Learning (Basics: Supervised Learning, Linear Classifiers)

Classic Supervised Learning Models and Algorithms

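Supervised learning models focus on predicting the target/label of unseen samples from existing experience. Depending on the type of the target variable, supervised learning tasks fall broadly into two kinds: classification and regression. The basic workflow of a supervised learning task is as follows. First, prepare the training data, which can be text, images, audio, and so on. Next, extract the required features to form feature vectors, and feed these feature vectors together with their corresponding labels/targets (Labels) into a learning algorithm to train a predictive model. Then apply the same feature extraction method to the new test data to obtain the test feature vectors. Finally, use the predictive model to predict on these test feature vectors and obtain the results.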

1. Classification

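The most basic case is binary classification: a yes/no decision that selects one of two classes as the prediction. In multiclass classification, one class is selected from more than two. In multilabel classification, the task is to decide whether a sample belongs to several different classes at the same time.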
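To make the difference between the three problem types concrete, here is a small sketch of how the labels might be stored in each case. The label values and the helper `to_indicator_matrix` are hypothetical and invented for illustration only; they are not taken from any dataset or library.

```python
# Hypothetical toy labels, invented for illustration only.
y_binary = [0, 1, 1, 0]                        # exactly one of two classes per sample
y_multiclass = ["cat", "dog", "bird", "dog"]   # exactly one of more than two classes
y_multilabel = [{"news", "sports"}, {"tech"}, set(), {"news", "tech"}]  # any subset of classes


def to_indicator_matrix(label_sets):
    """Turn a list of label sets into a 0/1 matrix with one column per class."""
    classes = sorted(set().union(*label_sets))
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, rows


classes, indicator = to_indicator_matrix(y_multilabel)
print(classes)
for row in indicator:
    print(row)
```

In the multilabel case each column of the indicator matrix can be read as its own binary classification problem, which is one common way to reduce the task to the binary case.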

1.1 Linear classifiers

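Model introduction: a linear classifier is a model that assumes a linear relationship between the features and the classification result. It makes the class decision by summing the products of each feature dimension and its corresponding weight.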

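Let x=<x1,x2,...,xn> denote the n-dimensional feature column vector and let the n-dimensional column vector w=<w1,w2,...,wn> denote the corresponding weights. To avoid forcing the decision boundary through the origin, we also add an intercept b. The linear relationship can then be expressed as: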

f(w,x,b) = w^T x + b
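As a minimal sketch of this linear relationship, the function below computes w^T x + b in plain Python. The name `linear_score` and the numeric values are assumptions made up for this illustration; they are not learned weights.

```python
def linear_score(w, x, b):
    """Return f(w, x, b) = w^T x + b for two equal-length sequences w and x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical weights, features and intercept, chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [2.0, 1.0, 0.5]
b = 0.5

score = linear_score(w, x, b)
print(score)  # 0.5*2.0 + (-1.0)*1.0 + 2.0*0.5 + 0.5 = 1.5
```

The score can be any real number, which is why a further mapping is needed before it can be read as a class decision.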

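For the simple binary classification problem we want to handle, we would like f∈{0,1}, so we need a function that maps the original f∈R into (0,1). The logistic function does this: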

g(z) = 1/(1+e^{-z})

Substituting f for z gives the logistic regression model:

h_{w,b}(x) = g(f(w,x,b)) = 1/(1+e^{-f}) = 1/(1+e^{-(w^T x + b)})
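The model above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the weights are made up rather than trained, and the 0.5 decision threshold is a common convention that the text itself does not specify.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x, b):
    """Return h_{w,b}(x) = g(w^T x + b), a value between 0 and 1."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(f)


def predict_label(w, x, b, threshold=0.5):
    """Map the model output to a class in {0, 1}; the threshold is an assumption."""
    return 1 if predict_proba(w, x, b) >= threshold else 0


# Hypothetical parameters, not trained on any data.
w = [1.0, -1.0]
b = 0.0
print(predict_proba(w, [2.0, 1.0], b))
print(predict_label(w, [2.0, 1.0], b))
print(predict_label(w, [1.0, 2.0], b))
```

With this threshold, the predicted class is 1 exactly when w^T x + b is non-negative, so the decision boundary is still the linear one defined by f.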

Example 1: Benign/malignant breast cancer tumor prediction with a logistic regression classifier

Data description:

Number of Instances: 699 (as of 15 July 1992)

Number of Attributes: 10 plus the class attribute
Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain

   -- -----------------------------------------

   1. Sample code number            id number

   2. Clump Thickness               1 - 10

   3. Uniformity of Cell Size       1 - 10

   4. Uniformity of Cell Shape      1 - 10

   5. Marginal Adhesion             1 - 10

   6. Single Epithelial Cell Size   1 - 10

   7. Bare Nuclei                   1 - 10

   8. Bland Chromatin               1 - 10

   9. Normal Nucleoli               1 - 10

  10. Mitoses                       1 - 10

  11. Class:                        (2 for benign, 4 for malignant)

Missing attribute values: 16

   There are 16 instances in Groups 1 to 6 that contain a single missing

   (i.e., unavailable) attribute value, now denoted by "?".

Class distribution:

   Benign: 458 (65.5%)

   Malignant: 241 (34.5%)

#Step 1: Preprocess the benign/malignant breast cancer tumor data

#Import the pandas and numpy packages

import pandas as pd

import numpy as np

#Create the list of column names

column_names=['Sample code number','Clump Thickness','Uniformity of Cell Size',

              'Uniformity of Cell Shape','Marginal Adhesion',

              'Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin',

              'Normal Nucleoli','Mitoses','Class']

#Use pandas.read_csv to read the data from the web

data=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',names=column_names)

# print(data)#[699 rows x 11 columns]

# print(data[:5])

#Sample code number Clump Thickness Uniformity of Cell Size \

#0 1000025 5 1

#1 1002945 5 4

#2 1015425 3 1

#3 1016277 6 8

#4 1017023 4 1

#Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size \

#0 1 1 2

#1 4 5 7

#2 1 1 2

#3 8 1 3

#4 1 3 2

#Bare Nuclei Bland Chromatin Normal Nucleoli Mitoses Class

#0 1 3 1 1 2

#1 10 3 2 1 2

#2 2 3 1 1 2

#3 4 3 7 1 2

#4 1 3 1 1 2

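#Replace '?' with the standard missing-value marker np.nan

data=data.replace(to_replace='?',value=np.nan)

#Drop every row that has a missing value in any column

data=data.dropna(how='any')

#Check the shape of the remaining data

print(data.shape)#(683, 11)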

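#Step 2: Prepare the training and test data for benign/malignant tumor prediction

#Use train_test_split from sklearn.model_selection to split the data (the older sklearn.cross_validation module has been removed from scikit-learn)

from sklearn.model_selection import train_test_split

#Randomly sample 25% of the data for testing; the remaining 75% forms the training set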

X_train,X_test,y_train,y_test=train_test_split(data[column_names[1:10]],data[column_names[10]],test_size=0.25,random_state=33)

#Check the number of samples and the class distribution in the training and test sets

print(y_train.value_counts())

# 2 344

# 4 168

# Name: Class, dtype: int64

print(y_test.value_counts())

# 2 100

# 4 71

# Name: Class, dtype: int64

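#Step 3: Use linear classification models for benign/malignant tumor prediction

#Import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

#Import LogisticRegression and SGDClassifier from sklearn.linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

#Standardize the data so that each feature dimension has zero mean and unit variance; this keeps the predictions from being dominated by features with large values
#Fit the scaler on the training set only, then apply the same transformation to the test set
ss=StandardScaler()
X_train=ss.fit_transform(X_train)
X_test=ss.transform(X_test)

#Initialize LogisticRegression and SGDClassifier
lr=LogisticRegression()
sgdc=SGDClassifier()

#Call fit on LogisticRegression to train the model parameters
lr.fit(X_train,y_train)

#Use the trained model lr to predict on X_test; the results are stored in lr_y_predict
lr_y_predict=lr.predict(X_test)

#Call fit on SGDClassifier to train the model parameters, then predict on X_test; the results are stored in sgdc_y_predict
sgdc.fit(X_train,y_train)
sgdc_y_predict=sgdc.predict(X_test)

#Step 4: Analyze the performance of the linear classification models on the benign/malignant tumor prediction task

#Import classification_report from sklearn.metrics
from sklearn.metrics import classification_report

#Use the score function built into the logistic regression model to get its accuracy on the test set
print('Accuracy of LR Classifier:',lr.score(X_test,y_test))

#Use classification_report to get the other three metrics (precision, recall, F1) for LogisticRegression
print(classification_report(y_test,lr_y_predict,target_names=['Benign','Malignant']))

#Use the score function built into the stochastic gradient descent model to get its accuracy on the test set
print('Accuracy of SGD Classifier:',sgdc.score(X_test,y_test))

#Use classification_report to get the other three metrics (precision, recall, F1) for SGDClassifier
print(classification_report(y_test,sgdc_y_predict,target_names=['Benign','Malignant']))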
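As a rough illustration of what the accuracy score and the other three metrics in the report measure, the sketch below computes accuracy, precision, recall and F1 by hand for one class. The helper `binary_metrics` is hypothetical and not part of scikit-learn, and the toy labels are invented for this example; they are not results from the breast cancer data.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented toy labels using the same coding as the dataset: 2 = benign, 4 = malignant.
y_true = [2, 2, 2, 4, 4, 4, 4, 2]
y_pred = [2, 2, 4, 4, 4, 2, 4, 2]

metrics = binary_metrics(y_true, y_pred, positive=4)
for name, value in metrics.items():
    print(name, value)
```

For a task such as tumor prediction, recall on the malignant class is often the figure to watch, since a false negative (a malignant tumor predicted as benign) is usually more costly than a false positive.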