Python Machine Learning Basics Tutorial: Iris Classification

Date: 2019-11-24


First, import the required libraries:

import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 
import mglearn

from sklearn.datasets import load_iris
iris_dataset = load_iris()

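The iris object returned by load_iris is a Bunch object, which is very similar to a dictionary: it contains keys and values. The class inherits directly from dict, so it naturally gets much of dict's functionality, such as iterating over keys and values or checking whether a key exists.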

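A minimal way to build a Bunch-like class (scikit-learn ships its own version as sklearn.utils.Bunch; the code below is the classic simplified recipe):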

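class Bunch(dict):
    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)
        # Point the instance's attribute dict at the dict itself,
        # so every key is also available as an attribute
        self.__dict__ = self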
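The line `self.__dict__ = self` is what ties attribute access and key access to the same storage. The sketch below repeats the class so that it runs on its own; the key names (`data`, `target`, `name`) are made up for illustration.

```python
class Bunch(dict):
    """Dictionary whose keys are also readable and writable as attributes."""

    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)
        # Use the dict itself as the attribute namespace
        self.__dict__ = self


b = Bunch(data=[1, 2, 3])   # hypothetical keys, for illustration only
b.target = [0, 1, 0]        # attribute assignment creates a key
b['name'] = 'demo'          # key assignment creates an attribute

print(sorted(b.keys()))     # ['data', 'name', 'target']
print(b.name)               # demo
print('target' in b)        # True
```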

A parameter prefixed with * lets a function accept any number of positional arguments:

def avg(first, *rest):
    return (first + sum(rest)) / (1 + len(rest))

# Sample use
avg(1, 2)        # 1.5
avg(1, 2, 3, 4)  # 2.5
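Inside the function, the starred parameter is an ordinary tuple. The sketch below adds a print call to avg to make that visible, and also shows the reverse direction: unpacking an existing list into positional arguments with `*` at the call site. The variable names `values` and `result` are illustrative.

```python
def avg(first, *rest):
    # rest is a tuple holding every positional argument after the first
    print(type(rest).__name__, rest)
    return (first + sum(rest)) / (1 + len(rest))


avg(1, 2)              # prints: tuple (2,)
avg(1, 2, 3, 4)        # prints: tuple (2, 3, 4)

values = [1, 2, 3, 4]
result = avg(*values)  # unpack a list into positional arguments
print(result)          # 2.5
```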

A parameter prefixed with ** lets a function accept any number of keyword arguments:

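def make_element(name, value, **attrs):
    # attrs is a dict of the extra keyword arguments,
    # here {'size': 'large', 'quantity': 6}
    ...  # function body elided in the original

make_element('item', 'Albatross', size='large', quantity=6)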
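The body of make_element is not given in the text above. As one possible illustration only, the sketch below fills it with a hypothetical body that renders the keyword arguments as attributes of an XML-like tag; the output format is an assumption, not the original author's code.

```python
from html import escape


def make_element(name, value, **attrs):
    # attrs is a dict of every keyword argument not matched by a named parameter
    attr_text = ''.join(
        ' {}="{}"'.format(key, escape(str(val))) for key, val in attrs.items()
    )
    # Hypothetical body: wrap the value in a tag that carries the attributes
    return '<{0}{1}>{2}</{0}>'.format(name, attr_text, escape(str(value)))


element = make_element('item', 'Albatross', size='large', quantity=6)
print(element)
# <item size="large" quantity="6">Albatross</item>
```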

Example:

x = Bunch(a='1',b='2',c='3')
print(x.a)
print(x.b)
print(x.c)
Output:
1
2
3

T = Bunch
t = T(left = T(left='a',right='b'),right = T(left='c'))
print(t.left)
print(t.left.right)
print(t['left']['right'])
print('left' in t.right)
print('right' in t.right)
Output:
{'left': 'a', 'right': 'b'}
b
b
True
False

print('Keys of iris_dataset: \n{}'.format(iris_dataset.keys()))
Output:
Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

The value of the target_names key is an array of strings containing the species of flower we want to predict:

print('Target names: {}'.format(iris_dataset['target_names']))
Output:
Target names: ['setosa' 'versicolor' 'virginica']

The value of the feature_names key is a list of strings describing each feature:

print('Feature names: \n{}'.format(iris_dataset['feature_names']))
Output:
Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

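The data itself is contained in the target and data fields. data holds the measurements of sepal length, sepal width, petal length, and petal width as a NumPy array. Each row of the data array corresponds to one flower, and the columns hold the four measurements taken for each flower. The shape attribute gives the dimensions of an array; (150, 4) means an array with 150 rows and 4 columns: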

print('Type of data: {}'.format(type(iris_dataset['data'])))
print('Shape of data: {}'.format(iris_dataset['data'].shape))
Output:
Type of data: <class 'numpy.ndarray'>
Shape of data: (150, 4)

Feature values of the first five samples:

print('First five rows of data:\n{}'.format(iris_dataset['data'][:5]))
Output:
First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
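To see how shape, rows, and columns relate, here is a small sketch on a three-row array copied from the output above. The variable names are illustrative, and NumPy is assumed to be installed.

```python
import numpy as np

# The first three rows of the iris data printed above
data = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2],
                 [4.7, 3.2, 1.3, 0.2]])

n_samples, n_features = data.shape   # rows are samples, columns are features
print(data.shape)     # (3, 4)
print(data.ndim)      # 2
print(data[0])        # first sample: [5.1 3.5 1.4 0.2]
print(data[:, 2])     # petal length column: [1.4 1.4 1.3]
```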

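Individuals in machine learning are called samples, and their properties are called features. The shape of the data array is the number of samples times the number of features.
The target array contains the species of each measured flower, also as a NumPy array. target is a one-dimensional array with one entry per flower. The meaning of each number is given by the iris_dataset['target_names'] array: 0 means setosa, 1 means versicolor, and 2 means virginica: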

print('Type of target:{}'.format(type(iris_dataset['target'])))
print('Shape of target: {}'.format(iris_dataset['target'].shape))
print('Target:\n{}'.format(iris_dataset['target']))
Output:
Type of target:<class 'numpy.ndarray'>
Shape of target: (150,)
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
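The printed target array holds 50 zeros, 50 ones, and 50 twos, so the three species are equally represented. The sketch below checks this with np.bincount on a stand-in array built by hand to match that output; it does not load the real dataset.

```python
import numpy as np

target_names = np.array(['setosa', 'versicolor', 'virginica'])
# Stand-in for iris_dataset['target']: 50 zeros, 50 ones, 50 twos
target = np.repeat([0, 1, 2], 50)

counts = np.bincount(target)   # number of occurrences of each integer label
for name, count in zip(target_names, counts):
    print('{}: {}'.format(name, count))
# setosa: 50
# versicolor: 50
# virginica: 50
```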

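The train_test_split function in scikit-learn shuffles a dataset and splits it. In scikit-learn, data is usually denoted with a capital X and labels with a lowercase y. Before splitting, train_test_split shuffles the dataset using a pseudorandom number generator. To make sure that running the same function several times gives the same output, the random_state parameter sets the seed of that generator. With a fixed seed the result is deterministic, so this line of code always produces the same split.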

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'],random_state=0)

The output of train_test_split is X_train, X_test, y_train, and y_test, all of which are NumPy arrays. By default, X_train contains 75% of the rows and X_test the remaining 25%:

print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))

print('X_test shape: {}'.format(X_test.shape))
print('y_test shape: {}'.format(y_test.shape))
Output:
X_train shape: (112, 4)
y_train shape: (112,)

X_test shape: (38, 4)
y_test shape: (38,)
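The idea behind a seeded shuffle-and-split can be sketched in plain Python. This is not scikit-learn's implementation, and it will not reproduce the exact rows that train_test_split selects; the function name simple_train_test_split and the toy data are made up for illustration. Rounding the test size up gives 38 of 150 rows, which matches the shapes shown above.

```python
import math
import random


def simple_train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle row indices with a seeded generator, then cut them in two."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_rows = [X[i] for i in train_idx]
    test_rows = [X[i] for i in test_idx]
    train_labels = [y[i] for i in train_idx]
    test_labels = [y[i] for i in test_idx]
    return train_rows, test_rows, train_labels, test_labels


X = [[i, i + 0.5] for i in range(150)]   # made-up data: 150 rows, 2 features
y = [i // 50 for i in range(150)]        # made-up labels 0, 1, 2

train_rows, test_rows, train_labels, test_labels = simple_train_test_split(X, y)
print(len(train_rows), len(test_rows))   # 112 38

# The same seed always yields the same split
again = simple_train_test_split(X, y)
print(again[1] == test_rows)             # True
```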

Create a DataFrame from the data in X_train, label its columns with the strings in iris_dataset['feature_names'], and then build a scatter matrix from the DataFrame, colored by y_train.

iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                                 marker='o', hist_kwds={'bins': 20}, s=60,
                                 alpha=.8, cmap=mglearn.cm3)

Note: mglearn is a package written by the book's authors themselves.

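All machine learning models in scikit-learn are implemented in their own classes, which are called Estimator classes. The k-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class in the neighbors module. We need to instantiate the class into an object before we can use the model, and this is when we set the model's parameters. The most important parameter of KNeighborsClassifier is the number of neighbors, which we set to 1 here: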

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

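The knn object encapsulates the algorithm, both the one used to build the model from the training data and the one used to make predictions on new data points. It also holds the information that the algorithm has extracted from the training data. In the case of KNeighborsClassifier, it simply stores the training set.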

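To build the model on the training set, call the fit method of the knn object. It takes X_train and y_train as arguments, both NumPy arrays: the former contains the training data and the latter the corresponding training labels: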

knn.fit(X_train,y_train)

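The fit method returns the knn object itself (and modifies it in place), so in an interactive session we get a string representation of the classifier, which shows the parameters used to build the model.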
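What "storing the training set" and "returning the object itself" mean can be illustrated with a toy one-nearest-neighbor classifier in plain Python. This is a sketch of the idea only, not how scikit-learn implements KNeighborsClassifier; the class name OneNearestNeighbor and the three training rows are assumptions made for the example.

```python
class OneNearestNeighbor:
    """Toy 1-nearest-neighbor classifier; not scikit-learn's implementation."""

    def fit(self, X, y):
        # "Training" only stores the data
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self  # returning self allows calls such as model.fit(...).predict(...)

    def predict(self, X):
        predictions = []
        for row in X:
            # Squared Euclidean distance from this row to every stored row
            distances = [
                sum((a - b) ** 2 for a, b in zip(row, train_row))
                for train_row in self.X_
            ]
            nearest = distances.index(min(distances))
            predictions.append(self.y_[nearest])
        return predictions


# Small illustrative training set: one row per class
train_X = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]]
train_y = [0, 1, 2]

model = OneNearestNeighbor().fit(train_X, train_y)
print(model.predict([[5, 2.9, 1, 0.2]]))   # [0]
```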

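Suppose we find an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. Which species does it belong to? We can put these measurements into a NumPy array whose shape is the number of samples (1) times the number of features (4):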

X_new = np.array([[5,2.9,1,0.2]])

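We turn the measurements of this single flower into one row of a two-dimensional NumPy array, because scikit-learn always expects two-dimensional arrays for the data.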
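The difference between a flat array and a one-row two-dimensional array shows up in the shape. The sketch below compares the two and shows reshape(1, -1) as an alternative to the nested brackets; the variable names are illustrative.

```python
import numpy as np

flat = np.array([5, 2.9, 1, 0.2])
print(flat.shape)              # (4,)

as_row = flat.reshape(1, -1)   # one sample, as many columns as needed
print(as_row.shape)            # (1, 4)

nested = np.array([[5, 2.9, 1, 0.2]])   # same result via nested brackets
print(np.array_equal(as_row, nested))   # True
```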

To make a prediction, call the predict method of the knn object:

prediction = knn.predict(X_new)
print('Prediction:{}'.format(prediction))
print('Predicted target name:{}'.format(iris_dataset['target_names'][prediction]))
Output:
Prediction:[0]
Predicted target name:['setosa']
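The second line of output is a one-element array rather than a plain string because indexing a NumPy array with another array returns an array. A short sketch with hand-built stand-ins for target_names and prediction (the `several` array is hypothetical):

```python
import numpy as np

target_names = np.array(['setosa', 'versicolor', 'virginica'])
prediction = np.array([0])            # same value as the output above

print(target_names[prediction])       # ['setosa']  (an array with one element)
print(target_names[prediction[0]])    # setosa      (a single string)

several = np.array([2, 0, 1])         # hypothetical batch of predictions
print(target_names[several])          # ['virginica' 'setosa' 'versicolor']
```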

According to the model's prediction, this new iris belongs to class 0, which means its species is setosa.

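Finally, evaluate the model. We make a prediction for each iris in the test data and compare it with its label (the known species). We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted: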

y_pred = knn.predict(X_test)
print('Test set predictions:\n{}'.format(y_pred))

print('Test set score:{:.2f}'.format(np.mean(y_pred == y_test)))
# keep two decimal places
Output:
Test set predictions:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
Test set score:0.97
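The expression np.mean(y_pred == y_test) works because the comparison yields a boolean array, and taking the mean counts True as 1 and False as 0. With 38 test flowers, a score that rounds to 0.97 corresponds to 37 correct predictions (37/38 ≈ 0.974). The sketch below uses four hypothetical labels to keep the arithmetic visible.

```python
import numpy as np

y_true = np.array([0, 1, 2, 2])   # hypothetical labels
y_hat = np.array([0, 1, 2, 1])    # hypothetical predictions

matches = (y_hat == y_true)       # element-wise comparison
print(matches)                    # [ True  True  True False]

accuracy = np.mean(matches)       # True counts as 1, False as 0
print('{:.2f}'.format(accuracy))  # 0.75
```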

For details on how to use the format function, see: www.runoob.com/python/att-…
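As a quick reminder of the two format features used throughout this article, the sketch below shows a plain {} placeholder, the {:.2f} specifier for two decimal places, and the equivalent f-string. The value of score is chosen to match the test set score above.

```python
score = 37 / 38

print('Test set score:{:.2f}'.format(score))   # Test set score:0.97
print('{} has {} rows'.format('X_test', 38))   # X_test has 38 rows
print(f'Test set score:{score:.2f}')           # Test set score:0.97
```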