Official account: 码农充电站pro
Homepage: https://codeshellme.github.io
The previous article introduced the principles and basic concepts of SVM; this article covers how to use SVM to solve practical problems.
The SVM algorithm can handle both classification problems and regression problems.
The svm package in the sklearn library implements four SVM classes: LinearSVC, LinearSVR, SVC, and SVR.
LinearSVC/R uses a linear kernel and handles linear problems.
SVC/R lets us choose between a linear kernel and a high-dimensional kernel: with a linear kernel it handles linear problems, and with a high-dimensional kernel it handles non-linear problems.
For linear problems, LinearSVC/R outperforms SVC/R, because LinearSVC/R contains optimizations that make it more efficient.
If you do not know whether your dataset is linearly separable, you can use SVC/R.
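As a quick orientation, the four classes can be instantiated like this (a minimal sketch; no arguments means sklearn's defaults are used):

```python
from sklearn.svm import LinearSVC, LinearSVR, SVC, SVR

# Optimized implementations that always use a linear kernel:
linear_clf = LinearSVC()
linear_reg = LinearSVR()

# Kernel is selectable; the default 'rbf' kernel handles non-linear problems:
kernel_clf = SVC()                        # same as SVC(kernel='rbf')
kernel_reg = SVR()
linear_kernel_clf = SVC(kernel='linear')  # handles linear problems
```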
The following mainly covers the classifiers; using the regressors is much the same.
First, look at the signature of the SVC class:
SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
Among them, several parameters are important: kernel selects the kernel function ('linear', 'poly', 'rbf' by default, or 'sigmoid'); C is the regularization parameter, where a smaller value means stronger regularization; gamma is the kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels.
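For example, a sketch of setting these parameters explicitly (the values shown are simply sklearn's defaults):

```python
from sklearn.svm import SVC

# kernel: 'linear', 'poly', 'rbf' (default) or 'sigmoid'
# C:      regularization strength; a smaller C regularizes more strongly
# gamma:  kernel coefficient for the 'rbf', 'poly' and 'sigmoid' kernels
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
```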
Next, look at the signature of the LinearSVC class:
LinearSVC(penalty='l2', loss='squared_hinge', dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)
The LinearSVC class has no kernel parameter, because it always uses a linear kernel.
The sklearn library ships with a breast cancer dataset; below we use it to build an SVM classifier.
The dataset contains features collected from patients: 569 samples in total, each with 31 comma-separated fields. Of the 569 samples, 357 are benign and 212 are malignant.
Here are 3 randomly drawn samples:
16.13,20.68,108.1,798.8,0.117,0.2022,0.1722,0.1028,0.2164,0.07356,0.5692,1.073,3.854,54.18,0.007026,0.02501,0.03188,0.01297,0.01689,0.004142,20.96,31.48,136.8,1315,0.1789,0.4233,0.4784,0.2073,0.3706,0.1142,0
19.81,22.15,130,1260,0.09831,0.1027,0.1479,0.09498,0.1582,0.05395,0.7582,1.017,5.865,112.4,0.006494,0.01893,0.03391,0.01521,0.01356,0.001997,27.32,30.88,186.8,2398,0.1512,0.315,0.5372,0.2388,0.2768,0.07615,0
13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,0.05766,0.2699,0.7886,2.058,23.56,0.008462,0.0146,0.02387,0.01315,0.0198,0.0023,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259,1
The table below explains what each column means:
Column | Meaning | Column | Meaning | Column | Meaning |
---|---|---|---|---|---|
1 | mean radius | 11 | radius standard error | 21 | worst radius |
2 | mean texture | 12 | texture standard error | 22 | worst texture |
3 | mean perimeter | 13 | perimeter standard error | 23 | worst perimeter |
4 | mean area | 14 | area standard error | 24 | worst area |
5 | mean smoothness | 15 | smoothness standard error | 25 | worst smoothness |
6 | mean compactness | 16 | compactness standard error | 26 | worst compactness |
7 | mean concavity | 17 | concavity standard error | 27 | worst concavity |
8 | mean concave points | 18 | concave points standard error | 28 | worst concave points |
9 | mean symmetry | 19 | symmetry standard error | 29 | worst symmetry |
10 | mean fractal dimension | 20 | fractal dimension standard error | 30 | worst fractal dimension |
The last column indicates whether the tumor is benign: 0 means malignant, 1 means benign.
We can load the dataset with the load_breast_cancer function:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
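As a quick sanity check on what was loaded (569 samples and 30 feature columns, matching the description above):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)    # feature matrix: (569, 30)
print(data.target.shape)  # labels: (569,)
print(data.target.sum())  # number of benign samples (label 1): 357
```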
The feature_names attribute stores the meaning of each column:
>>> print(data.feature_names)
['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
The data attribute stores the feature values:
>>> print(data.data)
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01] [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02] [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02] ... [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02] [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01] [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
The target attribute stores the target values:
>>> print(data.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1]
We know the first 30 columns of the dataset are features and the last column is the target. Looking at the features, the 30 columns cover 10 underlying measurements, each appearing in three groups: mean values (columns 1-10), standard errors (columns 11-20), and worst values (columns 21-30).
So when training the SVM model, we can pick just one of these groups as the training features.
For example, here we take the first 10 columns as training features (and ignore the remaining 20):
>>> features = data.data[:,0:10]  # feature set
>>> labels = data.target  # target set
Split the data into a training set and a test set:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=0)
Standardize the data with Z-Score normalization:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
train_features = ss.fit_transform(train_features)
test_features = ss.transform(test_features)
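To see what fit_transform and transform do, here is a tiny self-contained sketch on made-up numbers: the scaler learns the mean and standard deviation from the training data only, then reuses those statistics on the test data, so the test set never leaks into the fitted statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

ss = StandardScaler()
X_train_s = ss.fit_transform(X_train)  # learn mean/std from the training set
X_test_s = ss.transform(X_test)        # apply the same mean/std to the test set

# Each training column now has mean 0 and unit variance:
print(X_train_s.mean(axis=0))  # ~[0. 0.]
print(X_train_s.std(axis=0))   # ~[1. 1.]
```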
Now build the SVM classifier with the SVC class:
from sklearn.svm import SVC
svc = SVC()  # use default parameters
Train the model:
svc.fit(train_features, train_labels)
Make predictions with the model:
prediction = svc.predict(test_features)
Evaluate the model's accuracy:
from sklearn.metrics import accuracy_score
score = accuracy_score(test_labels, prediction)
>>> print(score)
0.9414893617021277
The accuracy is about 94%, so the training worked quite well.
There is also the linear classifier LinearSVC, which is worth looking into.
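As a sketch of what a LinearSVC example might look like, here it is applied to the same breast cancer data, prepared exactly as above (max_iter is raised from its default of 1000 to help the solver converge; the exact accuracy may vary slightly across sklearn versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
features = data.data[:, 0:10]  # first 10 columns, as above
labels = data.target

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=0)

ss = StandardScaler()
train_features = ss.fit_transform(train_features)
test_features = ss.transform(test_features)

lsvc = LinearSVC(max_iter=10000)  # more iterations than the default 1000
lsvc.fit(train_features, train_labels)
score = accuracy_score(test_labels, lsvc.predict(test_features))
print(score)
```

On this split its accuracy lands in the same ballpark as the SVC result above, which is expected since the standardized data is close to linearly separable.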
sklearn implements the SVM algorithm, and here we showed how to use it to solve a real problem.
Besides sklearn, LIBSVM also implements the SVM algorithm; it is a very well-known library that you can explore on your own.
(End of this section.)