python 实例 SVM SVR cv 核函数 LinearSVR、RBFSampler、 SGDRegressor和 Nystroem的使用

SVM实例,两个数据,两个例子。cancer data样本量小,分类数据用svc函数,较为简单;houseprice样本量大,连续数据,用了支持向量回归SVR,函数先用了RBFSampler和 Nystroem做核映射,然后用SGDRegressor做支持向量回归,使用的这三个函数都很适合大样本。

 

I. 准备

1.import...

In [1]:

import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats,integrate
sns.set_style("darkgrid")
sns.set(color_codes=True)
plt.rcParams['font.sans-serif'] = ['SimHei']  # 绘图时可以显示中文
plt.rcParams['axes.unicode_minus']=False   # 绘图时显示负号
warnings.filterwarnings("ignore")  # 不要显示警告

II. cancer data

1.read data

In [2]:

cancer = pd.read_excel('C:\\Users\\91333\\Documents\\semester6\\data science\\3.NB&DT\\Week3_CancerDataset.xlsx') 

2.数据标准化

In [3]:

cancer_scaled = cancer.iloc[:,1:-1].apply(lambda x: (x - np.mean(x)) / (np.std(x)))

3. 训练SVM模型

对于三种常见的核函数,分别训练SVM模型,并调节参数

(1)线性核函数

线性核函数没有映射到高维度空间,由于是soft margin svm,所以需要调节的参数只有容忍错判的样本的惩罚系数C,用cv结果来挑选C。

In [4]:

from sklearn.svm import SVC,SVR,LinearSVR
from sklearn.model_selection import cross_val_score
C_range = range(1,31)
cv_scores = []
for c in C_range:
    clf = SVC(C=c,kernel = 'linear')
    scores = cross_val_score(clf, cancer_scaled, cancer.iloc[:, -1], cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
print('当松弛变量C取{}时,score最大,为{}'.format(C_range[cv_scores.index(max(cv_scores))],max(cv_scores)))

plt.figure(figsize=(10,4))
plt.plot(C_range,cv_scores)
plt.xlabel("C")
plt.ylabel("score")
plt.title("score随惩罚系数C变化图")
当松弛变量C取6时,score最大,为0.973728357060408

Out[4]:

Text(0.5, 1.0, 'score随惩罚系数C变化图')

(2)高斯核函数

由于存在两个参数松弛变量C和gamma,使用GridSearchCV函数进行网格搜索。

In [5]:

from sklearn.model_selection import GridSearchCV 
from sklearn.metrics import fbeta_score, make_scorer
ftwo_scorer = make_scorer(fbeta_score, beta=2)
grid1 = GridSearchCV(SVC(kernel = 'rbf'), param_grid={'C': [1,10,20],'gamma': [0.09,0.009,0.0009,0.00009]},
                    scoring=ftwo_scorer, cv=5)
grid1.fit(cancer_scaled, cancer.iloc[:, -1])
print('使用GridSearchCV网格搜索后,最优的松弛变量C取值为{},最优的gamma取值为{},该模型score为{}'.format( 
    grid1.best_estimator_.get_params()['C'],grid1.best_estimator_.get_params()['gamma'],grid1.best_score_))
使用GridSearchCV网格搜索后,最优的松弛变量C取值为10,最优的gamma取值为0.009,该模型score为0.9882421118919705

(3)多项式核函数

存在两个参数松弛变量C和阶数,同样利用GridSearchCV函数

In [6]:

grid2 = GridSearchCV(SVC(kernel = 'poly'), param_grid={'C': [1,5,10,15,20],'degree': [1,2,3,4,5,6,7,8,9]},
                    scoring=ftwo_scorer, cv=5)
grid2.fit(cancer_scaled, cancer.iloc[:, -1])
print('使用GridSearchCV网格搜索后,最优的松弛变量C取值为{},最优的阶数取值为{},该模型score为{}'.format( 
    grid2.best_estimator_.get_params()['C'],grid2.best_estimator_.get_params()['degree'],grid2.best_score_))

plt.scatter(x=grid2.cv_results_['param_C'].data, y=grid2.cv_results_['param_degree'].data,
            s=1000*(grid2.cv_results_['mean_test_score']-min(grid2.cv_results_['mean_test_score'])+0.1),
            cmap=plt.cm.get_cmap('RdYlBu'))
plt.xlabel("C")
plt.ylabel("degree")
plt.title("score随惩罚系数C和degree变化散点图")
plt.annotate('圆圈大小表示score大小', xy=(1,1), xytext=(1, -1),color='b',size=10)
plt.annotate('这个点最优',xy=(1,1),xytext=( grid2.best_estimator_.get_params()['C']+1,grid2.best_estimator_.get_params()['degree']),color='r')
使用GridSearchCV网格搜索后,最优的松弛变量C取值为20,最优的阶数取值为3,该模型score为0.9867086284245016

Out[6]:

Text(21, 3, '这个点最优')

奇偶degree取值使得score交替变化,奇特。

4. 写出超平面

尝试了一下,只有没有映射到高维空间的线性核函数SVM可以取出超平面表达式,设超平面h(x)=wx+b。

(1)方向w

In [7]:

clf = SVC(kernel = 'linear')
clf.fit(cancer_scaled, cancer.iloc[:, -1])
print(clf.coef_)
[[-8.80113094e-02 -3.59306843e-01 -3.35013827e-01 -8.39562339e-04
   6.46330866e-01 -7.45104189e-01 -9.27479275e-01 -7.01658929e-02
   3.81723561e-01 -8.64975132e-01  3.24937431e-01 -2.37246961e-01
  -8.85892419e-01 -3.70854979e-01  3.81626684e-01  3.71302098e-01
  -4.33843461e-01  8.46263549e-02  8.92709596e-01 -6.49610348e-01
  -1.00930557e+00 -3.76030776e-01 -7.55483286e-01 -3.94390435e-01
   1.53334795e-01 -1.02528284e+00 -1.18490959e-01 -4.30099332e-01
  -8.64181126e-01]]

这是一个和样本feature数同维数的向量,31维。

(2)截距b

In [8]:

print(clf.intercept_ )
[0.00091257]

III. houseprice

1. read data

In [9]:

cal_housing = pd.read_csv('C:\\Users\\91333\\Documents\\semester6\\data science\\5.SVM\\房价预测\\cal_housing.data',
                          header=None,names=['longitude','latitude','Age','Rooms','Bedrooms','population','households','Income','HouseValue'])

2.探索一下

1.前5行

In [10]:

cal_housing.head(5)

Out[10]:

  longitude latitude Age Rooms Bedrooms population households Income HouseValue
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0

2.可视化

(1)经纬度和房屋价格的散点图

In [11]:

plt.figure(figsize=[17,7])
plt.subplot(1,2,1)
plt.scatter(cal_housing['longitude'].values,cal_housing['HouseValue'].values,alpha=0.05)
plt.xlabel("longitude")
plt.ylabel("MedianHouseValue")
plt.title("scatterplot of longitude and MedianHouseValue",fontsize="x-large")
plt.subplot(1,2,2)
plt.scatter(cal_housing['latitude'].values,cal_housing['HouseValue'].values,alpha=0.05)
plt.xlabel("latitude")
plt.ylabel("MedianHouseValue")
plt.title("scatterplot of latitude and MedianHouseValue",fontsize="x-large")

Out[11]:

Text(0.5, 1.0, 'scatterplot of latitude and MedianHouseValue')

无论经度还是纬度,都主要有两种区间:第一种区间上,房屋价格的分布考下,很多都在200000以下;第二种区间上,各种价格的房屋都存在,而且都中间价格的多,点更密集,两头更稀疏。图中的规律和现实大致相符,在主城区各种价格的房屋都有,而在乡村偏远地区,主要是价格低的房屋。

2)收入和房屋价格的核密度图

In [12]:

sns.jointplot(x='Income',y='HouseValue',data=cal_housing,kind='kde')

Out[12]:

<seaborn.axisgrid.JointGrid at 0x267bd149c88>

收入和房屋价格正相关,符合生活经验。 其他变量与房屋价格的双变量图像效果不好,我就没展示。

3. 数据预处理

(1)数据标准化

In [13]:

cal_housing_scaled = cal_housing.apply(lambda x: (x - np.mean(x)) / (np.std(x)))

(2)主成分分析

主成分分析的目的是降维,加快后面SVR的速度。

In [14]:

from sklearn.decomposition import PCA
pca=PCA().fit(cal_housing_scaled.iloc[:,0:-1])
print('方差累计贡献率为{}'.format(np.cumsum(pca.explained_variance_ratio_)))
方差累计贡献率为[0.48833528 0.7268287  0.86082389 0.96357371 0.98208055 0.99228857
 0.99815104 1.        ]

前四个主成分累计贡献了0.96的方差变异,所以选取前四个主成分就可以。

In [15]:

pca_cal=pca.fit_transform(cal_housing_scaled.iloc[:,0:-1])[:,0:4]

4.训练SVR模型

(1)线性核函数SVR

sklearn的帮助文档说,当线性核函数时,由于LinearSVR用的是liblinear,而SVR用的是libsvm,LinearSVR函数要比SVR函数快,我试了一下果然。

a.PCA后

In [16]:

import time
start = time.time()
cv_scores_pca_linear = []
for c in C_range:
    svr = LinearSVR(C=1/c)
    scores = cross_val_score(svr, pca_cal, cal_housing_scaled.iloc[:, -1], cv=5)
    cv_scores_pca_linear.append(scores.mean())
end = time.time()

print('当松弛变量C取{}时,score最大,为{},运行时间为{}秒'.format(
    C_range[cv_scores_pca_linear.index(max(cv_scores_pca_linear))],max(cv_scores_pca_linear),end-start))
plt.figure(figsize=(10,4))
plt.plot(C_range,cv_scores_pca_linear)
plt.xlabel("C")
plt.ylabel("score")
plt.title("PCA后LinearSVR函数score随惩罚系数C变化图")
当松弛变量C取26时,score最大,为0.4471613979532877,运行时间为62.31103038787842秒

Out[16]:

Text(0.5, 1.0, 'PCA后LinearSVR函数score随惩罚系数C变化图')

b.PCA前

In [17]:

start = time.time()
cv_scores_linear = []
for c in C_range:
    svr = LinearSVR(C=1/c)
    scores = cross_val_score(svr, cal_housing_scaled.iloc[:, 0:-1], cal_housing_scaled.iloc[:, -1], cv=5)
    cv_scores_linear.append(scores.mean())
end = time.time()
print('当松弛变量C取{}时,score最大,为{},运行时间为{}秒'.format(
    C_range[cv_scores_linear.index(max(cv_scores_linear))],max(cv_scores_linear),end-start))

plt.figure(figsize=(10,4))
plt.plot(C_range,cv_scores)
plt.xlabel("C")
plt.ylabel("score")
plt.title("PCA前LinearSVR函数score随惩罚系数C变化图")
当松弛变量C取10时,score最大,为0.5756157136922346,运行时间为84.73883938789368秒

Out[17]:

Text(0.5, 1.0, 'PCA前LinearSVR函数score随惩罚系数C变化图')

PCA前的总运行时间比PCA后的总运行时间多了约20秒,但是正确率低了10%以上,无法容忍,所以下面用PCA前的数据。

(2)高斯核函数

a.核函数映射用RBFSampler函数

借助两个函数RBFSampler和SGDRegressor函数,利用RBFSampler先把原数据映射到feature空间,然后再进行SVR操作。 帮助文档说,SGDRegressor函数也可以做SVR,参数alpha是惩罚系数,比LinearSVR的C要小很多,alpha的默认值alpha=0.0001,C的默认值是1,我也不知道是怎么定义的。 SGDRegressor函数真的好快,帮助文档推荐1万以上的样本用SGDRegressor函数。

In [18]:

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
start = time.time()
alpha_range=[-1,-3,-5,-7,-9,-11]
gamma_range=[-1,-1.5,-2,-2.5,-3]

cv_scores_RBFS=np.zeros(shape=(len(alpha_range),len(gamma_range)))
for i in range(len(alpha_range)):
    sdgclf=SGDRegressor(loss="epsilon_insensitive",alpha=10**alpha_range[i])
    for j in range(len(gamma_range)):
        rbf_feature = RBFSampler(gamma = 10**gamma_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:,0: -1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_RBFS[i,j] = scores.mean()
end = time.time()     
m, n = cv_scores_RBFS.shape
index = int(cv_scores_RBFS.argmax())
i = int(index / n)
j = index % n
print('当惩罚系数alpha取10^{}、高斯核函数的gamma值取10^{}时,score最大,为{},运行时间为{}秒'.format(alpha_range[i],gamma_range[j],cv_scores_RBFS.max(),end-start))
print(cv_scores_RBFS)
当惩罚系数alpha取10^-11、高斯核函数的gamma值取10^-1.5时,score最大,为0.5706462592558829,运行时间为100.89476466178894秒
[[ 0.12197254  0.09333182 -0.01330377 -0.08902908 -0.11553457]
 [ 0.55536832  0.5593231   0.52558767  0.46167317  0.28025331]
 [ 0.56701374  0.57042556  0.53751988  0.49345061  0.39564883]
 [ 0.5656155   0.5686655   0.53074254  0.49179496  0.39597599]
 [ 0.5672084   0.56981525  0.53735416  0.49079757  0.39544188]
 [ 0.5636954   0.57064626  0.53562283  0.49084243  0.39619231]]

b.核函数映射用Nystroem函数

In [19]:

from sklearn.kernel_approximation import Nystroem
start = time.time()
alpha_range=[-1,-3,-5,-7,-9,-11]
gamma_range=[-1,-1.5,-2,-2.5,-3]
cv_scores_Nyst=np.zeros(shape=(len(alpha_range),len(gamma_range)))
for i in range(len(alpha_range)):
    sdgclf=SGDRegressor(loss="epsilon_insensitive",alpha=10**alpha_range[i])
    for j in range(len(gamma_range)):
        rbf_feature = Nystroem(gamma = 10**gamma_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:,0: -1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_Nyst[i,j] = scores.mean()
end = time.time()     
m, n = cv_scores_Nyst.shape
index = int(cv_scores_Nyst.argmax())
i = int(index / n)
j = index % n
print('当惩罚系数alpha取10^{}、高斯核函数的gamma值取10^{}时,score最大,为{},运行时间为{}秒'.format(alpha_range[i],gamma_range[j],cv_scores_Nyst.max(),end-start))
print(cv_scores_Nyst)
当惩罚系数alpha取10^-5、高斯核函数的gamma值取10^-1时,score最大,为0.590604308123932,运行时间为100.88983869552612秒
[[ 0.14738545  0.11121301 -0.00634036 -0.08838434 -0.11977817]
 [ 0.58485212  0.57181939  0.51956549  0.44955772  0.29771895]
 [ 0.59060431  0.5851441   0.52589562  0.48105512  0.39730037]
 [ 0.5878266   0.58427723  0.52762529  0.47229848  0.3941709 ]
 [ 0.58944038  0.58054867  0.52827467  0.47057277  0.38794852]
 [ 0.58958172  0.58522741  0.53486099  0.47879231  0.39311045]]

帮助文档说Nystroem函数比RBFSample慢一些,但更加精确,和代码运行结果大致相符合。

(3)多项式核函数

In [20]:

start = time.time()
alpha_range=[-1,-3,-5,-7,-9,-11]
degree_range=[1,2,3,4,5,6,7,8,9]
cv_scores_poly=np.zeros(shape=(len(alpha_range),len(degree_range)))
for i in range(len(alpha_range)):
    sdgclf=SGDRegressor(loss="epsilon_insensitive",alpha=10**alpha_range[i])
    for j in range(len(degree_range)):
        rbf_feature = Nystroem(degree =degree_range[j], random_state=1)
        X_features = rbf_feature.fit_transform(cal_housing_scaled.iloc[:,0: -1])
        scores = cross_val_score(sdgclf, X_features, cal_housing_scaled.iloc[:, -1], cv=5)
        cv_scores_poly[i,j] = scores.mean()
end = time.time()     
m, n = cv_scores_poly.shape
index = int(cv_scores_poly.argmax())
i = int(index / n)
j = index % n
print('当惩罚系数alpha取10^{}、多项式核函数的degree值取{}时,score最大,为{},运行时间为{}秒'.format(alpha_range[i],degree_range[j],cv_scores_poly.max(),end-start))
print(cv_scores_poly)
当惩罚系数alpha取10^-5、多项式核函数的degree值取3时,score最大,为0.5836778542094929,运行时间为245.93922114372253秒
[[0.1387308  0.14087102 0.13907231 0.14932641 0.14683597 0.14708831
  0.14242401 0.13844941 0.14060674]
 [0.56763798 0.56812605 0.56974842 0.56641043 0.56690536 0.57121851
  0.56881908 0.56988619 0.56971767]
 [0.57845136 0.57844851 0.58367785 0.57892279 0.58108542 0.58091855
  0.57843517 0.58109156 0.57894842]
 [0.58244182 0.57824064 0.57810287 0.58273207 0.57906801 0.5822089
  0.57929546 0.57593165 0.58037378]
 [0.5791144  0.5810481  0.57533389 0.58261548 0.58022263 0.57767742
  0.58082774 0.58168407 0.5816219 ]
 [0.57711384 0.57705246 0.58137879 0.57963625 0.57937285 0.58090316
  0.58025604 0.5781358  0.57887552]]