SVM: Hard-Margin / Maximum-Margin SVM (Derivation from Scratch)

(Figure: schematic of the maximum-margin separating hyperplane, taken from Bing Images)

Binary classification problem

$Data=\{(x_i, y_i)\}_{i=1}^N,\ x_i\in\mathbb{R}^p,\ y_i\in\{-1,+1\}$
There are many separating hyperplanes $\omega^Tx+b=0$; we want to find the best one, i.e., the one with the lowest generalization error (also called test error or expected loss).

The hard-margin SVM is a discriminative model and involves no probabilities:
$$f(x)=\mathrm{sign}(\omega^Tx+b)=\begin{cases}+1, & \omega^Tx+b>0\\-1, & \omega^Tx+b<0\end{cases}$$
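As a quick illustration of the decision rule (a minimal sketch; the arrays `X`, `w`, `b` below are made-up placeholders, not values from this derivation):

```python
import numpy as np

def svm_predict(X, w, b):
    """Hard-margin SVM decision rule: sign(w^T x + b) for each row x of X."""
    return np.sign(X @ w + b)

# toy usage with made-up parameters
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
w = np.array([1.0, 1.0])
b = -0.5
print(svm_predict(X, w, b))  # -> [ 1. -1.]
```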
Objective:
$$\begin{cases}\max\ \mathrm{margin}(\omega,b)\\ \text{s.t. }\begin{cases}\omega^Tx_i+b>0, & y_i=+1\\ \omega^Tx_i+b<0, & y_i=-1\end{cases}\;\Rightarrow\; y_i(\omega^Tx_i+b)>0,\ i=1,\dots,N\end{cases}$$
That is,
$$\begin{cases}\max\ \mathrm{margin}(\omega,b)\\ \text{s.t. } y_i(\omega^Tx_i+b)>0,\ i=1,\dots,N\end{cases}$$

What is the margin?
Answer: among the distances from all $N$ points to the hyperplane, the smallest one is the margin. Using the point-to-hyperplane distance formula,
$$\mathrm{margin}(\omega,b)=\min_{x_i}\ \mathrm{distance}(\omega,b,x_i)=\min_{x_i}\ \frac{1}{\|\omega\|}\,|\omega^Tx_i+b|$$
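A small numerical sketch of this definition (data and parameters are illustrative, not from the text): for a fixed $(\omega,b)$, compute every point's distance $\frac{1}{\|\omega\|}|\omega^Tx_i+b|$ and take the minimum.

```python
import numpy as np

def margin(X, w, b):
    """margin(w, b) = min_i |w^T x_i + b| / ||w|| over all data points x_i."""
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    return distances.min()

# toy usage with made-up data
X = np.array([[3.0, 3.0], [4.0, 4.0], [-1.0, -1.0], [-2.0, -3.0]])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(X, w, b))  # distance of the closest point to the hyperplane
```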

Plugging the margin into the objective above gives:
$$\begin{cases}\max_{\omega,b}\ \min_{x_i}\ \frac{1}{\|\omega\|}|\omega^Tx_i+b|=\max_{\omega,b}\ \frac{1}{\|\omega\|}\min_{x_i}\ y_i(\omega^Tx_i+b)\quad\Leftarrow\ y_i\in\{-1,+1\}\\ \text{s.t. } y_i(\omega^Tx_i+b)>0\end{cases}$$
(The absolute value can be dropped because $y_i\in\{-1,+1\}$ and $y_i(\omega^Tx_i+b)>0$ imply $|\omega^Tx_i+b|=y_i(\omega^Tx_i+b)$.)

The constraint $y_i(\omega^Tx_i+b)>0$ can be restated as: $\exists\ \gamma>0$ such that $\min_{x_i}\ y_i(\omega^Tx_i+b)=\gamma$.
The value of $\gamma$ has no effect on the problem (or on the hyperplane); changing it simply rescales $\omega$ and $b$.
Therefore, set $\gamma=1$.
Then $\max_{\omega,b}\frac{1}{\|\omega\|}\min_{x_i}y_i(\omega^Tx_i+b)=\max_{\omega,b}\frac{1}{\|\omega\|}\cdot\gamma=\max_{\omega,b}\frac{1}{\|\omega\|}$.
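To make the rescaling argument concrete (an added check, not part of the original text): if $\min_i y_i(\omega^Tx_i+b)=\gamma>0$, define $\tilde\omega=\omega/\gamma$ and $\tilde b=b/\gamma$; then
$$\min_i y_i(\tilde\omega^Tx_i+\tilde b)=\frac{1}{\gamma}\min_i y_i(\omega^Tx_i+b)=\frac{\gamma}{\gamma}=1,$$
and $\tilde\omega^Tx+\tilde b=0$ describes exactly the same hyperplane as $\omega^Tx+b=0$, so fixing $\gamma=1$ loses no generality.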
The problem can then be rewritten as:
$$\begin{cases}\max_{\omega,b}\frac{1}{\|\omega\|}\;\Rightarrow\;\min_{\omega,b}\|\omega\|\;\Leftrightarrow\;\min_{\omega,b}\frac{1}{2}\omega^T\omega & \text{hard margin; a convex quadratic objective, directly solvable}\\ \text{s.t. }\min_{x_i}y_i(\omega^Tx_i+b)=1\;\Rightarrow\;y_i(\omega^Tx_i+b)\geqslant1,\ i=1,\dots,N & (N\text{ constraints})\end{cases}$$

Hence,
$$(1)\quad\begin{cases}\min_{\omega,b}\frac{1}{2}\omega^T\omega\\ \text{s.t. } y_i(\omega^Tx_i+b)\geqslant1,\ i=1,\dots,N\end{cases}$$
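For completeness, a minimal sketch of solving (1) directly with a generic constrained solver (here `scipy.optimize.minimize` with SLSQP on toy, linearly separable data; the variable names and data are illustrative assumptions, and in practice a dedicated QP solver would be preferable):

```python
import numpy as np
from scipy.optimize import minimize

# toy, linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, p = X.shape

def objective(wb):
    w = wb[:p]
    return 0.5 * w @ w                       # (1/2) w^T w

def constraint(wb):
    w, b = wb[:p], wb[p]
    return y * (X @ w + b) - 1.0             # y_i (w^T x_i + b) - 1 >= 0

res = minimize(objective, x0=np.ones(p + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraint}])
w_opt, b_opt = res.x[:p], res.x[p]
print(w_opt, b_opt)
```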

Solving the optimization problem

1. Primal problem: an optimization problem with constraints on $\omega,b$

$$(1)\quad\begin{cases}\min_{\omega,b}\frac{1}{2}\omega^T\omega\\ \text{s.t. } y_i(\omega^Tx_i+b)\geqslant1\ \text{for } i=1,\dots,N\;\Longleftrightarrow\;1-y_i(\omega^Tx_i+b)\leqslant0\end{cases}$$

2. Lagrange multipliers → an optimization problem unconstrained in $\omega,b$

$$L(\omega,b,\lambda)=\frac{1}{2}\omega^T\omega+\sum_{i=1}^N\lambda_i\bigl(1-y_i(\omega^Tx_i+b)\bigr),\quad \lambda_i\geqslant0$$
$$(2)\quad\begin{cases}\min_{\omega,b}\max_{\lambda}L(\omega,b,\lambda)\\ \text{s.t. }\lambda_i\geqslant0\end{cases}$$

Note the role of $1-y_i(\omega^Tx_i+b)\leqslant0$: why can the explicit constraint be dropped?
Answer:
Intuitively,
if $1-y_i(\omega^Tx_i+b)>0$ for some $i$, then $\max_{\lambda}L=\frac{1}{2}\omega^T\omega+\infty=\infty$ (let $\lambda_i\to\infty$);
if $1-y_i(\omega^Tx_i+b)\leqslant0$ for all $i$, then $\max_{\lambda}L$ is finite: $\max_{\lambda}L=\frac{1}{2}\omega^T\omega+0=\frac{1}{2}\omega^T\omega$ (attained at $\lambda_i=0$).
Therefore $\min_{\omega,b}\max_{\lambda}L(\omega,b,\lambda)=\min_{\omega,b}\bigl(\infty,\ \frac{1}{2}\omega^T\omega\bigr)=\min_{\omega,b}\frac{1}{2}\omega^T\omega$,
so any $(\omega,b)$ with $1-y_i(\omega^Tx_i+b)>0$ is automatically discarded by the outer minimization.

3. Converting to the strong dual problem

$$(3)\quad\begin{cases}\max_{\lambda}\min_{\omega,b}L(\omega,b,\lambda)\\ \text{s.t. }\lambda_i\geqslant0\end{cases}$$

What are strong and weak duality?
Answer: a convex quadratic problem such as this one satisfies strong duality.
(1) Weak duality states $\min\ \max L\geqslant\max\ \min L$; as a mnemonic, "the tail of a phoenix still beats the head of a chicken", i.e., the smallest of the large values is at least the largest of the small values.
(2) Strong duality replaces $\geqslant$ with $=$.

4. Solving the dual problem: evaluating $\min_{\omega,b}L(\omega,b,\lambda)$

$$L(\omega,b,\lambda)=\frac{1}{2}\omega^T\omega+\sum_{i=1}^N\lambda_i\bigl(1-y_i(\omega^Tx_i+b)\bigr),\quad \lambda_i\geqslant0$$

(1) Set $\frac{\partial L}{\partial b}=0$

$$\frac{\partial L}{\partial b}=\frac{\partial}{\partial b}\Bigl[\sum_{i=1}^N\lambda_i-\sum_{i=1}^N\lambda_iy_i(\omega^Tx_i+b)\Bigr]=\frac{\partial}{\partial b}\Bigl[-\sum_{i=1}^N\lambda_iy_ib\Bigr]=-\sum_{i=1}^N\lambda_iy_i=0$$
Hence $\sum_{i=1}^N\lambda_iy_i=0$.

(2) Substitute $\sum_{i=1}^N\lambda_iy_i=0$ into $L(\omega,b,\lambda)$

$$\begin{aligned}L(\omega,b,\lambda)&=\frac{1}{2}\omega^T\omega+\sum_{i=1}^N\lambda_i-\sum_{i=1}^N\lambda_iy_i(\omega^Tx_i+b)\\&=\frac{1}{2}\omega^T\omega+\sum_{i=1}^N\lambda_i-\sum_{i=1}^N\lambda_iy_i\omega^Tx_i-\sum_{i=1}^N\lambda_iy_ib\\&=\frac{1}{2}\omega^T\omega+\sum_{i=1}^N\lambda_i-\sum_{i=1}^N\lambda_iy_i\omega^Tx_i\end{aligned}$$

(3) Set $\frac{\partial L}{\partial \omega}=0$

$$\frac{\partial L}{\partial \omega}=\frac{1}{2}\cdot2\cdot\omega-\sum_{i=1}^N\lambda_iy_ix_i=0$$
Hence $\omega=\sum_{i=1}^N\lambda_iy_ix_i$.

(4) Substitute $\omega=\sum_{i=1}^N\lambda_iy_ix_i$ into $L(\omega,b,\lambda)$

$$L(\omega,b,\lambda)=\frac{1}{2}\Bigl(\sum_{i=1}^N\lambda_iy_ix_i\Bigr)^T\Bigl(\sum_{i=1}^N\lambda_iy_ix_i\Bigr)+\sum_{i=1}^N\lambda_i-\sum_{i=1}^N\lambda_iy_i\Bigl(\sum_{j=1}^N\lambda_jy_jx_j\Bigr)^Tx_i$$

Note:
$\lambda_i\in\mathbb{R},\ y_i\in\{-1,1\},\ x_i\in\mathbb{R}^p$, so
$$\Bigl(\sum_{i=1}^N\lambda_iy_ix_i\Bigr)^T=\sum_{i=1}^N\lambda_iy_ix_i^T$$
$$\omega^T\omega=\Bigl(\sum_{i=1}^N\lambda_iy_ix_i^T\Bigr)\Bigl(\sum_{j=1}^N\lambda_jy_jx_j\Bigr)=\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j$$
Similarly,
$$\sum_{i=1}^N\lambda_iy_i\Bigl(\sum_{j=1}^N\lambda_jy_jx_j\Bigr)^Tx_i=\sum_{i=1}^N\lambda_iy_i\sum_{j=1}^N\lambda_jy_jx_j^Tx_i=\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_jy_iy_jx_j^Tx_i=\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j\ \Leftarrow\ x_i^Tx_j=x_j^Tx_i\in\mathbb{R}$$
The two results are identical:
$$\Bigl(\sum_{i=1}^N\lambda_iy_ix_i\Bigr)^T\Bigl(\sum_{i=1}^N\lambda_iy_ix_i\Bigr)=\sum_{i=1}^N\lambda_iy_i\Bigl(\sum_{j=1}^N\lambda_jy_jx_j\Bigr)^Tx_i=\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j$$
so the quadratic term and the cross term combine into a single $-\frac{1}{2}$ double sum.
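In vectorized form, this double sum is just a quadratic form in the Gram matrix $K=XX^T$ (with $K_{ij}=x_i^Tx_j$). A quick numerical sketch of the identity above, with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.normal(size=(N, p))          # rows are x_i
y = rng.choice([-1.0, 1.0], size=N)
lam = rng.random(N)                  # arbitrary nonnegative lambda_i

w = (lam * y) @ X                    # w = sum_i lambda_i y_i x_i
K = X @ X.T                          # Gram matrix, K_ij = x_i^T x_j
quad = (lam * y) @ K @ (lam * y)     # sum_i sum_j lambda_i lambda_j y_i y_j x_i^T x_j

print(np.allclose(w @ w, quad))      # True: w^T w equals the double sum
```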

Therefore,
$$\min_{\omega,b}L(\omega,b,\lambda)=\sum_{i=1}^N\lambda_i-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j$$
Substituting this into (3),
$$(4)\quad\begin{cases}\max_{\lambda}\ \sum_{i=1}^N\lambda_i-\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j\quad\Leftarrow\ \max_{\lambda}\min_{\omega,b}L(\omega,b,\lambda)\\ \text{s.t. }\lambda_i\geqslant0,\ \sum_{i=1}^N\lambda_iy_i=0\end{cases}$$

(5) Final form of the dual optimization problem

Optimization problems are conventionally stated as minimizations ($\min$):
$$(5)\quad\begin{cases}\min_{\lambda}\frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N\lambda_i\lambda_jy_iy_jx_i^Tx_j-\sum_{i=1}^N\lambda_i\\ \text{s.t. }\lambda_i\geqslant0,\ \sum_{i=1}^N\lambda_iy_i=0\end{cases}$$
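A minimal sketch of solving the dual (5) numerically (again `scipy.optimize.minimize` with SLSQP and the same toy data as in the primal sketch; names are illustrative, and real implementations typically use a dedicated QP solver or SMO):

```python
import numpy as np
from scipy.optimize import minimize

# toy, linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)
K = (y[:, None] * X) @ (y[:, None] * X).T    # K_ij = y_i y_j x_i^T x_j

def dual_objective(lam):
    return 0.5 * lam @ K @ lam - lam.sum()   # (1/2) sum_ij ... - sum_i lambda_i

res = minimize(dual_objective, x0=np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                                    # lambda_i >= 0
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])  # sum_i lambda_i y_i = 0
lam_opt = res.x
print(lam_opt)
```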

5. Solving the dual problem with the KKT conditions

The primal and dual problems satisfy strong duality $\Longleftrightarrow$ the KKT conditions hold (a necessary and sufficient condition for this convex problem).

The Lagrangian (from step 2 above):
$$L(\omega,b,\lambda)=\frac{1}{2}\omega^T\omega+\sum_{i=1}^N\lambda_i\bigl(1-y_i(\omega^Tx_i+b)\bigr),\quad \lambda_i\geqslant0$$

The KKT (Karush-Kuhn-Tucker) conditions:
$$\begin{cases}\frac{\partial L}{\partial \omega}=0,\ \frac{\partial L}{\partial b}=0,\ \frac{\partial L}{\partial \lambda}=0\\ \lambda_i\geqslant0 & \text{required by the Lagrange multiplier method}\\ 1-y_i(\omega^Tx_i+b)\leqslant0 & \text{explained in step 2 above}\\ \lambda_i\bigl(1-y_i(\omega^Tx_i+b)\bigr)=0 & \text{then }L(\omega,b,\lambda)=\frac{1}{2}\omega^T\omega\text{ is the maximum over }\lambda\text{; complementary slackness}\end{cases}$$

What is complementary slackness?
Complementary slackness is part of the KKT conditions, which are a necessary condition for strong duality; for a convex problem such as this one, the KKT conditions are also sufficient.
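As a numerical illustration of complementary slackness (a sketch only; `lam`, `w`, `b` are assumed to come from solving the dual and recovering the primal variables as derived below):

```python
import numpy as np

def check_complementary_slackness(lam, w, b, X, y, tol=1e-6):
    """Each product lambda_i * (1 - y_i (w^T x_i + b)) should be ~0:
    either lambda_i = 0 (non-support vector) or the constraint is active."""
    slack = 1.0 - y * (X @ w + b)
    return np.all(np.abs(lam * slack) < tol)
```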

(1) Optimal solution $\omega^*=\sum_{i=1}^N\lambda_iy_ix_i$

This is exactly the result of setting $\frac{\partial L}{\partial \omega}=0$ in step 4(3) above.

(2) Optimal solution $b^*=y_k-\sum_{i=1}^N\lambda_iy_ix_i^Tx_k$

Suppose $\exists\,(x_k,y_k)$ such that $1-y_k(\omega^Tx_k+b)=0$; such an $(x_k,y_k)$ is a support vector, and $\omega^Tx_k+b\in\{-1,+1\}$.
$$\begin{aligned}&\text{From } y_k(\omega^Tx_k+b)=1\\&\because y_k=\pm1,\ y_k^2=1\\&\therefore y_k^2(\omega^Tx_k+b)=y_k\\&\therefore \omega^Tx_k+b=y_k\\&\therefore b^*=y_k-(\omega^*)^Tx_k=y_k-\sum_{i=1}^N\lambda_iy_ix_i^Tx_k\end{aligned}$$

(3) The separating hyperplane $(w^*)^Tx+b^*=0$ obtained from $w^*,b^*$

1. $f(x)=\mathrm{sign}((w^*)^Tx+b^*)$ (see the sketch after this list)
2. $w^*=\sum_{i=1}^N\lambda_iy_ix_i$ can be viewed as a linear combination of the training samples $Data=\{(x_i,y_i)\}_{i=1}^N,\ x_i\in\mathbb{R}^p,\ y_i\in\{-1,+1\}$.
3. $\lambda_i$ can be nonzero only for the support vectors, i.e., the points with $1-y_i(\omega^Tx_i+b)=0$; for non-support vectors $\lambda_i=0$, so they contribute nothing to $w^*$.
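Putting (1)–(3) together, a minimal sketch (it assumes `lam_opt`, `X`, `y` from the dual sketch after (5); the threshold `1e-6` for identifying support vectors is an arbitrary choice):

```python
import numpy as np

def recover_primal(lam, X, y, tol=1e-6):
    """w* = sum_i lambda_i y_i x_i;  b* = y_k - w*^T x_k for any support vector k (lambda_k > 0)."""
    w = (lam * y) @ X
    k = np.flatnonzero(lam > tol)[0]   # index of one support vector
    b = y[k] - X[k] @ w
    return w, b

def predict(X_new, w, b):
    return np.sign(X_new @ w + b)      # f(x) = sign((w*)^T x + b*)

# usage (lam_opt, X, y from the dual sketch above):
# w_star, b_star = recover_primal(lam_opt, X, y)
# print(predict(X, w_star, b_star))
```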