Line Search and Quasi-Newton Methods

Date: 2019-11-20


Gradient Descent

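Parameter estimation for many machine learning models relies on an optimization algorithm, and gradient descent is among the simplest and most widely used. Gradient descent [3], also known as steepest descent, can be used to find a local minimum of a function. The idea is that a function decreases fastest in the direction opposite to its gradient, so moving a sufficiently small distance along the negative gradient leads to a new point whose function value is guaranteed not to increase, as shown in Figure 1.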

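The idea of gradient descent can be stated mathematically as follows:

$$b = a - \alpha \nabla f(a) \;\Rightarrow\; f(a) \geq f(b) \tag{1}$$

where $a$ is the current point and $\alpha > 0$ is a sufficiently small step size. Starting from an initial point $x_0$ and applying this rule repeatedly gives

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad 0 \leq k \leq n \tag{2}$$

so that the function values form a non-increasing sequence:

$$f(x_0) \geq f(x_1) \geq f(x_2) \geq \dots \geq f(x_n) \tag{3}$$

More generally, $d_k$ is a descent direction at $x_k$ if a sufficiently small step $\alpha > 0$ along it decreases the function value:

$$f(x_k + \alpha d_k) < f(x_k) \tag{4}$$

A direction of the form

$$d_k = -B_k \nabla f(x_k) \tag{5}$$

is a descent direction whenever $B_k$ is symmetric positive definite; $B_k = I$ recovers gradient descent.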
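The iteration in equation (2) can be sketched in a few lines. This is a minimal illustration with a fixed step size, not the author's code; the function name `gradient_descent` and the test function $f(x) = x_1^2 + 10 x_2^2$ are assumptions chosen only to make the example concrete.

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, n_iter=100):
    """Fixed-step gradient descent: x_{k+1} = x_k - alpha * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_iter):
        x = x - alpha * grad(x)   # move against the gradient
        path.append(x.copy())
    return x, path

# Hypothetical test problem: f(x) = x1^2 + 10*x2^2, minimized at the origin
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

x_min, path = gradient_descent(grad_f, [3.0, -2.0], alpha=0.05, n_iter=200)
values = [f(p) for p in path]     # f(x_0), f(x_1), ..., as in equation (3)
```

With a step this small the recorded values never increase, which matches equation (3). A fixed step that is too large would break that property, which is the motivation for choosing the step by line search.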

Line Search

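Given a search direction $d_k$, line search looks for the step size $\alpha$ along that direction that minimizes $h(\alpha) = f(x_k + \alpha d_k)$:

$$\alpha = \arg\min_{\alpha \geq 0} h(\alpha) = \arg\min_{\alpha \geq 0} f(x_k + \alpha d_k) \tag{6}$$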

Bisection Search

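Bisection line search [2] can be used to find a root of a function. The idea is simple: repeatedly split the current interval in half and keep the half that must contain a point where $h'(\alpha) = 0$. If the initial interval has length $\hat{\alpha}$, then after $n$ halvings the interval length is

$$L = \left(\frac{1}{2}\right)^n \hat{\alpha} \tag{7}$$

Requiring the final interval to be no longer than a tolerance $\epsilon$ bounds the number of iterations needed:

$$L \leq \epsilon \;\Rightarrow\; n \geq \log_2\left(\frac{\hat{\alpha}}{\epsilon}\right), \quad \text{so at most } \left\lceil \log_2\left(\frac{\hat{\alpha}}{\epsilon}\right) \right\rceil \text{ iterations are needed} \tag{8}$$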

import numpy as np

def bisection(dfun,theta,args,d,low,high,maxiter=1e4):
    """
    #Functionality:find the root of h'(a)=dfun(theta+a*d)^T d in the interval [low,high]
    #@Parameters
    #dfun:compute the gradient of function f(x)
    #theta:parameters of the model
    #args:other variables needed to compute the value of dfun
    #d:search direction
    #[low,high]:the interval which contains the root
    #maxiter:the max number of iterations
    #Note:np.sum(g*d.T) gives g^T d for 1-D arrays or np.matrix row vectors
    """
    eps=1e-6
    val_low=np.sum(dfun(theta+low*d,args)*d.T) #h'(low)
    val_high=np.sum(dfun(theta+high*d,args)*d.T) #h'(high)
    if val_low*val_high>0: #no sign change,so no root is bracketed
        raise Exception('Invalid interval!')
    iter_num=1
    mid=(low+high)/2
    while iter_num<maxiter:
        mid=(low+high)/2
        val_mid=np.sum(dfun(theta+mid*d,args)*d.T) #h'(mid)
        if abs(val_mid)<eps or abs(high-low)<eps:
            return mid
        elif val_mid*val_low>0: #root lies in the right half
            low=mid
        else: #root lies in the left half
            high=mid
        iter_num+=1
    return mid #iteration budget exhausted;return the last midpoint
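The iteration bound in equation (8) can be checked numerically. The sketch below is a self-contained one-dimensional bisection on $h'(\alpha)$ that counts halvings; the helper name `bisect_root` and the example $h(\alpha) = (\alpha - 0.3)^2$ are assumptions made for illustration only.

```python
import math

def bisect_root(dh, low, high, eps=1e-6):
    """Bisection on [low, high], where dh changes sign.
    Returns the midpoint estimate and the number of halvings performed."""
    if dh(low) * dh(high) > 0:
        raise ValueError("dh must change sign on [low, high]")
    n = 0
    while high - low > eps:
        mid = 0.5 * (low + high)
        if dh(mid) * dh(low) > 0:   # same sign as the left end: root is to the right
            low = mid
        else:                       # sign change (or exact zero): root is to the left
            high = mid
        n += 1
    return 0.5 * (low + high), n

# Hypothetical example: h(a) = (a - 0.3)^2, so h'(a) = 2*(a - 0.3)
dh = lambda a: 2.0 * (a - 0.3)
alpha_hat, eps = 1.0, 1e-6
root, n = bisect_root(dh, 0.0, alpha_hat, eps)
bound = math.ceil(math.log2(alpha_hat / eps))   # right-hand side of equation (8)
```

Here the number of halvings `n` does not exceed `bound`, and `root` lies within `eps` of the true stationary point 0.3.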

Backtracking

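Backtracking line search [1] uses the Armijo condition to compute the largest acceptable step size along the search direction. The basic idea is to start from a relatively large estimate of the step size and then shrink it iteratively until the step gives a sufficient decrease of the function value, that is, until the Armijo condition holds:

$$f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha f'(x_k)^T d_k \tag{9}$$

Such a step always exists. With $h(\alpha) = f(x_k + \alpha d_k)$, a descent direction $d_k$ and $0 < c_1 < 1$, we have

$$h'(0) < c_1 h'(0) < 0 \tag{10}$$

By the definition of the derivative,

$$h'(0) = \lim_{\alpha \to 0} \frac{h(\alpha) - h(0)}{\alpha} = \lim_{\alpha \to 0} \frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} = f'(x_k)^T d_k \tag{11}$$

so for all sufficiently small $\alpha > 0$

$$\frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} < c_1 f'(x_k)^T d_k \tag{12}$$

which is exactly the Armijo condition (9).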

import numpy as np

def ArmijoBacktrack(fun,dfun,theta,args,d,stepsize=1,tau=0.5,c1=1e-3):
    """
    #Functionality:find an acceptable stepsize via backtrack under Armijo rule
    #@Parameters
    #fun:compute the value of objective function
    #dfun:compute the gradient of objective function
    #theta:a vector of parameters of the model
    #args:other variables needed by fun and dfun
    #d:search direction (must be a descent direction,otherwise the loop may not terminate)
    #stepsize:initial step size
    #c1:sufficient decrease parameter
    #tau:rate of shrink of stepsize
    """
    slope=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    obj_old=fun(theta,args) #h(0)=f(x)
    theta_new=theta+stepsize*d
    obj_new=fun(theta_new,args)
    while obj_new>obj_old+c1*stepsize*slope: #Armijo condition violated,shrink the step
        stepsize*=tau
        theta_new=theta+stepsize*d
        obj_new=fun(theta_new,args)
    return stepsize
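To see the backtracking rule act on concrete numbers, the following self-contained sketch applies condition (9) to $f(x) = \|x\|^2$ along the steepest descent direction. The helper `armijo_ok`, the starting point and the deliberately oversized initial step are assumptions chosen for illustration.

```python
import numpy as np

def armijo_ok(f, x, d, slope, alpha, c1=1e-3):
    """True if step alpha satisfies the Armijo condition (9)."""
    return f(x + alpha * d) <= f(x) + c1 * alpha * slope

f = lambda x: float(np.dot(x, x))      # f(x) = ||x||^2
grad = lambda x: 2.0 * x

x = np.array([1.0, 1.0])
d = -grad(x)                           # steepest descent direction
slope = float(np.dot(grad(x), d))      # h'(0) = f'(x)^T d = -8

alpha, tau, rejected = 4.0, 0.5, []
while not armijo_ok(f, x, d, slope, alpha):
    rejected.append(alpha)             # record steps that were too long
    alpha *= tau                       # shrink and try again
```

The steps 4, 2 and 1 are rejected because they overshoot the minimizer, and the search stops at `alpha = 0.5`, which lands exactly on the minimum of this particular function.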

Interpolation

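The convergence speed of backtracking line search under the Armijo condition is not guaranteed, particularly when many backtracking steps are needed before the step falls into the region where the Armijo condition holds. If we instead fit a polynomial to the function using the function values and derivatives already computed (interpolation) [12,6,5,9], and estimate the minimizer from that polynomial, a suitable step size can be found much more efficiently. Suppose we only have $h(0)$, $h'(0)$ and $h(\alpha_0)$ for an initial step $\alpha_0$. These three values determine a quadratic interpolant:

$$h_q(\alpha) = \left(\frac{h(\alpha_0) - h(0) - \alpha_0 h'(0)}{\alpha_0^2}\right)\alpha^2 + h'(0)\alpha + h(0) \tag{13}$$

whose stationary point is

$$\alpha_1 = \frac{h'(0)\,\alpha_0^2}{2\left[h(0) + h'(0)\alpha_0 - h(\alpha_0)\right]} \tag{14}$$

With the function values at two earlier trial steps $\alpha_{i-1}$ and $\alpha_i$, together with $h(0)$ and $h'(0)$, we can fit a cubic:

$$h_c(\alpha) = a\alpha^3 + b\alpha^2 + h'(0)\alpha + h(0) \tag{15}$$

$$\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{\alpha_{i-1}^2 \alpha_i^2 (\alpha_i - \alpha_{i-1})} \begin{bmatrix} \alpha_{i-1}^2 & -\alpha_i^2 \\ -\alpha_{i-1}^3 & \alpha_i^3 \end{bmatrix} \begin{bmatrix} h(\alpha_i) - h(0) - h'(0)\alpha_i \\ h(\alpha_{i-1}) - h(0) - h'(0)\alpha_{i-1} \end{bmatrix} \tag{16}$$

$$\alpha_{i+1} = \frac{-b + \sqrt{b^2 - 3ah'(0)}}{3a} \tag{17}$$

If the derivatives at the two trial steps are also available, we can use the cubic Hermite interpolant instead:

$$H_3(\alpha) = \left[1 + 2\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]\left[\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]^2 h(\alpha_i) + \left[1 + 2\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]\left[\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]^2 h(\alpha_{i-1}) + (\alpha - \alpha_i)\left[\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]^2 h'(\alpha_i) + (\alpha - \alpha_{i-1})\left[\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]^2 h'(\alpha_{i-1}) \tag{18}$$

whose minimizer is

$$\alpha_{i+1} = \alpha_i - (\alpha_i - \alpha_{i-1})\left[\frac{h'(\alpha_i) + d_2 - d_1}{h'(\alpha_i) - h'(\alpha_{i-1}) + 2d_2}\right] \tag{19}$$

$$d_1 = h'(\alpha_i) + h'(\alpha_{i-1}) - 3\,\frac{h(\alpha_i) - h(\alpha_{i-1})}{\alpha_i - \alpha_{i-1}} \tag{20}$$

$$d_2 = \mathrm{sign}(\alpha_i - \alpha_{i-1})\sqrt{d_1^2 - h'(\alpha_{i-1})\,h'(\alpha_i)} \tag{21}$$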

def quadraticInterpolation(a,h,h0,g0):
    """
    #Functionality:Approximate h(a) with a quadratic function and return its stationary point
    #@Parameters
    #a:current stepsize
    #h:function value at the current stepsize,h(a)=f(x_k+a*d)
    #h0:h(0)=f(x_k)
    #g0:h'(0)=f'(x_k)^Td
    """
    numerator=g0*a**2
    denominator=2*(g0*a+h0-h)
    if abs(denominator)<1e-12:#degenerate fit (e.g. a is almost 0),keep the current stepsize
        return a
    return numerator/denominator

import numpy as np

def cubicInterpolation(a0,h0,a1,h1,h,g):
    """
    #Functionality:Approximate h(x) with a cubic function and return its stationary point
    #This version of cubic interpolation computes h'(x) as rarely as possible,suitable for the case in which computing derivatives is more expensive than computing function values
    #@Parameters
    #a0 and a1 are the stepsizes of the previous two iterations
    #h0:h(a0)
    #h1:h(a1)
    #h:h(0)=f(x)
    #g:h'(0)
    """
    #plain arrays replace the original matlib.matrix objects;the arithmetic is unchanged
    mat=np.array([[a0**2,-a1**2],[-a0**3,a1**3]])
    vec=np.array([h1-h-g*a1,h0-h-g*a0])
    a,b=mat@vec/(a0**2*a1**2*(a1-a0)) #cubic and quadratic coefficients
    if abs(a)<1e-12:#a=0 and cubic function is a quadratic one
        return -g/(2*b)
    return (-b+np.sqrt(b**2-3*a*g))/(3*a)

import numpy as np

def cubicInterpolationHermite(a0,h0,g0,a1,h1,g1):
    """
    #Functionality:Approximate h(a) with a cubic Hermite polynomial function and return its stationary point
    #This version of cubic interpolation computes h(a) as rarely as possible,suitable for the case in which computing derivatives is easier than computing function values
    #@Parameters
    #a0 and a1 are the stepsizes of the previous two iterations
    #h0:h(a0)
    #g0:h'(a0)
    #h1:h(a1)
    #g1:h'(a1)
    """
    d1=g0+g1-3*(h1-h0)/(a1-a0)
    d2=np.sign(a1-a0)*np.sqrt(d1**2-g0*g1) #NaN if d1^2<g0*g1;callers should safeguard
    res=a1-(a1-a0)*(g1+d2-d1)/(g1-g0+2*d2) #numerator is g1+d2-d1
    return res
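A quick way to gain confidence in the interpolation formulas is to apply them to functions whose minimizers are known. The sketch below is self-contained and independent of the listings above: `quadratic_step` and `hermite_step` are illustrative names, and the two test functions are assumptions. A quadratic fit should recover the minimizer of a quadratic exactly, and a cubic Hermite fit should recover the local minimizer of a cubic.

```python
import math

def quadratic_step(a0, h_a0, h0, g0):
    """Stationary point of the quadratic through h(0)=h0, h'(0)=g0, h(a0)=h_a0."""
    return g0 * a0 ** 2 / (2.0 * (h0 + g0 * a0 - h_a0))

def hermite_step(a0, h0, g0, a1, h1, g1):
    """Minimizer of the cubic Hermite interpolant through (a0,h0,g0) and (a1,h1,g1)."""
    d1 = g0 + g1 - 3.0 * (h1 - h0) / (a1 - a0)
    d2 = math.copysign(math.sqrt(d1 ** 2 - g0 * g1), a1 - a0)
    return a1 - (a1 - a0) * (g1 + d2 - d1) / (g1 - g0 + 2.0 * d2)

# Hypothetical quadratic slice: h(a) = (a - 0.25)^2 + 1, minimized at a = 0.25
h = lambda a: (a - 0.25) ** 2 + 1.0
a_quad = quadratic_step(1.0, h(1.0), h(0.0), -0.5)   # h'(0) = -0.5

# Hypothetical cubic slice: c(a) = a^3 - a, local minimum at a = 1/sqrt(3)
c = lambda a: a ** 3 - a
dc = lambda a: 3.0 * a ** 2 - 1.0
a_cubic = hermite_step(0.0, c(0.0), dc(0.0), 1.0, c(1.0), dc(1.0))
```

Both estimates agree with the true minimizers up to rounding, because in each case the interpolant has the same degree as the function being fitted. On general functions the estimate is only approximate, which is why the line search that uses it needs a safeguard.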

The line search algorithm based on the Armijo condition is described in [4]. The corresponding Python code is:

import numpy as np

def ArmijoLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-3,a_min=1e-7,max_iter=1e5):
    """
    #Functionality:Line search under Armijo condition with quadratic and cubic interpolation
    #@Parameters
    #fun:objective function
    #dfun:compute the gradient of fun
    #theta:a vector of parameters of the model
    #args:other variables needed for fun and dfun
    #d:search direction
    #a0:initial stepsize
    #c1:constant used in Armijo condition
    #a_min:minimum of stepsize
    #max_iter:maximum of the number of iterations
    """
    eps=1e-6
    c1=min(c1,0.5) #c1 should be <=0.5
    a_pre=h_pre=g_pre=0
    a_cur=a0
    f_val=fun(theta,args) #h(0)=f(x)
    g_val=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    k=0
    while a_cur>a_min and k<max_iter:
        h_cur=fun(theta+a_cur*d,args)
        g_cur=np.sum(dfun(theta+a_cur*d,args)*d.T)
        if h_cur<=f_val+c1*a_cur*g_val: #meet Armijo condition
            return a_cur
        if not k: #k=0,use quadratic interpolation
            a_new=quadraticInterpolation(a_cur,h_cur,f_val,g_val)
        else: #k>0,use cubic Hermite interpolation
            a_new=cubicInterpolationHermite(a_pre,h_pre,g_pre,a_cur,h_cur,g_cur)
        #safeguard procedure:reject NaN,non-positive or stagnating steps and halve instead
        if not np.isfinite(a_new) or a_new<eps or abs(a_new-a_cur)<eps:
            a_new=a_cur/2
        a_pre=a_cur
        a_cur=a_new
        h_pre=h_cur
        g_pre=g_cur
        k+=1
    return a_min #failed search

Wolfe Search

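As noted earlier, the Armijo condition alone (without a backtracking strategy) may accept step sizes that are too small. To rule out these tiny steps, we add a curvature condition (see Figure 5):

$$h'(\alpha) = f'(x_k + \alpha d_k)^T d_k \geq c_2 f'(x_k)^T d_k \tag{22}$$

with $0 < c_1 < c_2 < 1$. The Armijo condition and the curvature condition together form the Wolfe conditions:

$$\begin{cases} f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha f'(x_k)^T d_k \\ f'(x_k + \alpha d_k)^T d_k \geq c_2 f'(x_k)^T d_k \end{cases} \tag{23}$$

Bounding the slope on both sides gives the strong Wolfe conditions:

$$\begin{cases} f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha f'(x_k)^T d_k \\ |f'(x_k + \alpha d_k)^T d_k| \leq c_2 |f'(x_k)^T d_k| \end{cases} \tag{24}$$

A step size satisfying the Wolfe conditions exists whenever $f$ is bounded below along the search direction. In that case the line $f(x_k) + \alpha c_1 f'(x_k)^T d_k$ must intersect $h(\alpha)$; let $\alpha' > 0$ be the smallest intersection point:

$$f(x_k + \alpha' d_k) = f(x_k) + \alpha' c_1 f'(x_k)^T d_k \tag{25}$$

By the mean value theorem there is an $\alpha'' \in (0, \alpha')$ such that

$$f(x_k + \alpha' d_k) - f(x_k) = \alpha' f'(x_k + \alpha'' d_k)^T d_k \tag{26}$$

Combining (25) and (26), and using $c_1 < c_2$ and $f'(x_k)^T d_k < 0$:

$$f'(x_k + \alpha'' d_k)^T d_k = c_1 f'(x_k)^T d_k > c_2 f'(x_k)^T d_k \tag{27}$$

So $\alpha''$ satisfies the curvature condition, and since $\alpha'' < \alpha'$ it satisfies the Armijo condition as well.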

The behavior of Algorithm 5 is easiest to understand together with Figure 7, in which the relevant points are marked in red and green.

import numpy as np

def WolfeLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-4,c2=0.9,a_min=1e-7,max_iter=1e5):
    """
    #Functionality:find a stepsize meeting Wolfe condition
    #@Parameters
    #fun:objective function
    #dfun:compute the gradient of fun
    #theta:a vector of parameters of the model
    #args:other variables needed for fun and dfun
    #d:search direction
    #a0:initial stepsize
    #c1:constant used in Armijo condition
    #c2:constant used in curvature condition
    #a_min:minimum of stepsize
    #max_iter:maximum of the number of iterations
    """
    eps=1e-16
    c1=min(c1,0.5)
    a_pre=0
    a_cur=a0
    f_val=fun(theta,args) #h(0)=f(x)
    g_val=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    h_pre=f_val #h(a_pre) with a_pre=0
    k=0
    while k<max_iter and abs(a_cur-a_pre)>=eps:
        h_cur=fun(theta+a_cur*d,args) #f(x+ad)
        #Armijo violated or function value stopped decreasing:[a_pre,a_cur] brackets a good step
        if h_cur>f_val+c1*a_cur*g_val or (h_cur>=h_pre and k>0):
            return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2)
        g_cur=np.sum(dfun(theta+a_cur*d,args)*d.T)
        if abs(g_cur)<=-c2*g_val:#satisfy strong Wolfe condition
            return a_cur
        if g_cur>=0: #slope turned non-negative,a minimizer lies in [a_pre,a_cur]
            return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2)
        a_new=quadraticInterpolation(a_cur,h_cur,f_val,g_val)
        if not np.isfinite(a_new) or a_new<=a_cur: #keep trial steps increasing so that zoom receives a_pre<a_cur
            a_new=2*a_cur
        a_pre=a_cur
        a_cur=a_new
        h_pre=h_cur
        k+=1
    return a_min

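The zoom procedure is described in Algorithm 6. It takes a search interval $[\alpha_{low}, \alpha_{high}]$ that contains a step size satisfying the Wolfe conditions.

The corresponding Python code for zoom is: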

import numpy as np

def zoom(fun,dfun,theta,args,d,a_low,a_high,c1=1e-3,c2=0.9,max_iter=1e4):
    """
    #Functionality:shrink the interval by bisection to find a stepsize meeting Wolfe condition
    #@Parameters
    #fun:objective function
    #dfun:compute the gradient of fun
    #theta:a vector of parameters of the model
    #args:other variables needed for fun and dfun
    #d:search direction
    #[a_low,a_high]:interval containing a stepsize satisfying Wolfe condition
    #c1:constant used in Armijo condition
    #c2:constant used in curvature condition
    #max_iter:maximum of the number of iterations
    """
    if a_low>a_high:
        print('low:%f,high:%f'%(a_low,a_high))
        raise Exception('Invalid interval of stepsize in zoom procedure')
    eps=1e-16
    h=fun(theta,args) #h(0)=f(x)
    g=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    k=0
    h_low=fun(theta+a_low*d,args)
    h_high=fun(theta+a_high*d,args)
    if h_low>h+c1*a_low*g:
        raise Exception('Left endpoint violates Armijo condition in zoom procedure')
    while k<max_iter and abs(a_high-a_low)>=eps:
        a_new=(a_low+a_high)/2 #bisect the current interval
        h_new=fun(theta+a_new*d,args)
        if h_new>h+c1*a_new*g or h_new>h_low: #midpoint is too far,move the right end
            a_high=a_new
            h_high=h_new
        else:
            g_new=np.sum(dfun(theta+a_new*d,args)*d.T)
            if abs(g_new)<=-c2*g: #satisfy strong Wolfe condition
                return a_new
            if g_new*(a_high-a_low)>=0: #slope is non-negative,minimizer is to the left
                a_high=a_new
                h_high=h_new
            else: #slope is still negative,minimizer is to the right
                a_low=a_new
                h_low=h_new
        k+=1
    return a_low #a_low always satisfies the Armijo condition
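The role of each of the two Wolfe conditions is easier to see on a concrete function. The self-contained sketch below evaluates the strong Wolfe conditions (24) for three candidate steps on $f(x) = \|x\|^2$; the helper name `strong_wolfe` and the test problem are assumptions for illustration and are not part of the listings above.

```python
import numpy as np

def strong_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) flags for step alpha, as in (24)."""
    slope0 = float(np.dot(grad(x), d))                 # h'(0)
    x_new = x + alpha * d
    decrease = f(x_new) <= f(x) + c1 * alpha * slope0  # Armijo condition
    curvature = abs(float(np.dot(grad(x_new), d))) <= c2 * abs(slope0)
    return bool(decrease), bool(curvature)

f = lambda x: float(np.dot(x, x))   # f(x) = ||x||^2
grad = lambda x: 2.0 * x
x = np.array([1.0, 1.0])
d = -grad(x)                        # steepest descent direction

tiny = strong_wolfe(f, grad, x, d, 1e-3)   # very short step
good = strong_wolfe(f, grad, x, d, 0.5)    # step to the exact minimizer
huge = strong_wolfe(f, grad, x, d, 2.0)    # step that overshoots
```

The tiny step passes the Armijo test but fails the curvature test, which is exactly the case the curvature condition is meant to exclude. The step of 0.5 passes both, and the overshooting step fails both.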

Newton's Method

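Newton's method [8] finds roots of a function iteratively. The basic idea is to start from an initial point and repeatedly approximate the function at the current point $x_k$ by its second-order Taylor expansion:

$$f(x_k + \Delta x) \approx f(x_k) + f'(x_k)^T \Delta x + \frac{1}{2} \Delta x^T B_k \Delta x \tag{28}$$

where $B_k$ is the Hessian of $f$ at $x_k$. Differentiating this model gives

$$f'(x_{k+1}) \approx f'(x_k) + B_k (x_{k+1} - x_k) \tag{29}$$

and setting $f'(x_{k+1}) = 0$ yields the update

$$x_{k+1} = x_k - B_k^{-1} f'(x_k) \tag{30}$$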
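The update in equation (30) can be sketched as follows. This is a minimal illustration without any line search or safeguard, so it assumes the Hessian stays positive definite along the way; the name `newton_minimize` and the test function $f(x) = x_1^4 + x_1^2 + (x_2 - 1)^2$ are assumptions for the example.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration x_{k+1} = x_k - B_k^{-1} f'(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:             # gradient is (numerically) zero
            return x, k
        x = x - np.linalg.solve(hess(x), g)     # solve B_k * step = g, avoid explicit inverse
    return x, max_iter

# Hypothetical test function: f(x) = x1^4 + x1^2 + (x2 - 1)^2, minimized at (0, 1)
grad = lambda x: np.array([4 * x[0] ** 3 + 2 * x[0], 2 * (x[1] - 1)])
hess = lambda x: np.array([[12 * x[0] ** 2 + 2, 0.0], [0.0, 2.0]])

x_star, iters = newton_minimize(grad, hess, [1.0, -2.0])
```

Solving the linear system with `np.linalg.solve` rather than forming $B_k^{-1}$ is the usual choice, but each iteration still needs the full Hessian, which is the cost that quasi-Newton methods try to avoid.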

Quasi-Newton Method

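Quasi-Newton methods [11] can be used to find local optima of a function, that is, stationary points where the derivative is zero. When Newton's method is applied to an optimization problem, it assumes that the function can be approximated by a quadratic and then uses the first and second derivatives to locate a local optimum. Quasi-Newton methods do not need to compute the Hessian exactly. Instead, they apply the quasi-Newton condition below to two successive gradient vectors to obtain an approximating matrix $B_{k+1}$:

$$f'(x_{k+1}) - f'(x_k) \approx B_{k+1} (x_{k+1} - x_k) \tag{31}$$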

import numpy as np
from numpy import matlib

def BFGS(fun,dfun,theta,args,H=None,mode=0,eps=1e-12,max_iter=1e4):
    """
    #Functionality:find the minimum of objective function f(x)
    #@Parameters
    #fun:objective function f(x)
    #dfun:compute the gradient of f(x)
    #args:parameters needed by fun and dfun
    #theta:start vector of parameters of the model (the products below assume np.matrix row vectors)
    #H:initial inverse Hessian approximation
    #mode:index of line search algorithm
    """
    x_pre=x_cur=theta
    g=dfun(x_cur,args)
    if H is None:#initialize H as an identity matrix
        H=matlib.eye(theta.size)
    k=0
    while k<max_iter and np.sum(np.abs(g))>eps:
        d=-g*H #search direction
        #LineSearch is not defined in this post;it is assumed to dispatch to one of the line search routines above according to mode
        step=LineSearch(fun,dfun,x_pre,args,d,1,mode)
        x_cur=x_pre+step*d
        s=step*d #s_k=x_{k+1}-x_k
        g_new=dfun(x_cur,args)
        y=g_new-g #y_k=f'(x_{k+1})-f'(x_k)
        ys=np.sum(y*s.T)
        if abs(ys)<eps: #curvature y^Ts is too small to update H safely
            return x_cur
        change=(ys+np.sum(y*H*y.T))*(s.T*s)/(ys**2)-(H*y.T*s+s.T*y*H)/ys
        H=H+change #BFGS update of the inverse Hessian approximation
        g=g_new
        x_pre=x_cur
        k+=1
    return x_cur
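The update of the inverse Hessian approximation can be checked against the quasi-Newton condition: in terms of $H_{k+1} = B_{k+1}^{-1}$, condition (31) reads $H_{k+1} y_k = s_k$. The sketch below is a self-contained version of the same update written with plain NumPy arrays (column-vector convention); the function name `bfgs_inverse_update` and the quadratic test data are assumptions for illustration.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update of a symmetric inverse Hessian approximation H,
    with s = x_{k+1} - x_k and y = f'(x_{k+1}) - f'(x_k) as 1-D arrays."""
    ys = float(y @ s)                 # curvature y^T s, must be positive
    Hy = H @ y
    return (H
            + (ys + float(y @ Hy)) * np.outer(s, s) / ys ** 2
            - (np.outer(Hy, s) + np.outer(s, Hy)) / ys)

# Hypothetical quadratic with Hessian A, so that y = A s exactly
rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
s = rng.standard_normal(2)
y = A @ s

H_new = bfgs_inverse_update(np.eye(2), s, y)
```

After a single update `H_new @ y` reproduces `s`, and `H_new` stays symmetric. It does not yet equal the true inverse Hessian; the approximation only improves as more pairs $(s_k, y_k)$ are absorbed.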

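Next we look at how to construct the L-BFGS algorithm [10,13]. Suppose we are at iteration $k$ of the optimization. Writing $s_k = x_{k+1} - x_k$, $y_k = f'(x_{k+1}) - f'(x_k)$, $\rho_k = 1/(y_k^T s_k)$ and $V_k = I - \rho_k y_k s_k^T$, the BFGS update of the inverse Hessian approximation $H_k$ applied to the gradient $g_k$ gives

$$H_k g_k = V_{k-1}^T H_{k-1} V_{k-1} g_k + \rho_{k-1} s_{k-1} s_{k-1}^T g_k \tag{32}$$

Define, with $q_0 = g_k$,

$$q_i = (V_{k-i} \dots V_{k-2} V_{k-1})\, g_k \tag{33}$$

$$a_i = \rho_{k-i}\, s_{k-i}^T q_{i-1} \tag{34}$$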
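The quantities $q_i$ and $a_i$ are what the standard L-BFGS two-loop recursion computes on the way to $H_k g_k$. The sketch below is a minimal version under stated assumptions: the initial matrix is taken to be the identity, the stored pairs are passed oldest first, and the name `lbfgs_direction` is hypothetical. It is compared against the explicit recursion $H \leftarrow V^T H V + \rho s s^T$ on random data.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion returning H_k g from stored (s, y) pairs
    (oldest first), assuming the initial matrix H_0 = I."""
    rhos = [1.0 / float(y @ s) for s, y in zip(s_list, y_list)]
    q = g.copy()                                  # q_0 = g_k
    alphas = []
    # First loop, newest pair to oldest: a_i = rho * s^T q_{i-1}, q_i = V q_{i-1}
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * float(s @ q)
        alphas.append(a)
        q = q - a * y
    r = q                                         # apply H_0 = I
    # Second loop, oldest pair to newest
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * float(y @ r)
        r = r + (a - b) * s
    return r

# Hypothetical data: y = A s for a positive definite A keeps y^T s > 0
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 5.0])
s_list = [rng.standard_normal(3) for _ in range(2)]
y_list = [A @ s for s in s_list]
g = rng.standard_normal(3)

# Reference: build H explicitly with H <- V^T H V + rho * s s^T
H = np.eye(3)
for s, y in zip(s_list, y_list):
    rho = 1.0 / float(y @ s)
    V = np.eye(3) - rho * np.outer(y, s)
    H = V.T @ H @ V + rho * np.outer(s, s)

r = lbfgs_direction(g, s_list, y_list)
```

The two results agree, but the recursion never forms an $n \times n$ matrix: it only needs the stored vectors, which is what makes L-BFGS practical for high-dimensional problems.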