The general form of Taylor's formula is given in equation (1):
$$\begin{equation}f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}\tag{1}\end{equation}$$
In particular, the first-order Taylor expansion is
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)\tag{2}\end{equation}$$
and the second-order Taylor expansion is
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)+f^{''}(x_0)\frac{(x-x_0)^2}{2}\tag{3}\end{equation}$$
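As a quick worked example of equation (3), expanding $f(x)=e^x$ around $x_0=0$ (where $f(0)=f^{'}(0)=f^{''}(0)=1$) gives
$$e^x \approx 1 + x + \frac{x^2}{2},$$
which at $x=0.1$ evaluates to $1.105$, close to the true value $e^{0.1}\approx 1.10517$.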
Let $x^t = x^{t-1}+\Delta x$. Then the second-order Taylor expansion of $f(x^t)$ can be written as
$$\begin{equation}f(x^t)=f(x^{t-1}+\Delta x)\approx f(x^{t-1})+f^{'}(x^{t-1})\Delta x+f^{''}(x^{t-1})\frac{\Delta x^2}{2}\tag{4}\end{equation}$$
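To see how accurate approximation (4) is in practice, here is a minimal numerical sketch; the test function $f(x)=\ln x$, the point $x^{t-1}=2$, and the step $\Delta x=0.3$ are arbitrary choices for illustration:

```python
import numpy as np

# f(x) = ln(x), with its exact first and second derivatives
f   = np.log
df  = lambda x: 1.0 / x        # f'(x)
d2f = lambda x: -1.0 / x**2    # f''(x)

x_prev, dx = 2.0, 0.3          # x^{t-1} and the step Δx (demo values)
exact  = f(x_prev + dx)
first  = f(x_prev) + df(x_prev) * dx             # first-order, equation (2)
second = first + d2f(x_prev) * dx**2 / 2         # second-order, equation (4)

print(f"exact={exact:.6f}  1st-order={first:.6f}  2nd-order={second:.6f}")
# The second-order approximation lies noticeably closer to the exact value.
```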
In machine learning tasks we need to minimize a loss function $L(\theta)$, where $\theta$ is the model parameter to be solved for. Gradient descent is commonly used for this kind of unconstrained optimization problem. It is an iterative algorithm: first assign an initial value to $\theta^0$, then in each iteration update $\theta^t$ so that the loss $L(\theta^t)$ decreases.
$$\begin{equation}\theta^t = \theta^{t-1}+\Delta \theta\tag{5}\end{equation}$$
Expanding $L(\theta^t)$ to first order around $\theta^{t-1}$:
$$\begin{equation}L(\theta^t)=L(\theta^{t-1}+\Delta \theta)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta\tag{6}\end{equation}$$
In each iteration, gradient descent must make the loss $L$ smaller, i.e., it requires $L(\theta^t)<L(\theta^{t-1})$. Combining this with equation (6), we can take $\Delta \theta = -\alpha L^{'}(\theta^{t-1})$, which gives the update rule for $\theta$:
$$\begin{equation}\theta^t = \theta^{t-1}-\alpha L^{'}(\theta^{t-1})\tag{7}\end{equation}$$
Here $\alpha$ is the step size, commonly called the learning rate, and is usually just set to a small value.
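As a minimal sketch of update rule (7), assume a toy scalar loss $L(\theta)=(\theta-3)^2$ (a made-up example, not the objective used in the full code below):

```python
# Gradient descent on L(θ) = (θ - 3)^2, whose minimum is at θ = 3.
def L_grad(theta):
    return 2 * (theta - 3)      # L'(θ)

theta, alpha = 0.0, 0.1         # initial value and learning rate (arbitrary)
for t in range(50):
    theta = theta - alpha * L_grad(theta)   # equation (7)
print(theta)                    # ≈ 3.0 after 50 iterations
```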
Newton's method itself is a first-order algorithm; at heart it is a root-finding algorithm. To derive Newton's method from the first-order Taylor formula, set $L(\theta^t)=0$ in equation (6); this yields
$$\begin{equation}\theta^t = \theta^{t-1}-\frac{L(\theta^{t-1})}{L^{'}(\theta^{t-1})}\tag{8}\end{equation}$$
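For instance, update rule (8) can be used to find the root of a scalar function; a minimal sketch, using $L(\theta)=\theta^2-2$ (i.e., computing $\sqrt{2}$, chosen purely for illustration):

```python
# Newton's method as a root finder: solve L(θ) = θ^2 - 2 = 0.
L  = lambda th: th**2 - 2
dL = lambda th: 2 * th

theta = 1.0                     # initial guess (arbitrary, must not make dL zero)
for _ in range(6):
    theta = theta - L(theta) / dL(theta)   # equation (8)
print(theta)                    # ≈ 1.41421356, i.e. sqrt(2)
```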
However, in machine learning tasks Newton's method is used to find the optimum of the objective function. When used to find an optimum (an extremum), we are solving for the roots of "derivative equals zero", which requires the second derivative, so the method becomes a second-order algorithm. Newton's method for optimization is iterative: in each iteration, it takes a second-order Taylor expansion of $L(\theta^{t})$ around the current estimate of the minimum, $\theta^{t-1}$, and uses it to find the next estimate $\theta^t$.
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta + L^{''}(\theta^{t-1})\frac{\Delta \theta^2}{2}\tag{9}\end{equation}$$
First consider the case where the parameter $\theta$ is a scalar, and denote the first derivative $L^{'}$ and the second derivative $L^{''}$ by $g$ and $h$ respectively:
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+g\Delta \theta + h\frac{\Delta \theta^2}{2}\tag{10}\end{equation}$$
To make $L(\theta^t)$ as small as possible, we minimize $g\Delta \theta + h\frac{\Delta \theta^2}{2}$ (treating $L(\theta^{t-1})$ as a constant). Setting $\frac{\partial(g\Delta \theta + h\frac{\Delta \theta^2}{2})}{\partial \Delta \theta}=0$ gives $\Delta \theta = -\frac{g}{h}$, and therefore
$$\begin{equation}\theta^{t} = \theta^{t-1}+\Delta \theta=\theta^{t-1}-\frac{g}{h}\tag{11}\end{equation}$$
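A minimal sketch of the scalar update (11), again on a toy loss (here $L(\theta)=\theta^4+\theta^2-2\theta$, a convex function chosen arbitrarily for illustration, with minimum near $\theta\approx 0.59$):

```python
# Newton's method for minimization of L(θ) = θ^4 + θ^2 - 2θ.
g = lambda th: 4 * th**3 + 2 * th - 2    # first derivative L'(θ)
h = lambda th: 12 * th**2 + 2            # second derivative L''(θ), positive everywhere

theta = 2.0                              # initial estimate (arbitrary)
for _ in range(8):
    theta = theta - g(theta) / h(theta)  # equation (11): Δθ = -g/h
print(theta)                             # ≈ 0.59, where L'(θ) = 0
```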
Generalizing the parameter $\theta$ from a scalar to a vector, the update rule in each iteration becomes
$$\begin{equation}\theta^{t} =\theta^{t-1}-H^{-1}g\tag{12}\end{equation}$$
where $g$ is the Jacobian (the gradient vector), whose elements are the first-order partial derivatives, and $H$ is the Hessian matrix, whose elements are the second-order partial derivatives.
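The complete Python implementation below demonstrates both gradient descent and Newton's method on an example objective, using sympy for symbolic differentiation.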
```python
import numpy as np
import sympy
from sympy import symbols, diff
from sympy.tensor.array import derive_by_array

# Symbolic variables and the objective function to be minimized.
x1, x2, x3, x4 = symbols('x1, x2, x3, x4')
# Y = x1 ** 2 + x2 ** 2
# Y = x1**2 + 3*x1 - 3*x1*x2 + 2*x2**2 + 4*x2
Y = x1 ** 2 + x2 ** 2 - 4 * x1 - 6 * x2 + 13 + sympy.sqrt(x3) - sympy.sqrt(x4)
var_list = [x1, x2, x3, x4]


def gradient(f, X_, X):
    """Symbolic gradient of f w.r.t. the symbols X_, evaluated at the point X."""
    grad_ = sympy.Array([diff(f, x_) for x_ in X_])
    grad = grad_
    for i, x_ in enumerate(X_):
        grad = grad.subs(x_, X[i])
    return grad_, np.array(grad.tolist(), dtype=np.float32)


def jacobian2(f, X_, X):
    """Jacobian (first-order partial derivatives) of the scalar f, evaluated at X."""
    G_ = sympy.Array([diff(f, x_) for x_ in X_])
    G = G_
    for i, x_ in enumerate(X_):
        G = G.subs(x_, X[i])
    return G_, np.array(G.tolist(), dtype=np.float32)


def hessian2(f, X_, X):
    """Hessian (second-order partial derivatives) of f, evaluated at X."""
    H_ = sympy.Array([[diff(f, x_1).diff(x_2) for x_2 in X_] for x_1 in X_])
    H = H_
    for i, x_ in enumerate(X_):
        H = H.subs(x_, X[i])
    return H_, np.array(H.tolist(), dtype=np.float32)


def jacobian3():
    """Alternative: first derivatives via derive_by_array."""
    return derive_by_array(Y, (x1, x2))


def hessian3():
    """Alternative: second derivatives via nested derive_by_array."""
    return derive_by_array(derive_by_array(Y, (x1, x2)), (x1, x2))


def newton(f, X_, X, iters):
    """Newton's method, equation (12).

    :param f: objective function (sympy expression)
    :param X_: list of sympy symbols
    :param X: initial point
    :param iters: number of iterations
    """
    G_, G = jacobian2(f, X_, X)
    H_, H = hessian2(f, X_, X)
    H_G = np.matmul(np.linalg.inv(H), G)   # H^{-1} g
    X_new = X
    for i in range(iters):
        X_new = X_new - H_G
        print("Iteration {}: {}".format(i + 1, X_new))
        # Re-evaluate the Jacobian and Hessian at the new point.
        G_tmp, H_tmp = G_, H_
        for j, x_ in enumerate(X_):
            H_tmp = H_tmp.subs(x_, X_new[j])
            G_tmp = G_tmp.subs(x_, X_new[j])
        Hessian_inv = np.linalg.inv(np.array(H_tmp.tolist(), dtype=np.float32))
        H_G = np.matmul(Hessian_inv, np.array(G_tmp.tolist(), dtype=np.float32))
    return X_new


def gradient_descent_1d_decay(f, X_, X, iters, learning_rate=0.01, decay=0.5):
    """Gradient descent, equation (7); the decay schedule is left commented out."""
    for i in range(iters):
        # learning_rate = learning_rate * 1.0 / (1.0 + decay * i)
        _, grad = gradient(f, X_, X)
        X = X - learning_rate * grad
        X[X < 1e-6] = 1e-6   # keep x3, x4 positive so the sqrt terms stay defined
        print("Iteration {}: {}".format(i + 1, X))
    return X


if __name__ == "__main__":
    j2_, j2 = jacobian2(Y, var_list, np.array([1, 2, 1, 1]))
    h2_, h2 = hessian2(Y, var_list, np.array([1, 2, 1, 1]))
    print(j2_, j2)
    print(h2_, h2)
    print("newton:----------------------------")
    print(newton(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 20))
    print("gradient descent:----------------------------")
    print(gradient_descent_1d_decay(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 200))
```