Machine learning ultimately comes down to finding the optimum of a model's objective. Because the objective is often not fully convex, a poor optimization method may fail to reach an extremum, get stuck at a local one, or even wander off entirely, so choosing a good optimizer is crucial. We start with the most conventional method: gradient descent. The algorithms introduced below are not tied to a single model; they apply to anything differentiable.

### Gradient Descent

Gradient descent is commonly used for logistic regression or linear regression. Take logistic regression: setting its derivative to zero yields no closed-form solution, so it can only be solved iteratively. With labels $y \in \{0, 1\}$ and prediction $\hat{y} = \sigma(\theta^Tx)$, the probability of getting a single sample right is:

$$P(y|x) = \hat{y}^y(1-\hat{y})^{1-y}$$

The maximum-likelihood function over $m$ samples:

$$L(\theta) = \prod_{i=1}^{m}\hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}$$

Simplifying (take the negative log and average):

$$loss = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right]$$

The target function is the loss above, and the job is to optimize it. Differentiating it gives the gradient at the current point; the parameters are updated along the negative gradient direction, which is why the update ends up subtracting:

$$\theta \leftarrow \theta - \alpha\nabla_\theta\, loss$$
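As a minimal sketch of that update rule (illustrative only; the toy dataset, learning rate, and function names here are made up, not the author's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, y, theta, lr=0.1):
    """One step of gradient descent on the cross-entropy loss."""
    probs = sigmoid(X @ theta)
    grad = X.T @ (probs - y) / len(y)  # d(loss)/d(theta)
    return theta - lr * grad           # move against the gradient

# toy 2-feature dataset
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3]])
y = np.array([1.0, 0.0, 0.0])
theta = np.zeros(2)
for _ in range(100):
    theta = gradient_step(X, y, theta)
```

Each step subtracts the gradient, so for a small enough learning rate the loss decreases at every iteration.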
```python
import numpy as np

def sigmoid(x1, x2, theta1, theta2, theta3):
    z = (theta1 * x1 + theta2 * x2 + theta3).astype(float)
    return 1.0 / (1.0 + np.exp(-z))

'''cross entropy loss function'''
def cost(x1, x2, y, theta_1, theta_2, theta_3):
    sigmoid_probs = sigmoid(x1, x2, theta_1, theta_2, theta_3)
    return -np.mean(y * np.log(sigmoid_probs)
                    + (1 - y) * np.log(1 - sigmoid_probs))
```
These are the sigmoid and cost functions; Newton's method is used here in place of gradient descent to fit the logistic regression.
```python
def gradient(x_1, x_2, y, theta_1, theta_2, theta_3):
    sigmoid_probs = sigmoid(x_1, x_2, theta_1, theta_2, theta_3)
    return np.array([np.sum((y - sigmoid_probs) * x_1),
                     np.sum((y - sigmoid_probs) * x_2),
                     np.sum(y - sigmoid_probs)])
```
These are the first derivatives: the data is two-dimensional, so together with the bias term there are three partial derivatives.
```python
'''calculate the Hessian matrix'''
def Hassion(x1, x2, y, theta1, theta2, theta3):
    sigmoid_probs = sigmoid(x1, x2, theta1, theta2, theta3)
    w = sigmoid_probs * (1 - sigmoid_probs)
    d1 = np.sum(w * x1 * x1)
    d2 = np.sum(w * x1 * x2)
    d3 = np.sum(w * x1)
    d4 = np.sum(w * x2 * x1)
    d5 = np.sum(w * x2 * x2)
    d6 = np.sum(w * x2)
    d7 = np.sum(w * x1)
    d8 = np.sum(w * x2)
    d9 = np.sum(w)
    H = np.array([[d1, d2, d3], [d4, d5, d6], [d7, d8, d9]])
    return H
```
The construction is a bit clumsy: the nine entries are filled in one by one, even though the matrix is symmetric (d3 equals d7, d6 equals d8).
```python
'''update parameters by Newton's method'''
def newton_method(x1, x2, y):
    # initialize parameters
    theta1 = 0.001
    theta2 = -0.4
    theta3 = 0.6
    delta_loss = np.inf
    loss = cost(x1, x2, y, theta1, theta2, theta3)
    kxi = 0.0000001
    max_iteration = 10000
    i = 0
    print('while i = ', i, ' Loss = ', loss, 'delta_loss = ', delta_loss)
    while np.abs(delta_loss) > kxi and i < max_iteration:
        i += 1
        g = gradient(x1, x2, y, theta1, theta2, theta3)
        hassion = Hassion(x1, x2, y, theta1, theta2, theta3)
        H_inverse = np.linalg.inv(hassion)
        delta = H_inverse @ g.T
        theta1 += delta[0]
        theta2 += delta[1]
        theta3 += delta[2]
        loss_new = cost(x1, x2, y, theta1, theta2, theta3)
        delta_loss = loss - loss_new
        loss = loss_new
        print('while i = ', i, ' Loss = ', loss, 'delta_loss = ', delta_loss)
    return np.array([theta1, theta2, theta3])
```
The driver function.
```python
def get_param_target():
    np.random.seed(12)
    num_observations = 5000
    x1 = np.random.multivariate_normal([0, 0], [[1, .75], [.75, 1]], num_observations)
    x2 = np.random.multivariate_normal([1, 4], [[1, .75], [.75, 1]], num_observations)
    simulated_separableish_features = np.vstack((x1, x2)).astype(np.float32)
    simulated_labels = np.hstack((np.zeros(num_observations),
                                  np.ones(num_observations)))
    return (simulated_separableish_features[:, 0],
            simulated_separableish_features[:, 1],
            simulated_labels)
```
This generates the simulated data.
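For reference, here is a condensed, self-contained sketch of the same Newton iteration in matrix form (the synthetic data and all names here are illustrative, not the author's code; the bias is folded in as a column of ones):

```python
import numpy as np

rng = np.random.default_rng(12)
# two overlapping Gaussian classes
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
X = np.hstack([X, np.ones((200, 1))])      # bias column
y = np.hstack([np.zeros(100), np.ones(100)])

theta = np.zeros(3)
for _ in range(8):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    g = X.T @ (y - p)                      # same gradient as above, in matrix form
    W = p * (1 - p)
    H = (X * W[:, None]).T @ X             # Hessian of the loss
    theta = theta + np.linalg.solve(H, g)  # Newton step: theta += H^-1 g
```

`np.linalg.solve` is used instead of explicitly inverting `H`, which is both cheaper and numerically safer.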
First derivative of linear regression: $$\frac{\partial\, loss}{\partial w_j} = -\sum_{i=1}^{m}\Big(y^i - \sum_{k=0}^{n-1}w_kx_k^i\Big)x_j^i$$ Second derivative: $$\frac{\partial^2 loss}{\partial w_j\partial w_k} = \sum_{i=1}^m x_j^ix_k^i$$ These can be matched one-to-one against the formulas above. The Armijo search uses only the first condition: $$f(x_k+\alpha p_k) \le f(x_k) + c_1 \alpha \nabla f_k^Tp_k$$ As soon as the condition is satisfied, the backtracking loop can exit.
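A minimal backtracking line search built on that Armijo condition (the quadratic `f`, `grad_f`, and all parameter values below are illustrative stand-ins, not the author's functions):

```python
import numpy as np

def armijo_step(f, grad_f, x, p, rho=0.5, c1=1e-4, max_backtracks=30):
    """Shrink alpha until f(x + alpha*p) <= f(x) + c1*alpha*grad_f(x)@p."""
    alpha = 1.0
    fx = f(x)
    slope = grad_f(x) @ p               # directional derivative, < 0 for descent
    for _ in range(max_backtracks):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:
            break
        alpha *= rho
    return alpha

f = lambda x: np.sum(x ** 2)            # toy quadratic
grad_f = lambda x: 2 * x
x = np.array([3.0, -4.0])
alpha = armijo_step(f, grad_f, x, -grad_f(x))
```

The loop exits at the first step length whose decrease beats the linear bound, which is exactly what the inner `while (m < 20)` loop in the code below does with `rho ** m`.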
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from DataProcesser import DataProcesser

def bfgs(feature, label, lam, maxCycle):
    n = np.shape(feature)[1]
    w0 = np.mat(np.zeros((n, 1)))
    rho = 0.55
    sigma = 0.4
    Bk = np.eye(n)
    k = 1
    while (k < maxCycle):
        print('Iterator: ', k, ' error: ', get_error(feature, label, w0))
        gk = get_gradient(feature, label, w0, lam)
        dk = np.mat(-np.linalg.solve(Bk, gk))
        m = 0
        mk = 0
        # Armijo backtracking line search
        while (m < 20):
            newf = get_result(feature, label, (w0 + rho**m * dk), lam)
            oldf = get_result(feature, label, w0, lam)
            if (newf < oldf + sigma * (rho**m) * (gk.T * dk)[0, 0]):
                mk = m
                break
            m += 1
        # BFGS update
        w = w0 + rho**mk * dk
        sk = w - w0
        yk = get_gradient(feature, label, w, lam) - gk
        if (yk.T * sk > 0):
            Bk = Bk - (Bk * sk * sk.T * Bk) / (sk.T * Bk * sk) + (yk * yk.T) / (yk.T * sk)
        k = k + 1
        w0 = w
    return w0

def get_error(feature, label, w):
    return (label - feature * w).T * (label - feature * w) / 2

def get_gradient(feature, label, w, lam):
    err = (label - feature * w).T
    left = err * (-1) * feature
    return left.T + lam * w

def get_result(feature, label, w, lam):
    left = (label - feature * w).T * (label - feature * w)
    right = lam * w.T * w
    return (left + right) / 2
```
The essential part is just those formulas; the rest needs little explanation.
### L-BFGS

L-BFGS also exists to save resources, but this time the resource is memory. Storing a full high-dimensional matrix at every step costs too much space, so the main job of L-BFGS is to cut the memory overhead: it no longer stores the whole matrix, but approximates BFGS by keeping only the vector pairs needed for the computation and rebuilding the required products from them. It does not even keep all of the pairs, only the latest $m$ of them, which brings the space complexity down from $O(n^2)$ to $O(mn)$.

L-BFGS unquestionably evolved from BFGS. The inverse-Hessian update is:

$$H_{k+1}^{-1}=(I-\frac{s_ky_k^T}{y_k^Ts_k})H_k^{-1}(I-\frac{y_ks_k^T}{y_k^Ts_k})+\frac{s_ks_k^T}{y_k^Ts_k}$$

Let $p_k = \frac{1}{y_k^Ts_k}$ and $v_k = I - p_ky_ks_k^T$, and the update can be rewritten as:

$$H_{k+1}^{-1} = v_k^TH_k^{-1}v_k+p_ks_ks_k^T$$

Writing $D_k = H_k^{-1}$ and unrolling the recursion, in general:

$$D_{k+1} = (v_k^Tv_{k-1}^T...v_1^Tv_0^T)D_0(v_0v_1...v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T...v_2^Tv_1^T)(p_0s_0s_0^T)(v_1v_2...v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T...v_3^Tv_2^T)(p_1s_1s_1^T)(v_2v_3...v_{k-1}v_k)\\ +...\\ +(v_k^Tv_{k-1}^T)(p_{k-2}s_{k-2}s_{k-2}^T)(v_{k-1}v_{k})\\ +v_k^T(p_{k-1}s_{k-1}s_{k-1}^T)v_k\\ +p_ks_ks_k^T$$

As this shows, the current inverse Hessian can be rebuilt entirely from the stored pairs, so it suffices to store the corresponding $s_i$ and $y_i$. As mentioned above, to compress memory only the latest $m$ pairs are kept, which updates the formula to:

$$D_{k+1} = (v_k^Tv_{k-1}^T...v_{k-m+1}^Tv_{k-m}^T)D_0(v_{k-m}v_{k-m+1}...v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T...v_{k-m+2}^Tv_{k-m+1}^T)(p_{k-m}s_{k-m}s_{k-m}^T)(v_{k-m+1}v_{k-m+2}...v_{k-1}v_k)\\ +(v_k^Tv_{k-1}^T...v_{k-m+3}^Tv_{k-m+2}^T)(p_{k-m+1}s_{k-m+1}s_{k-m+1}^T)(v_{k-m+2}v_{k-m+3}...v_{k-1}v_k)\\ +...\\ +(v_k^Tv_{k-1}^T)(p_{k-2}s_{k-2}s_{k-2}^T)(v_{k-1}v_{k})\\ +v_k^T(p_{k-1}s_{k-1}s_{k-1}^T)v_k\\ +p_ks_ks_k^T$$

The formula is long, but a well-known algorithm evaluates it without ever forming the matrix: the two-loop recursion.
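The two-loop recursion can be sketched as follows (a standalone illustration with $D_0 = I$; the pair values and names are made up for the demo):

```python
import numpy as np

def two_loop_recursion(q, s_list, y_list):
    """Compute D_k @ q from the stored (s_i, y_i) pairs, never forming D_k."""
    q = q.copy()
    alphas = []
    # first loop: newest pair to oldest
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q = q - a * y
        alphas.append(a)
    r = q  # multiply by D_0 = I
    # second loop: oldest pair to newest
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r = r + (a - beta) * s
    return r

s1 = np.array([1.0, 2.0])
y1 = np.array([0.5, 1.0])       # y1 @ s1 = 2.5 > 0, curvature condition holds
g = np.array([1.0, -1.0])
d = -two_loop_recursion(g, [s1], [y1])   # quasi-Newton search direction
```

Only $2m$ vectors are ever touched, which is where the $O(mn)$ memory bound comes from.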
Does this procedure really compute the same thing? Expanding the two loops reproduces exactly the unrolled formula above, which is the theoretical justification.
```python
def lbfgs(feature, label, lam, maxCycle, m=10):
    n = np.shape(feature)[1]
    w0 = np.mat(np.zeros((n, 1)))
    rho = 0.55
    sigma = 0.4
    H0 = np.eye(n)
    s = []
    y = []
    k = 1
    gk = get_gradient(feature, label, w0, lam)
    dk = -H0 * gk
    while (k < maxCycle):
        print('iter: ', k, ' error:', get_error(feature, label, w0))
        m1 = 0
        mk = 0
        gk = get_gradient(feature, label, w0, lam)
        # Armijo backtracking line search
        while (m1 < 20):
            newf = get_result(feature, label, (w0 + rho ** m1 * dk), lam)
            oldf = get_result(feature, label, w0, lam)
            if newf < oldf + sigma * (rho ** m1) * (gk.T * dk)[0, 0]:
                mk = m1
                break
            m1 = m1 + 1
        w = w0 + rho ** mk * dk
        # keep only the latest m pairs
        if k > m:
            s.pop(0)
            y.pop(0)
        sk = w - w0
        qk = get_gradient(feature, label, w, lam)
        yk = qk - gk
        s.append(sk)
        y.append(yk)
        # two-loop recursion
        t = len(s)
        a = []
        for i in range(t):
            alpha = (s[t - i - 1].T * qk) / (y[t - i - 1].T * s[t - i - 1])
            qk = qk - alpha[0, 0] * y[t - i - 1]
            a.append(alpha[0, 0])
        r = H0 * qk
        for i in range(t):
            beta = (y[i].T * r) / (y[i].T * s[i])
            r = r + s[i] * (a[t - i - 1] - beta[0, 0])
        # only accept the new direction when the curvature condition holds
        if yk.T * sk > 0:
            dk = -r
        k = k + 1
        w0 = w
    return w0
```
Note the extra check near the end that decides whether the new direction is accepted: the curvature condition $y_k^Ts_k > 0$ keeps the inverse-Hessian approximation positive definite, so $-r$ remains a descent direction.
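To see why the condition matters, a quick numerical check (illustrative values, not from the post) that the BFGS update of $B_k$ stays positive definite when $y_k^Ts_k > 0$:

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

B = np.eye(2)
s = np.array([1.0, 0.5])
y = np.array([0.8, 0.3])        # y @ s = 0.95 > 0: curvature condition holds
B_new = bfgs_update(B, s, y)
```

Since `B_new` stays symmetric positive definite, `-inv(B_new) @ g` is guaranteed to point downhill; when the condition fails, the code above simply skips the update.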
#### GitHub code: github.com/GreenArrow2…