（一）线性循环神经网络（RNN）

时间 2021-02-18

标签 html python git github 算法微信网络 app dom 函数栏目应用数学繁體版

原文原文链接

做者：chen_h
微信号 & QQ：862251340
微信公众号：coderpai
简书地址：https://www.jianshu.com/p/160...html

这篇教程是翻译 Peter Roelants写的循环神经网络教程，做者已经受权翻译，这是原文。

该教程将介绍如何实现一个循环神经网络（RNN），一共包含两部分。你能够在如下连接找到完整内容。python

这篇教程中的代码是由 Python 2 IPython Notebook产生的，在教程的最后，我会给出所有代码的连接，帮助学习。神经网络中有关矩阵的运算咱们采用NumPy来构建，画图使用Matplotlib来构建。若是你来没有安装这些软件，那么我强烈建议你使用Anaconda Python来安装，这个软件包中包含了运行这个教程的全部软件包，很是方便使用。git

循环神经网络

本教程主要包含三部分：github

一个很是简单的循环神经网络（RNN）
基于时序的反向传播（BPTT）
弹性优化算法

循环神经网络是一种能够解决序列数据的模型。在时序模型上面，这种循环关系能够定义成以下式子：算法

其中，Sk表示在时间k时刻的状态，Xk是在时序k时刻的输入数据，Wrec和Wx都是神经网络的连接权重。若是简单的理解，能够把RNN理解成是一个带反馈回路的状态模型。因为循环关系和延时处理，时序状态被加入了模型之中。这个延时操做赋予了模型记忆力，由于它能记住模型前面一个状态。微信

神经网络最后的输出结果Yk是在时间k时刻计算出来的，便是经过前面一个或者多个状态Sk，....，Sk+j计算出来的。网络

接下来，咱们就能够经过输入的数据Xk和前一步的状态S(k-1)，来计算当前的状态S(k)，或者经过输入的数据Xk和前一步的状态S(k)来预测下一步的状态S(k+1)。app

这篇教程会说明循环神经网络和通常的前馈神经网络没有很大的不一样，可是在训练的方式上面可能会有一些不一样。dom

线性循环神经网络

这部分教程咱们来设计一个简单的RNN模型，这个模型的输入是一个二进制的数据流，任务是去计算这个二进制的数据流中存在几个1。函数

在这个教程中，咱们设计的RNN模型中的状态只有一维，在每一个时间点上，输入数据也是一维的，最后输出的结果就是序列状态的最后一个状态，即y = S(k)。咱们将RNN模型进行展开，就能够获得下图的模型。注意，展开的模型能够看作是一个 (n+1) 层的神经网络，每一层使用相同的连接权重Wrec和Wx。

虽然实现和训练这个模型是一件很是有意思的事情，可是咱们能够很容易获得，当W(rec) = W(x) = 1时，模型是最优的。

咱们先导入教程须要的软件包

import numpy as np 
import matplotlib
import matplotlib.pyplot as plt 
from matplotlib import cm
from matplotlib.colors import LogNorm

定义数据集

输入数据集 X 一共有20组数据，每组数据的长度是10，即每组数据的时间状态步长是10。输入数据是由均匀的随机分布产生的，取值 0 或者 1 。

输出结果是输入的二进制数据流中存在几个1，也就是把序列的每一位都加起来求和的结果。

# Create dataset
nb_of_samples = 20
sequence_len = 10
# Create the sequences
X = np.zeros((nb_of_samples, sequence_len))
for row_idx in range(nb_of_samples):
    X[row_idx,:] = np.around(np.random.rand(sequence_len)).astype(int)
# Create the targets for each sequence
t = np.sum(X, axis=1)

经过基于时序的反向传播（BPTT）算法进行训练

训练RNN的一个典型算法是BPTT（backpropagation through time）算法。经过名字，你也能发现这是一个基于BP的算法。

若是你很了解常规的BP算法，那么BPTT算法和常规的BP算法没有很大的不一样。惟一的不一样是，RNN须要每个特定的时间步骤中，将每一个神经元进行展开处理而已。展开图已经在教程的最前面进行了说明。展开后，模型就和规则的神经网络模型很像了。惟一不一样是，RNN有多个输入源（前一个时间步骤的输入状态和当前的输入数据）和每一层中的连接矩阵（ W(rec)和W(x) ）都是同样的。

正向传播计算RNN的输出结果

正向传播的时候，咱们会把RNN展开进行处理，这样就能够按照规则的神经网络进行处理了。RNN模型最后的输出结果将会被使用在损失函数的计算中，用于训练网络。（其实这些都和常规的多层神经网络同样。）

当咱们将RNN进行展开计算时，在不一样的时间点上面，其实循环关系是相同的，咱们将这个相同的循环关系在 update_state 函数中实现了。

forward_states函数经过 for 循环，将update_state函数应用到每个时间点上面。若是咱们将这些步骤都矢量化，那么就能够进行并行计算了。跟常规神经网络同样，咱们须要给权重进行初始化。在这个教程中，咱们将权重初始化为0。

最后，咱们经过累加因此输入数据的偏差进行计算均方偏差函数（MSE）来获得损失函数 ξ 。在程序中，咱们使用 cost 函数来实现。

# Define the forward step functions
def update_state(xk, sk, wx, wRec):
    """
    Compute state k from the previous state (sk) and current input (xk),
    by use of the input weights (wx) and recursive weights (wRec).
    """
    return xk * wx + sk * wRec

def forward_states(X, wx, wRec):
    """
    Unfold the network and compute all state activations given the input X,
    and input weights (wx) and recursive weights (wRec).
    Return the state activations in a matrix, the last column S[:,-1] contains the
    final activations.
    """
    # Initialise the matrix that holds all states for all input sequences.
    # The initial state s0 is set to 0.
    S = np.zeros((X.shape[0], X.shape[1]+1))
    # Use the recurrence relation defined by update_state to update the 
    #  states trough time.
    for k in range(0, X.shape[1]):
        # S[k] = S[k-1] * wRec + X[k] * wx
        S[:,k+1] = update_state(X[:,k], S[:,k], wx, wRec)
    return S

def cost(y, t): 
    """
    Return the MSE between the targets t and the outputs y.
    """
    return ((t - y)**2).sum() / nb_of_samples

反向传播的梯度计算

在进行反向传播过程以前，咱们须要先计算偏差的对于输出结果的梯度∂ξ/∂y，函数 output_gradient 实现了这个梯度计算过程。这个梯度将会被经过反向传播算法一层一层的向前传播，函数 backward_gradient 实现了这个计算过程。具体的数学推导以下所示：

梯度最开始的计算公式为：

其中，n 表示RNN展开以后的时间步长。须要注意的是，参数 Wrec 担当着反向传递偏差的角色。

损失函数对于权重的梯度是经过累加每一层中的梯度获得的。具体数学公式以下：

def output_gradient(y, t):
    """
    Compute the gradient of the MSE cost function with respect to the output y.
    """
    return 2.0 * (y - t) / nb_of_samples

def backward_gradient(X, S, grad_out, wRec):
    """
    Backpropagate the gradient computed at the output (grad_out) through the network.
    Accumulate the parameter gradients for wX and wRec by for each layer by addition.
    Return the parameter gradients as a tuple, and the gradients at the output of each layer.
    """
    # Initialise the array that stores the gradients of the cost with respect to the states.
    grad_over_time = np.zeros((X.shape[0], X.shape[1]+1))
    grad_over_time[:,-1] = grad_out
    # Set the gradient accumulations to 0
    wx_grad = 0
    wRec_grad = 0
    for k in range(X.shape[1], 0, -1):
        # Compute the parameter gradients and accumulate the results.
        wx_grad += np.sum(grad_over_time[:,k] * X[:,k-1])
        wRec_grad += np.sum(grad_over_time[:,k] * S[:,k-1])
        # Compute the gradient at the output of the previous layer
        grad_over_time[:,k-1] = grad_over_time[:,k] * wRec
    return (wx_grad, wRec_grad), grad_over_time

梯度检查

对于RNN，咱们也须要对其进行梯度检查，具体的检查方法能够参考在常规多层神经网络中的梯度检查。若是在反向传播中的梯度计算正确，那么这个梯度值应该和数值计算出来的梯度值应该是相同的。

# Perform gradient checking
# Set the weight parameters used during gradient checking
params = [1.2, 1.2]  # [wx, wRec]
# Set the small change to compute the numerical gradient
eps = 1e-7
# Compute the backprop gradients
S = forward_states(X, params[0], params[1])
grad_out = output_gradient(S[:,-1], t)
backprop_grads, grad_over_time = backward_gradient(X, S, grad_out, params[1])
# Compute the numerical gradient for each parameter in the layer
for p_idx, _ in enumerate(params):
    grad_backprop = backprop_grads[p_idx]
    # + eps
    params[p_idx] += eps
    plus_cost = cost(forward_states(X, params[0], params[1])[:,-1], t)
    # - eps
    params[p_idx] -= 2 * eps
    min_cost = cost(forward_states(X, params[0], params[1])[:,-1], t)
    # reset param value
    params[p_idx] += eps
    # calculate numerical gradient
    grad_num = (plus_cost - min_cost) / (2*eps)
    # Raise error if the numerical grade is not close to the backprop gradient
    if not np.isclose(grad_num, grad_backprop):
        raise ValueError('Numerical gradient of {:.6f} is not close to the backpropagation gradient of {:.6f}!'.format(float(grad_num), float(grad_backprop)))
print('No gradient errors found')

No gradient errors found

参数更新

因为不稳定的梯度，RNN是很是难训练的。这也使得通常对于梯度的优化算法，好比梯度降低，都不能使得RNN找到一个好的局部最小值。

咱们在下面的两张图中说明了RNN梯度的不稳定性。第一张图表示，当咱们给定 w(x) 和 w(rec) 时获得的损失表面图。图中带颜色标记的地方，是咱们取了几个值作的实验结果。从图中，咱们能够发现，当偏差表面的值接近于0时，w(x) = w(rec) = 1。可是当 |w(rec)| > 1时，偏差表面的值增长的很是迅速。

第二张图咱们经过几组数据模拟了梯度的不稳定性，这个随着时间步长而不稳定的梯度的形式和等比数列的形式很像，具体数学公式以下：

在状态S(k)时的梯度，反向传播m步获得的状态S(k-m)能够被写成：

在咱们简单的线性模型中，若是 |w(rec)| > 1，那么梯度是一个指数爆炸的增加。若是 |w(rec)| < 1，那么梯度将会消失。

关于指数暴涨，在第二张图中，当咱们取 w(x) =1, w(rec) = 2时，在图中显示梯度是指数爆炸增加的，当咱们取 w(x) =1, w(rec) = -2时，正负徘徊指数增加，为何会出现徘徊？是由于咱们把参数 w(rec) 取成了负数。这个指数爆炸说明了，模型的训练对参数 w(rec) 是很是敏感的。

关于梯度消失，在第二张图中，当咱们取 w(x) = 1, w(rec) = 0.5和 w(x) = 1, w(rec) = -0.5时，那么梯度将会指数降低，直至消失。这个梯度消失表示模型不能长时间的训练，由于最后梯度将会消失。

若是 w(rec) = 0 时，梯度立刻变成了0。当 w(rec) = 1时，梯度随着时间不变。

在下一部分，咱们将说明怎么去优化一个不稳定的偏差函数。

# Define plotting functions

# Define points to annotate (wx, wRec, color)
points = [(2,1,'r'), (1,2,'b'), (1,-2,'g'), (1,0,'c'), (1,0.5,'m'), (1,-0.5,'y')]

def get_cost_surface(w1_low, w1_high, w2_low, w2_high, nb_of_ws, cost_func):
    """Define a vector of weights for which we want to plot the cost."""
    w1 = np.linspace(w1_low, w1_high, num=nb_of_ws)  # Weight 1
    w2 = np.linspace(w2_low, w2_high, num=nb_of_ws)  # Weight 2
    ws1, ws2 = np.meshgrid(w1, w2)  # Generate grid
    cost_ws = np.zeros((nb_of_ws, nb_of_ws))  # Initialize cost matrix
    # Fill the cost matrix for each combination of weights
    for i in range(nb_of_ws):
        for j in range(nb_of_ws):
            cost_ws[i,j] = cost_func(ws1[i,j], ws2[i,j])
    return ws1, ws2, cost_ws

def plot_surface(ax, ws1, ws2, cost_ws):
    """Plot the cost in function of the weights."""
    surf = ax.contourf(ws1, ws2, cost_ws, levels=np.logspace(-0.2, 8, 30), cmap=cm.pink, norm=LogNorm())
    ax.set_xlabel('$w_{in}$', fontsize=15)
    ax.set_ylabel('$w_{rec}$', fontsize=15)
    return surf

def plot_points(ax, points):
    """Plot the annotation points on the given axis."""
    for wx, wRec, c in points:
        ax.plot(wx, wRec, c+'o', linewidth=2)

def get_cost_surface_figure(cost_func, points):
    """Plot the cost surfaces together with the annotated points."""
    # Plot figures
    fig = plt.figure(figsize=(10, 4))   
    # Plot overview of cost function
    ax_1 = fig.add_subplot(1,2,1)
    ws1_1, ws2_1, cost_ws_1 = get_cost_surface(-3, 3, -3, 3, 100, cost_func)
    surf_1 = plot_surface(ax_1, ws1_1, ws2_1, cost_ws_1 + 1)
    plot_points(ax_1, points)
    ax_1.set_xlim(-3, 3)
    ax_1.set_ylim(-3, 3)
    # Plot zoom of cost function
    ax_2 = fig.add_subplot(1,2,2)
    ws1_2, ws2_2, cost_ws_2 = get_cost_surface(0, 2, 0, 2, 100, cost_func)
    surf_2 = plot_surface(ax_2, ws1_2, ws2_2, cost_ws_2 + 1)
    plot_points(ax_2, points)
    ax_2.set_xlim(0, 2)
    ax_2.set_ylim(0, 2)
    # Show the colorbar
    fig.subplots_adjust(right=0.8)
    cax = fig.add_axes([0.85, 0.12, 0.03, 0.78])
    cbar = fig.colorbar(surf_1, ticks=np.logspace(0, 8, 9), cax=cax)
    cbar.ax.set_ylabel('$\\xi$', fontsize=15, rotation=0, labelpad=20)
    cbar.set_ticklabels(['{:.0e}'.format(i) for i in np.logspace(0, 8, 9)])
    fig.suptitle('Cost surface', fontsize=15)
    return fig

def plot_gradient_over_time(points, get_grad_over_time):
    """Plot the gradients of the annotated point and how the evolve over time."""
    fig = plt.figure(figsize=(6.5, 4))  
    ax = plt.subplot(111)
    # Plot points
    for wx, wRec, c in points:
        grad_over_time = get_grad_over_time(wx, wRec)
        x = np.arange(-grad_over_time.shape[1]+1, 1, 1)
        plt.plot(x, np.sum(grad_over_time, axis=0), c+'-', label='({0}, {1})'.format(wx, wRec), linewidth=1, markersize=8)
    plt.xlim(0, -grad_over_time.shape[1]+1)
    # Set up plot axis
    plt.xticks(x)
    plt.yscale('symlog')
    plt.yticks([10**8, 10**6, 10**4, 10**2, 0, -10**2, -10**4, -10**6, -10**8])
    plt.xlabel('timestep k', fontsize=12)
    plt.ylabel('$\\frac{\\partial \\xi}{\\partial S_{k}}$', fontsize=20, rotation=0)
    plt.grid()
    plt.title('Unstability of gradient in backward propagation.\n(backpropagate from left to right)')
    # Set legend
    leg = plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), frameon=False, numpoints=1)
    leg.set_title('$(w_x, w_{rec})$', prop={'size':15})
    
def get_grad_over_time(wx, wRec):
    """Helper func to only get the gradient over time from wx and wRec."""
    S = forward_states(X, wx, wRec)
    grad_out = output_gradient(S[:,-1], t).sum()
    _, grad_over_time = backward_gradient(X, S, grad_out, wRec)
    return grad_over_time

# Plot cost surface and gradients

# Get and plot the cost surface figure with markers
fig = get_cost_surface_figure(lambda w1, w2: cost(forward_states(X, w1, w2)[:,-1] , t), points)

# Get the plots of the gradients changing by backpropagating.
plot_gradient_over_time(points, get_grad_over_time)
# Show figures
plt.show()

弹性优化算法

在上面的部分，咱们已经介绍了RNN的梯度是很是不稳定的，因此梯度在损失表面的跳跃度是很是大的，也就是说优化程序可能将最优值带到离真实最优值很远的地方，以下图：

根据在咱们神经网络里面的基础教程，梯度降低法更新参数的公式以下：

其中，W(i) 表示在第 i 次迭代时 W 的值，μ 是学习率。

在训练过程当中，当咱们取 w(x) = 1 和 w(rec) = 2时，偏差表面上的蓝色点的梯度值将达到 10^7。尽管咱们把学习率取的很是小，好比0.000001(1e-6)，可是参数 W 也将离开原来的距离 10 个单位，在咱们的模型中，这将会致使灾难性的结果。一个解决方案是咱们再下降学习率的值，可是这样作将致使，当梯度很小时，更新的点将保持在原地不动。

对于这个问题，研究者们已经找到了不少的方法来解决不稳定的梯度，好比Gradient clipping，Hessian-Free Optimization，Momentum。

咱们可使用一些优化算法来处理这个不稳定梯度，以此来减少梯度的敏感度。其中一个技术就是使用弹性反向传播（Rprop）。弹性反向传播算法和以前教程中的动量算法很是类似，可是这里只是用在梯度上面，用来更新参数。Rprop算法描述以下：

通常状况下，模型的超参数被设置为 η^+ = 1.2 和 η^- = 0.5 。若是咱们将这个Rprop算法和以前的动量算法进行对比的话，咱们能够发现：当梯度的符合不改变时，咱们将增长 20% 的权重；当梯度的符合改变时，咱们将减少 50% 的权重。注意，Rprop算法的更新值 Δ 相似于动量中的速度参数。不一样点是Rprop算法的值只是反映了动量中的速度的值，不包括方向。方向是由当前梯度的方向来决定的。

在这个教程中，咱们迭代这个Rprop算法 500 次。下图中的蓝色点就是在偏差表面的更新值。注意图中，尽管权重参数开始的位置是在一个很高的偏差值和一个很高的梯度位置，可是在咱们的迭代最后，Rprop算法仍是将最优值锁定在坐标 (1, 1) 左右。

# Define Rprop optimisation function
def update_rprop(X, t, W, W_prev_sign, W_delta, eta_p, eta_n):
    """
    Update Rprop values in one iteration.
    X: input data.
    t: targets.
    W: Current weight parameters.
    W_prev_sign: Previous sign of the W gradient.
    W_delta: Rprop update values (Delta).
    eta_p, eta_n: Rprop hyperparameters.
    """
    # Perform forward and backward pass to get the gradients
    S = forward_states(X, W[0], W[1])
    grad_out = output_gradient(S[:,-1], t)
    W_grads, _ = backward_gradient(X, S, grad_out, W[1])
    W_sign = np.sign(W_grads)  # Sign of new gradient
    # Update the Delta (update value) for each weight parameter seperately
    for i, _ in enumerate(W):
        if W_sign[i] == W_prev_sign[i]:
            W_delta[i] *= eta_p
        else:
            W_delta[i] *= eta_n
    return W_delta, W_sign

# Perform Rprop optimisation

# Set hyperparameters
eta_p = 1.2
eta_n = 0.5

# Set initial parameters
W = [-1.5, 2]  # [wx, wRec]
W_delta = [0.001, 0.001]  # Update values (Delta) for W
W_sign = [0, 0]  # Previous sign of W

ls_of_ws = [(W[0], W[1])]  # List of weights to plot
# Iterate over 500 iterations
for i in range(500):
    # Get the update values and sign of the last gradient
    W_delta, W_sign = update_rprop(X, t, W, W_sign, W_delta, eta_p, eta_n)
    # Update each weight parameter seperately
    for i, _ in enumerate(W):
        W[i] -= W_sign[i] * W_delta[i]
    ls_of_ws.append((W[0], W[1]))  # Add weights to list to plot

print('Final weights are: wx = {0},  wRec = {1}'.format(W[0], W[1]))

Final weights are: wx = 1.00135554721, wRec = 0.999674473785

# Plot the cost surface with the weights over the iterations.

# Define plot function
def plot_optimisation(ls_of_ws, cost_func):
    """Plot the optimisation iterations on the cost surface."""
    ws1, ws2 = zip(*ls_of_ws)
    # Plot figures
    fig = plt.figure(figsize=(10, 4))
    # Plot overview of cost function
    ax_1 = fig.add_subplot(1,2,1)
    ws1_1, ws2_1, cost_ws_1 = get_cost_surface(-3, 3, -3, 3, 100, cost_func)
    surf_1 = plot_surface(ax_1, ws1_1, ws2_1, cost_ws_1 + 1)
    ax_1.plot(ws1, ws2, 'b.')
    ax_1.set_xlim([-3,3])
    ax_1.set_ylim([-3,3])
    # Plot zoom of cost function
    ax_2 = fig.add_subplot(1,2,2)
    ws1_2, ws2_2, cost_ws_2 = get_cost_surface(0, 2, 0, 2, 100, cost_func)
    surf_2 = plot_surface(ax_2, ws1_2, ws2_2, cost_ws_2 + 1)
    ax_2.set_xlim([0,2])
    ax_2.set_ylim([0,2])
    surf_2 = plot_surface(ax_2, ws1_2, ws2_2, cost_ws_2)
    ax_2.plot(ws1, ws2, 'b.')
    # Show the colorbar
    fig.subplots_adjust(right=0.8)
    cax = fig.add_axes([0.85, 0.12, 0.03, 0.78])
    cbar = fig.colorbar(surf_1, ticks=np.logspace(0, 8, 9), cax=cax)
    cbar.ax.set_ylabel('$\\xi$', fontsize=15)
    cbar.set_ticklabels(['{:.0e}'.format(i) for i in np.logspace(0, 8, 9)])
    plt.suptitle('Cost surface', fontsize=15)
    plt.show()
    
# Plot the optimisation
plot_optimisation(ls_of_ws, lambda w1, w2: cost(forward_states(X, w1, w2)[:,-1] , t))
plt.show()

测试模型

最后咱们编写测试代码。从代码的执行中，咱们能发现目标值和真实值很是的相近。若是咱们取模型输出值的最靠近的整数，那么预测值的输出将更加完美。

test_inpt = np.asmatrix([[0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]])
print test_inpt
test_outpt = forward_states(test_inpt, W[0], W[1])[:,-1]
print 'Target output: {:d} vs Model output: {:.2f}'.format(test_inpt.sum(), test_outpt[0])

Target output: 5 vs Model output: 4.99

完整代码，点击这里

做者：chen_h
微信号 & QQ：862251340
简书地址：https://www.jianshu.com/p/160...

CoderPai 是一个专一于算法实战的平台，从基础的算法到人工智能算法都有设计。若是你对算法实战感兴趣，请快快关注咱们吧。加入AI实战微信群，AI实战QQ群，ACM算法微信群，ACM算法QQ群。长按或者扫描以下二维码，关注 “CoderPai” 微信号（coderpai）