[Deep Learning] From Recurrent Neural Networks (RNN) to LSTM and GRU

Preface

Previous article: [Deep Learning] From Neural Networks to Convolutional Neural Networks

Earlier we introduced BP (back-propagation) neural networks and convolutional neural networks (CNNs). Why, then, do we also need recurrent neural networks (RNNs)?

  • In BP networks and CNNs, inputs and outputs are treated as independent of one another, but in many real applications the current output is related to what came before.

BP networks and CNNs assume that each input is an independent unit with no context, for example classifying a single image as a dog or a cat. But for inputs with a clear sequential, contextual structure, such as predicting the next frame of a video, the output obviously has to depend on earlier inputs; in other words, the network needs some form of "memory". To give a network this memory, a special architecture, the recurrent neural network (RNN), was developed.

  • An RNN introduces the notion of "memory". "Recurrent" means that every element of the sequence is processed by the same operation, but the output depends on both the current input and the "memory".

Typical RNN applications: natural language processing, machine translation, speech recognition, and so on.

1. RNN (Recurrent Neural Networks)

  Recurrent neural networks are a class of networks for processing sequential data. Just as convolutional neural networks are specialized for grid-structured data (such as an image), recurrent neural networks are specialized for processing a sequence $x^{(1)}, \dots, x^{(T)}$.

The structure of an RNN is shown below:

[Figure: RNN network structure]

Compared with a convolutional network, the structure of a recurrent network is relatively simple: it usually contains only an input layer, a hidden layer, and an output layer, so with input and output layers included it has at most about five layers.

Unrolling the sequence over time gives the RNN structure shown below:

[Figure: RNN unrolled over time]

The input at a given time step, $x_t$, is an $n$-dimensional vector, just like the input of the BP network introduced earlier. The difference is that the input to a recurrent network is an entire sequence, $x = [x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T]$. For a language model, each $x_t$ represents a word vector and the whole sequence represents a sentence.

  • $h_t$ is the linear (pre-activation) value of the hidden neurons at time $t$
  • $s_t$ is the hidden state at time $t$, i.e. the "memory"
  • $o_t$ is the output at time $t$
  • $U$ is the input-to-hidden weight matrix
  • $W$ is the hidden-to-hidden weight matrix; it is the network's memory controller, responsible for managing the memory
  • $V$ is the hidden-to-output weight matrix

1.1 Training an RNN: BPTT

  An RNN is trained much like a CNN or an ordinary ANN, using the error back-propagation (BP) algorithm.

The differences are:

  • The parameters $U$, $V$, $W$ are shared across time steps, and in stochastic gradient descent the output at each step depends not only on the current step's network but also on the network state of several previous steps. This modified BP algorithm is called Backpropagation Through Time (BPTT).
  • Like BP, BPTT can suffer from vanishing and exploding gradients when training over long dependencies (the current output depends on a long preceding subsequence, typically more than about 10 steps).
  • BPTT follows the same idea as BP, computing partial derivatives; the difference is that it must account for the influence of earlier time steps on the current one.

1.2 RNN Forward Propagation

At time $t=1$, the weights $U$, $V$, $W$ have been randomly initialized and $s_0$ is usually initialized to 0; the computation is:

  • $h_1 = U x_1 + W s_0$
  • $s_1 = f(h_1)$
  • $o_1 = g(V s_1)$

At time $t=2$, the state $s_1$, serving as the memory of time step 1, takes part in the prediction at the next time step:

  • $h_2 = U x_2 + W s_1$
  • $s_2 = f(h_2)$
  • $o_2 = g(V s_2)$

Continuing in this way, in general:

  • $h_t = U x_t + W s_{t-1}$
  • $s_t = f(h_t)$
  • $o_t = g(V s_t)$

where $f$ can be an activation function such as tanh, ReLU or sigmoid, and $g$ is usually softmax (but can be something else).

  • Note that the "memory" of a recurrent network comes from $W$: it summarizes the past input states and feeds that summary into the next step as auxiliary input.
  • The hidden state can be read as: $h = f(\text{current input} + \text{summary of past memory})$. A minimal forward-pass sketch is given after this list.
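As a concrete illustration, here is a minimal NumPy sketch of the forward pass above. The dimensions, the choice of tanh for $f$ and softmax for $g$, and the function name `rnn_forward` are illustrative assumptions, not taken from the original article.

```python
# A minimal sketch of the RNN forward pass: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t)
import numpy as np

def rnn_forward(x_seq, U, W, V, s0=None):
    """Run the recurrence over a whole sequence and return the states and outputs."""
    m = W.shape[0]
    s = np.zeros(m) if s0 is None else s0
    states, outputs = [], []
    for x_t in x_seq:                            # one time step per sequence element
        h = U @ x_t + W @ s                      # pre-activation h_t
        s = np.tanh(h)                           # hidden state ("memory") s_t
        z = V @ s
        o = np.exp(z - z.max()); o /= o.sum()    # softmax output o_t
        states.append(s); outputs.append(o)
    return states, outputs

# Usage example with a random 4-step sequence of 3-dimensional inputs (hypothetical sizes).
rng = np.random.default_rng(0)
n, m, k, T = 3, 5, 2, 4
U, W, V = rng.normal(size=(m, n)), rng.normal(size=(m, m)), rng.normal(size=(k, m))
states, outputs = rnn_forward(rng.normal(size=(T, n)), U, W, V)
print(outputs[-1])   # o_T, a probability vector over k classes
```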

1.3 RNN Backward Propagation

   As in a BP neural network, error back-propagation sums the error at the output layer, computes the gradients $\nabla U$, $\nabla V$, $\nabla W$ of that error with respect to each weight matrix, and then updates the weights by gradient descent.

  At every time step $t$, the RNN's output $o_t$ incurs some error $e_t$; the loss function can be cross-entropy, squared error, and so on. The total error is $E = \sum_t e_t$, and our goal is to compute:

$E = \sum_t e_t$

$\nabla U = \frac{\partial E}{\partial U} = \sum_t\frac{\partial e_t}{\partial U}$

$\nabla V = \frac{\partial E}{\partial V} = \sum_t\frac{\partial e_t}{\partial V}$

$\nabla W = \frac{\partial E}{\partial W} = \sum_t\frac{\partial e_t}{\partial W}$

Below we take $t=3$ as an example.

Assume squared error is used and the true value is $y_t$; then:

$e_3 = \frac{1}{2}(o_3-y_3)^2$

$o_3 = g(V s_3)$

$e_3 = \frac{1}{2}\big(g(V s_3)-y_3\big)^2$

$s_3 = f(U x_3 + W s_2)$

$e_3 = \frac{1}{2}\big(g(V f(U x_3 + W s_2))-y_3\big)^2$

Derivative with respect to $W$:

In the expression above, the term involving $W$ is $W s_2$; this is clearly a composite function,

so we can differentiate it with the chain rule:

$\frac{\partial e_3}{\partial W} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial W}$

We now compute each factor in turn (using the squared-error loss):


$e_3 = \frac{1}{2}(o_3-y_3)^2$

$\frac{\partial e_3}{\partial o_3} = o_3 - y_3$


$o_3 = g(V s_3)$

$\frac{\partial o_3}{\partial s_3} = g' V$

where $g'$ denotes the derivative of the function $g$.


The first two factors are straightforward; the important one is the third.

From the recurrence

$s_t = f(U x_t + W s_{t-1})$

we see that $s_3$ depends not only on $W$ directly but also on the previous state $s_2$, which itself depends on $W$.

Expanding $s_3$ directly gives:

$\frac{\partial s_3}{\partial W}=\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W}$

  • where $\frac{\partial s_3^+}{\partial W}$ denotes the derivative without composite differentiation, i.e. treating everything other than the explicit $W$ as a constant,
  • and $\frac{\partial s_2}{\partial W}$ is the full composite derivative.

Expanding $s_2$ in the same way gives:

$\frac{\partial s_2}{\partial W}=\frac{\partial s_2}{\partial s_2}\frac{\partial s_2^+}{\partial W} + \frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}$

Expanding $s_1$ gives:

$\frac{\partial s_1}{\partial W}=\frac{\partial s_1}{\partial s_1}\frac{\partial s_1^+}{\partial W} + \frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W}$

Substituting the last two expansions into the first gives:

$\frac{\partial s_3}{\partial W}=\sum_{k=0}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$

Finally:

$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$

An alternative view (ignoring the activation $f$ for simplicity):

$s_t = U x_t + W s_{t-1}$

$s_3 = U x_3 + W s_2$

$\frac{\partial s_3}{\partial W} = s_2 + W\frac{\partial s_2}{\partial W}$

$= s_2 + W s_1 + W W\frac{\partial s_1}{\partial W}$

  • $s_2 = \frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}$
  • where $\frac{\partial s_3}{\partial s_3}=1$ and $\frac{\partial s_3^+}{\partial W}=s_2$, i.e. the derivative of $s_3$ with respect to the explicit $W$ only, without composite differentiation

$s_2 = U x_2 + W s_1$

  • $W s_1 = \frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}$
  • where $\frac{\partial s_3}{\partial s_2}=W$ and $\frac{\partial s_2^+}{\partial W}=s_1$

$s_1 = U x_1 + W s_0$

$W W\frac{\partial s_1}{\partial W}=\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}$

Finally:

$\frac{\partial s_3}{\partial W} =\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}+\frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}+\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\sum_{k=1}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$

[Figure: the chain of dependencies from $s_k$ to $s_3$]

$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$

From the figure above, applying the chain rule along the path from $s_k$ to $s_3$:

$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\Big(\prod_{j=k+1}^3\frac{\partial s_j}{\partial s_{j-1}}\Big)\frac{\partial s_k^+}{\partial W}$
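To make the BPTT result above tangible, here is a minimal scalar sketch, assuming $f = \tanh$, $g$ equal to the identity (so $o_t = V s_t$), and the squared-error loss. It accumulates $\partial s_t/\partial W$ through time with $\partial s_t/\partial W = f'(h_t)\,(s_{t-1} + W\,\partial s_{t-1}/\partial W)$, which is an equivalent recursive form of the sum above, and checks the analytic $\partial e_3/\partial W$ against a finite difference. All names and numbers are hypothetical.

```python
import numpy as np

def e3_of_W(W, U, V, x, y3, s0=0.0):
    """Loss e_3 = 0.5 * (V*s_3 - y_3)^2 for a scalar RNN unrolled over x."""
    s = s0
    for x_t in x:                                   # unroll t = 1..3
        s = np.tanh(U * x_t + W * s)
    return 0.5 * (V * s - y3) ** 2

U, V, W = 0.4, 0.7, 0.9
x, y3, s0 = [0.3, -1.2, 0.5], 0.2, 0.0

# Forward pass while accumulating ds_t/dW through time (BPTT).
s_prev, dsdW = s0, 0.0
for x_t in x:
    s = np.tanh(U * x_t + W * s_prev)
    dsdW = (1 - s ** 2) * (s_prev + W * dsdW)       # chain rule through time
    s_prev = s
o3 = V * s_prev
analytic = (o3 - y3) * V * dsdW                     # de3/do3 * do3/ds3 * ds3/dW

# Finite-difference check: the two printed values should agree closely.
eps = 1e-6
numeric = (e3_of_W(W + eps, U, V, x, y3) - e3_of_W(W - eps, U, V, x, y3)) / (2 * eps)
print(analytic, numeric)
```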

Derivative with respect to $U$ (analogous to $W$):

$\frac{\partial e_3}{\partial U} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial U}$

Let $a_t = U x_t$ and $b_t = W s_{t-1}$, so that

$s_t = f(a_t + b_t)$

For the third factor, from

$s_3 = f(U x_3 + W s_2)$

$\frac{\partial s_3}{\partial U}=f' \times \Big(\frac{\partial U x_3}{\partial U}+W\frac{\partial s_2}{\partial U}\Big)$

$=f' \times \Big(\frac{\partial U x_3}{\partial U}+W f' \times \big(\frac{\partial U x_2}{\partial U}+W\frac{\partial s_1}{\partial U}\big)\Big)$

$=f' \times \Big(\frac{\partial U x_3}{\partial U}+W f' \times \big(\frac{\partial U x_2}{\partial U}+W f' \times (\frac{\partial U x_1}{\partial U}+W\frac{\partial s_0}{\partial U})\big)\Big)$

$=f' \times \Bigg(\frac{\partial U x_3}{\partial U}+W f' \times \bigg(\frac{\partial U x_2}{\partial U}+W f' \times \Big(\frac{\partial U x_1}{\partial U}+W f' \times \big(\frac{\partial U x_0}{\partial U}\big)\Big)\bigg)\Bigg)$

$=f' \times \frac{\partial U x_3}{\partial U}+W(f')^2 \times \frac{\partial U x_2}{\partial U}+W^2(f')^3 \times \frac{\partial U x_1}{\partial U}+W^3(f')^4 \times \big(\frac{\partial U x_0}{\partial U}\big)$

$=\sum_{k=0}^3 (f')^{4-k}\frac{\partial (W^{3-k}a_k)}{\partial U}$

$\frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}(f')^{4-k}$

I am not entirely sure this result is correct; guidance from readers who know would be much appreciated, thank you.

不考虑 f f
s t = U x t + W s t 1 s_t=Ux_t+Ws_{t-1}
s 3 = U x 3 + W ( U x 2 + W ( U x 1 + W U x 0 ) ) s_3=Ux_3+W\Big(Ux_2+W\big(Ux_1+WUx_0\big)\Big)
= U x 3 + W U x 2 + W 2 U x 1 + W 3 U x 0 =Ux_3+WUx_2+W^2Ux_1+W^3Ux_0
s 3 = a 3 + W a 2 + W 2 a 1 + W 3 a 0 s_3 = a_3+Wa_2+W^2a_1+W^3a_0
s 3 U = k = 0 3 ( W 3 k a k ) U \frac{\partial s_3}{\partial U} =\sum_{k=0}^3 \frac{\partial (W^{3-k}a_k)}{\partial U}
e 3 U = k = 0 3 e 3 o 3 o 3 s 3 ( W 3 k a k ) U \frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}

Derivative with respect to $V$:

Since $V$ is involved only in the output $o_t$:

$\frac{\partial e_3}{\partial V} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial V}$

1.4 Limitations of RNN

  From the derivation above: suppose information from time $t=0$ is still needed at $t=100$. Because $W$ is raised to such a high power along the way, the network may well have "forgotten" the information from $t=0$. We call this the RNN vanishing-gradient problem, although the gradient does not literally vanish: it is an accumulated sum and cannot be exactly zero, but the contribution from a distant time step becomes so small that the content of the earlier steps is effectively forgotten. The short snippet after this paragraph gives a rough sense of the scale involved.
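This is only an intuition aid with made-up numbers, not part of the article's derivation: repeatedly multiplying by a factor slightly below 1 drives a term toward zero, while a factor slightly above 1 makes it explode.

```python
# Repeated multiplication over 100 time steps mirrors the powers of W in the BPTT sum:
# a factor < 1 vanishes, a factor > 1 explodes.
print(0.9 ** 100)   # ~2.66e-05 -> the contribution from 100 steps back is negligible
print(1.1 ** 100)   # ~1.38e+04 -> the same chain length explodes instead
```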

  To overcome the vanishing-gradient problem, the LSTM and GRU models were later introduced. Because they store "memory" in a special way, memories associated with large gradients are not immediately wiped out as in a plain RNN, so the vanishing-gradient problem is mitigated to some extent.

Another simple trick, used against exploding gradients, is gradient clipping: whenever a computed gradient exceeds a threshold $c$ or falls below $-c$, it is set to $c$ or $-c$, as in the sketch below.
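A minimal sketch of this value-clipping rule; the function name and threshold are illustrative, and norm-based clipping is a common variant not shown here.

```python
import numpy as np

def clip_gradient(grad, c=5.0):
    """Clip every gradient component into [-c, c], as described above."""
    return np.clip(grad, -c, c)

print(clip_gradient(np.array([0.3, -12.0, 7.5])))   # -> [ 0.3 -5.   5. ]
```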

The figure below shows the error surface of an RNN:

[Figure: RNN error surface]

As the figure shows, the error surface of an RNN is either very steep or very flat. If no measure is taken and a parameter update happens to land in a steep region, the gradient becomes very large and so does the update, which easily leads to oscillation. With gradient clipping, even if you are unlucky enough to hit a steep region the gradient cannot explode, because it is bounded by the threshold $c$.

2. LSTM (Long Short-Term Memory)

  Because RNNs have trouble with long-term dependencies, they can suffer from vanishing and exploding gradients. As its name suggests, the LSTM is particularly well suited to problems that need long-range dependencies. Compared with the RNN:

  • the LSTM's "memory cell" is redesigned;
  • information that should be recorded keeps being passed along, while information that should not be recorded is cut off.

The figure below shows the unrolled structure of a recurrent network:

[Figure: unrolled recurrent network]

The box labeled A represents the "memory cell".

The RNN "memory cell" looks like this:

[Figure: the RNN memory cell]

It applies only a simple nonlinear mapping.

The LSTM "memory cell" looks like this:

[Figure: the LSTM memory cell]

It adds three gates to control the "memory cell".

2.1 The Cell State

  The cell state is like a conveyor belt: it runs straight down the whole chain with only a few minor linear interactions, so it is easy for information to flow along it unchanged.

[Figure: the cell state]

How does the LSTM control the cell state?

  • The LSTM can remove information from or add information to the cell state through gate structures.
  • There are three main gates controlling the cell state:
  • the forget gate, the input (information-addition) gate, and the output gate.

2.2 Forget Gate

[Figure: the forget gate]

  • The previous output $h_{t-1}$ and the current input $x_t$ are passed through a sigmoid, producing a value between 0 and 1.
  • This value describes how much of each component is allowed through.
  • If it is 0, multiplying it with $C_{t-1}$ gives 0, meaning "let nothing through".
  • If it is 1, multiplying it with $C_{t-1}$ leaves $C_{t-1}$ unchanged, meaning "let everything through".

The forget gate decides what information to discard from the cell state. For example, in a language model the cell state may hold gender information ("he" or "she"); when a new pronoun appears, the old information can be forgotten.

2.3 Input (Information-Addition) Gate

[Figure: the input gate]

  • Decides what new information to put into the cell state.
  • A sigmoid layer decides which values to update.
  • A tanh layer creates a new candidate vector $\widetilde{C}_t$ in preparation for the state update.

[Figure: updating the cell state]

After the forget gate and the input gate, we know what to delete and what to add, so the cell state can be updated:

  • update $C_{t-1}$ to $C_t$;
  • multiply the old state by $f_t$, discarding the information we decided to drop;
  • add the new candidate value $i_t * \widetilde{C}_t$ to obtain the updated cell state.

2.4 Output Gate

[Figure: the output gate]

The output gate produces the output based on the cell state:

  • first a sigmoid layer determines which parts of the cell state will be output;
  • then the cell state is passed through tanh, giving values between -1 and 1, which are multiplied by the sigmoid gate's output so that only the selected parts are emitted.

2.5 LSTM Forward Propagation

$f_t = \sigma(W_f \cdot[h_{t-1}, x_t] + b_f)$

(below, the concatenation $[h_{t-1}, x_t]$ in this equation is written $x_f$)

$i_t = \sigma(W_i \cdot[h_{t-1}, x_t] + b_i)$

($[h_{t-1}, x_t]$ here is written $x_i$)

$\widetilde{C}_t = \tanh(W_C \cdot [h_{t-1},x_t]+b_C)$

($[h_{t-1}, x_t]$ here is written $x_C$)

$C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t$

$o_t=\sigma(W_o\cdot [h_{t-1}, x_t] + b_o)$

($[h_{t-1}, x_t]$ here is written $x_o$)

$h_t=o_t * \tanh(C_t)$

$\hat{y}_t=W_y \cdot h_t + b_y$
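Here is a minimal NumPy sketch of a single LSTM step that follows the equations above, with $[h_{t-1}, x_t]$ formed by concatenation. The dimensions, the parameter packing, and the function name are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step: gates, cell-state update, hidden state, and linear readout."""
    Wf, bf, Wi, bi, WC, bC, Wo, bo, Wy, by = params
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)                    # forget gate
    i_t = sigmoid(Wi @ z + bi)                    # input gate
    C_tilde = np.tanh(WC @ z + bC)                # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde            # new cell state
    o_t = sigmoid(Wo @ z + bo)                    # output gate
    h_t = o_t * np.tanh(C_t)                      # new hidden state
    y_t = Wy @ h_t + by                           # readout y_hat_t
    return h_t, C_t, y_t

# Usage example with hypothetical sizes: 4 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
n, m, k = 4, 3, 2
params = (rng.normal(size=(m, m + n)), np.zeros(m),   # W_f, b_f
          rng.normal(size=(m, m + n)), np.zeros(m),   # W_i, b_i
          rng.normal(size=(m, m + n)), np.zeros(m),   # W_C, b_C
          rng.normal(size=(m, m + n)), np.zeros(m),   # W_o, b_o
          rng.normal(size=(k, m)), np.zeros(k))       # W_y, b_y
h, C = np.zeros(m), np.zeros(m)
h, C, y = lstm_step(rng.normal(size=n), h, C, params)
print(y)
```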

2.6 LSTM Backward Propagation

Using squared error:

$E = \sum_{t=0}^T E_t$

$E_t = \frac{1}{2} (\hat{y}_t - y_t)^2$

$\frac{\partial E}{\partial W_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot h_t$

$\frac{\partial E}{\partial b_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot 1$

Because $W_f$, $W_i$, $W_C$, $W_o$ all affect the loss through $h_t$ and $C_t$, each of their gradients can be written via the chain rule in terms of $h_t$ and $C_t$.

(1) First find the derivatives of $E$ with respect to $h_t$ and $C_t$

[Figure: the two gradient paths reaching $h_t$ and $C_t$]

As the figure shows, both $h_t$ and $C_t$ lie on two paths, so each derivative consists of two parts:

  • one from the error at the current time step;
  • the other from the accumulated errors of all time steps from $t+1$ to $T$.

$\frac{\partial E}{\partial h_t} =\frac{\partial E_t}{\partial h_t} + \frac{\partial (\sum_{k=t+1}^T E_k)}{\partial h_t}$

$\frac{\partial E}{\partial C_t} =\frac{\partial E_t}{\partial C_t} + \frac{\partial (\sum_{k=t+1}^T E_k)}{\partial C_t}$

$\frac{\partial E_t}{\partial h_t} =\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T$

$\frac{\partial E_t}{\partial C_t}=\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial C_t}= \frac{\partial E_t}{\partial h_t} \cdot o_t \cdot (1-\tanh^2(C_t))=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot o_t \cdot (1-\tanh^2(C_t))$

The following two terms cannot be computed yet, so for now we simply give them names:

$\frac{\partial (\sum_{k=t+1}^T E_k)}{\partial h_t}=dh_{next}$

$\frac{\partial (\sum_{k=t+1}^T E_k)}{\partial C_t}=dC_{next}$

(2) Derivative with respect to $W_o$

$\frac{\partial E}{\partial W_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial W_o}$

$\frac{\partial h_t}{\partial o_t}=\tanh(C_t)$

$\frac{\partial o_t}{\partial W_o} = o_t \cdot (1-o_t) \cdot x_o^T$

$\frac{\partial E}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t) \cdot x_o^T$

(3) Derivative with respect to $b_o$

$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial b_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial b_o}$

$\frac{\partial h_t}{\partial o_t} = \tanh(C_t)$

$\frac{\partial o_t}{\partial b_o}=o_t(1-o_t)$

$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t)$

(4) Derivative with respect to $x_o$

$\frac{\partial E}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial x_o}$

$\frac{\partial o_t}{\partial x_o}=o_t(1-o_t)\cdot W_o^T$

(5) Derivative with respect to $W_C$

$\frac{\partial E}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial W_C}$

$\frac{\partial C_t}{\partial \widetilde{C}_t}=i_t$

$\frac{\partial \widetilde{C}_t}{\partial W_C}=(1-\widetilde{C}_t^2)\cdot x_C^T$

(6) Derivative with respect to $b_C$

$\frac{\partial E}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial b_C}$

$\frac{\partial \widetilde{C}_t}{\partial b_C}=(1-\widetilde{C}_t^2)\cdot 1$

(7) Derivative with respect to $x_C$

$\frac{\partial E}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial x_C}$

$\frac{\partial \widetilde{C}_t}{\partial x_C}=(1-\widetilde{C}_t^2)\cdot W_C^T$

(8) Derivatives with respect to $W_i$, $b_i$, $x_i$

$\frac{\partial E}{\partial W_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial W_i}$

$\frac{\partial E}{\partial b_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial b_i}$

$\frac{\partial E}{\partial x_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial x_i}$

$\frac{\partial C_t}{\partial i_t}=\widetilde{C}_t$

$\frac{\partial i_t}{\partial W_i}=i_t\cdot (1-i_t) \cdot x_i^T$

$\frac{\partial i_t}{\partial b_i}= i_t\cdot (1-i_t) \cdot 1$

$\frac{\partial i_t}{\partial x_i}=i_t\cdot (1-i_t) \cdot W_i^T$

(9) Derivatives with respect to $W_f$, $b_f$, $x_f$

$\frac{\partial E}{\partial W_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial W_f}$

$\frac{\partial E}{\partial b_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial b_f}$

$\frac{\partial E}{\partial x_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial x_f}$

$\frac{\partial C_t}{\partial f_t}=C_{t-1}$

$\frac{\partial f_t}{\partial W_f}=f_t\cdot (1-f_t) \cdot x_f^T$

$\frac{\partial f_t}{\partial b_f}=f_t\cdot (1-f_t) \cdot 1$

$\frac{\partial f_t}{\partial x_f}=f_t\cdot (1-f_t) \cdot W_f^T$
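To tie the per-gate derivatives together, here is a minimal single-time-step sketch. The names and shapes are hypothetical: `dh_t` stands for $\partial E/\partial h_t$ (the current-step term plus $dh_{next}$), `dC_next` for the cell-state carry from step $t+1$, and `cache` for quantities saved during the forward step of the earlier `lstm_step` sketch.

```python
import numpy as np

def lstm_step_backward(dh_t, dC_next, cache):
    """Per-gate parameter gradients for one time step, following the chain rules above."""
    x_gates, C_prev, C_t, f_t, i_t, C_tilde, o_t = cache
    # dE/dC_t: path through h_t = o_t * tanh(C_t), plus the carry from t+1
    dC_t = dh_t * o_t * (1 - np.tanh(C_t) ** 2) + dC_next
    # output gate: dE/dh_t * tanh(C_t) * o_t(1 - o_t)
    do = dh_t * np.tanh(C_t) * o_t * (1 - o_t)
    # input gate and candidate: dE/dC_t times dC_t/d{i_t, C~_t} times the local derivative
    di = dC_t * C_tilde * i_t * (1 - i_t)
    dC_tilde = dC_t * i_t * (1 - C_tilde ** 2)
    # forget gate: dC_t/df_t = C_{t-1}
    df = dC_t * C_prev * f_t * (1 - f_t)
    grads = {
        "Wo": np.outer(do, x_gates), "bo": do,
        "Wi": np.outer(di, x_gates), "bi": di,
        "WC": np.outer(dC_tilde, x_gates), "bC": dC_tilde,
        "Wf": np.outer(df, x_gates), "bf": df,
    }
    dC_prev = dC_t * f_t   # carried to time step t-1 (dh_prev through x_gates is omitted here)
    return grads, dC_prev
```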
