Preface
[Deep Learning] From Neural Networks to Convolutional Neural Networks
Earlier we introduced BP neural networks and convolutional neural networks (CNNs), so why do we also need recurrent neural networks (RNNs)?
The inputs and outputs of BP neural networks and CNNs are independent of each other, but in real applications there are scenarios where the output is correlated with what came before.
BP neural networks and CNNs share one assumption: the input is an independent unit with no context, for example an image that the network classifies as dog or cat. But for sequential inputs with clear context, such as predicting the next frame of a video, the output obviously must depend on earlier inputs; in other words, the network must have a certain "memory". To give a network this memory, a neural network with a special structure, the Recurrent Neural Network (RNN), came into being.
The RNN introduces the concept of "memory"; "recurrent" means that every element performs the same task, but the output depends on both the input and the "memory".
Typical RNN applications: natural language processing, machine translation, speech recognition, and so on.
1. RNN (Recurrent Neural Network)
Recurrent neural networks are a family of neural networks for processing sequential data: just as convolutional networks are specialized for grid-like data (such as an image), recurrent networks are specialized for processing a sequence $x^{(1)},\dots,x^{(T)}$.
The structure of an RNN is as follows:
The structure of a recurrent network is simpler than that of a convolutional network: usually it contains only an input layer, a hidden layer, and an output layer, at most about five layers counting input and output.
Unrolling the sequence over time gives the RNN structure shown in the figure below:
The network input at a given time step, $x_t$, is an $n$-dimensional vector, just like the input of the BP neural network introduced earlier. The difference is that the input to a recurrent network is an entire sequence, i.e. $x=[x_1,\dots,x_{t-1},x_t,x_{t+1},\dots,x_T]$. For a language model, each $x_t$ represents a word vector, and a whole sequence represents a sentence.
$h_t$ denotes the linear (pre-activation) value of the hidden neurons at time $t$;
$s_t$ denotes the hidden state at time $t$, i.e. the "memory";
$o_t$ denotes the output at time $t$;
$U$ denotes the weights from the input layer to the hidden layer;
$W$ denotes the hidden-to-hidden weights; it is the network's memory controller, responsible for scheduling the memory;
$V$ denotes the weights from the hidden layer to the output layer. A minimal sketch of these parameter shapes follows.
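To make these roles concrete, here is a small sketch of the parameter shapes, assuming (purely for illustration) an input dimension n, a hidden dimension m, and an output dimension p; none of these names come from the original text:

```python
import numpy as np

n, m, p = 4, 8, 3                        # input / hidden / output dimensions (assumed)
rng = np.random.default_rng(0)

U = rng.normal(scale=0.1, size=(m, n))   # input layer  -> hidden layer
W = rng.normal(scale=0.1, size=(m, m))   # hidden state -> hidden state (the "memory controller")
V = rng.normal(scale=0.1, size=(p, m))   # hidden layer -> output layer
```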
1) RNN-BPTT (backpropagation through time)
RNN training is the same as CNN/ANN training: it likewise uses the BP (error backpropagation) algorithm.
The difference: in an RNN the parameters U, V, W are shared, and in stochastic gradient descent the output of each step depends not only on the network at the current step but also on the network state of several previous steps; this modified version of BP is called Backpropagation Through Time (BPTT).
Like the BP algorithm, BPTT can produce vanishing and exploding gradients when training over many steps (long-term dependencies, i.e. the current output is related to a long preceding sequence, generally more than 10 steps).
BPTT follows the same idea as BP, taking partial derivatives; the difference is that it must account for the effect of time on each step.
2) RNN forward propagation
At time $t=1$, $U$, $V$, $W$ have all been randomly initialized and $s_0$ is usually initialized to 0; we then compute:
$h_1 = Ux_1+Ws_0$
$s_1 = f(h_1)$
$o_1 = g(Vs_1)$
At time $t=2$, the state $s_1$, as the memory of time 1, takes part in the prediction at the next time step:
$h_2 = Ux_2+Ws_1$
$s_2 = f(h_2)$
$o_2 = g(Vs_2)$
And so on, giving:
$h_t = Ux_t+Ws_{t-1}$
$s_t = f(h_t)$
$o_t = g(Vs_t)$
Here $f$ can be an activation function such as tanh, ReLU, or sigmoid, and $g$ is usually softmax but can be something else.
Note that when we say a recurrent network has memory, that ability comes from $W$, which summarizes past input states and feeds the summary in as an aid to the next input.
The hidden state can be understood as: $h=f(\text{current input} + \text{summary of past memory})$
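As a sketch, the forward recursion above translates directly into NumPy, assuming tanh for $f$ and softmax for $g$ (two of the options named above); all variable names here are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V, s0):
    """xs: sequence x_1..x_T; returns hidden states s_t and outputs o_t."""
    s, states, outputs = s0, [], []
    for x in xs:
        h = U @ x + W @ s            # h_t = U x_t + W s_{t-1}
        s = np.tanh(h)               # s_t = f(h_t)
        o = softmax(V @ s)           # o_t = g(V s_t)
        states.append(s)
        outputs.append(o)
    return states, outputs

# usage with the shapes sketched earlier
rng = np.random.default_rng(0)
n, m, p, T = 4, 8, 3, 5
U = rng.normal(scale=0.1, size=(m, n))
W = rng.normal(scale=0.1, size=(m, m))
V = rng.normal(scale=0.1, size=(p, m))
xs = [rng.normal(size=n) for _ in range(T)]
states, outputs = rnn_forward(xs, U, W, V, np.zeros(m))
```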
3) RNN backward propagation
As in a BP neural network, error backpropagation sums the errors at the output layer, takes the partial derivative of that total with respect to each weight, giving the gradients $\nabla U$, $\nabla V$, $\nabla W$, and then updates the weights by gradient descent.
At each time step $t$, the network output $o_t$ incurs some error $e_t$; the error's loss function can be cross-entropy, squared error, and so on. The total error is $E=\sum_t e_t$, and our goal is to compute:
$\nabla U = \frac{\partial E}{\partial U} = \sum_t\frac{\partial e_t}{\partial U}$
$\nabla V = \frac{\partial E}{\partial V} = \sum_t\frac{\partial e_t}{\partial V}$
$\nabla W = \frac{\partial E}{\partial W} = \sum_t\frac{\partial e_t}{\partial W}$
Below we take $t=3$ as an example.
Assume we use squared error, with ground-truth values $y_i$; then:
$e_3 = \frac{1}{2}(o_3-y_3)^2$
$o_3 = g(Vs_3)$
$e_3 = \frac{1}{2}(g(Vs_3)-y_3)^2$
$s_3 = f(Ux_3+Ws_2)$
$e_3 = \frac{1}{2}(g(Vf(Ux_3+Ws_2))-y_3)^2$
Computing the partial derivative with respect to $W$:
The term in the expression above that involves $W$ is $Ws_2$; this is clearly a composite function.
We can therefore differentiate it by the chain rule for composite functions:
$\frac{\partial e_3}{\partial W} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial W}$
We now solve these factors in turn (using the squared-error loss):
$e_3 = \frac{1}{2}(o_3-y_3)^2$
$\frac{\partial e_3}{\partial o_3} = o_3 - y_3$
$o_3 = g(Vs_3)$
$\frac{\partial o_3}{\partial s_3} = g'V$
where $g'$ denotes the derivative of the function $g$.
The first two factors are straightforward; the important one is the third. From the formula:
$s_t = f(Ux_t+Ws_{t-1})$
we find that $s_3$ depends not only on $W$ but also on the previous time step's $s_2$.
Expanding $s_3$ directly gives the following expression:
$\frac{\partial s_3}{\partial W}=\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W}$
where $\frac{\partial s_3^+}{\partial W}$ denotes the direct derivative (no composite differentiation: everything other than $W$ is treated as a constant), and $\frac{\partial s_2}{\partial W}$ denotes the composite derivative.
Expanding $s_2$ directly gives the following expression:
$\frac{\partial s_2}{\partial W}=\frac{\partial s_2}{\partial s_2}\frac{\partial s_2^+}{\partial W} + \frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}$
Expanding $s_1$ directly gives the following expression:
$\frac{\partial s_1}{\partial W}=\frac{\partial s_1}{\partial s_1}\frac{\partial s_1^+}{\partial W} + \frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W}$
Substituting the last two expansions into the first gives:
$\frac{\partial s_3}{\partial W}=\sum_{k=0}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$
Finally:
$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$
Another approach (suppose we do not consider $f$):
$s_t=Ux_t+Ws_{t-1}$
$s_3=Ux_3+Ws_2$
$\frac{\partial s_3}{\partial W} = s_2+W\frac{\partial s_2}{\partial W} = s_2+Ws_1+WW\frac{\partial s_1}{\partial W}$
$s_2 = \frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}$
where $\frac{\partial s_3}{\partial s_3}=1$, and $\frac{\partial s_3^+}{\partial W}=s_2$ denotes differentiating $s_3$ with respect to $W$ directly, without composite differentiation.
$s_2=Ux_2+Ws_1$
$Ws_1 =\frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}$
where $\frac{\partial s_3}{\partial s_2}=W$ and $\frac{\partial s_2^+}{\partial W}=s_1$.
$s_1=Ux_1+Ws_0$
$WW\frac{\partial s_1}{\partial W}=\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}$
Finally:
$\frac{\partial s_3}{\partial W} =\frac{\partial s_3}{\partial s_3}\frac{\partial s_3^+}{\partial W}+\frac{\partial s_3}{\partial s_2}\frac{\partial s_2^+}{\partial W}+\frac{\partial s_3}{\partial s_1}\frac{\partial s_1^+}{\partial W}=\sum_{k=1}^3\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$
$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k^+}{\partial W}$
From the figure above, by the chain rule:
$\frac{\partial e_3}{\partial W} = \sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\Big(\prod_{j=k+1}^3\frac{\partial s_j}{\partial s_{j-1}}\Big)\frac{\partial s_k^+}{\partial W}$
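In code, this summation becomes a backward loop over time: start from the error term propagated to $s_t$, repeatedly apply the Jacobian $\frac{\partial s_j}{\partial s_{j-1}}$, and accumulate each direct term $\frac{\partial s_k^+}{\partial W}$. A sketch assuming $f=\tanh$ (so $f'(h_k)=1-s_k^2$); `states` and `delta_t` are illustrative names, not from the original:

```python
import numpy as np

def bptt_dW(states, W, delta_t, t):
    """states[k] holds s_k (states[0] = s_0); delta_t is d e_t / d s_t.
    Accumulates dW = sum_k (ds_t/ds_k)(ds_k^+/dW) for a single error at step t."""
    dW = np.zeros_like(W)
    delta = delta_t
    for k in range(t, 0, -1):                # k = t, t-1, ..., 1
        dh = delta * (1.0 - states[k] ** 2)  # back through tanh: f'(h_k) = 1 - s_k^2
        dW += np.outer(dh, states[k - 1])    # direct term ds_k^+/dW
        delta = W.T @ dh                     # apply the Jacobian ds_k/ds_{k-1}
    return dW
```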
Computing the partial derivative with respect to $U$ (similar to $W$):
$\frac{\partial e_3}{\partial U} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial s_3}{\partial U}$
Suppose:
$a_t = Ux_t,\qquad b_t=Ws_{t-1}$
$s_t = f(a_t+b_t)$
For the third factor, from the formula:
$s_3 = f(Ux_3+Ws_2)$
$\frac{\partial s_3}{\partial U}=f' \times \left(\frac{\partial Ux_3}{\partial U}+W\frac{\partial s_2}{\partial U}\right)$
$=f' \times \left(\frac{\partial Ux_3}{\partial U}+Wf' \times \left(\frac{\partial Ux_2}{\partial U}+W\frac{\partial s_1}{\partial U}\right)\right)$
$=f' \times \left(\frac{\partial Ux_3}{\partial U}+Wf' \times \left(\frac{\partial Ux_2}{\partial U}+Wf' \times \left(\frac{\partial Ux_1}{\partial U}+W\frac{\partial s_0}{\partial U}\right)\right)\right)$
$=f' \times \Bigg(\frac{\partial Ux_3}{\partial U}+Wf' \times \bigg(\frac{\partial Ux_2}{\partial U}+Wf' \times \Big(\frac{\partial Ux_1}{\partial U}+Wf' \times \big(\frac{\partial Ux_0}{\partial U}\big)\Big)\bigg)\Bigg)$
$=f' \times \frac{\partial Ux_3}{\partial U}+W(f')^2 \times \frac{\partial Ux_2}{\partial U}+W^2(f')^3 \times \frac{\partial Ux_1}{\partial U}+W^3(f')^4 \times \big(\frac{\partial Ux_0}{\partial U}\big)$
$=\sum_{k=0}^3 (f')^{4-k}\frac{\partial (W^{3-k}a_k)}{\partial U}$
$\frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}(f')^{4-k}$
I am not sure whether this result is correct; I would be very grateful if readers who know could point it out.
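One way to check a derivation like this is a finite-difference test: perturb each entry of $U$, rerun the forward pass, and compare the numerical gradient with the analytical one. A generic sketch; it assumes you supply a `loss(U)` function (a hypothetical helper, not defined in the text) that recomputes $e_3$ from scratch:

```python
import numpy as np

def numerical_grad(loss, U, eps=1e-5):
    """Central-difference estimate of d loss / d U, entry by entry."""
    grad = np.zeros_like(U)
    for idx in np.ndindex(*U.shape):
        old = U[idx]
        U[idx] = old + eps
        f_plus = loss(U)
        U[idx] = old - eps
        f_minus = loss(U)
        U[idx] = old                               # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

# sanity check against the analytical result:
# np.allclose(numerical_grad(loss, U), analytic_dU, atol=1e-5)
```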
Not considering $f$:
$s_t=Ux_t+Ws_{t-1}$
$s_3=Ux_3+W\Big(Ux_2+W\big(Ux_1+WUx_0\big)\Big)=Ux_3+WUx_2+W^2Ux_1+W^3Ux_0$
$s_3 = a_3+Wa_2+W^2a_1+W^3a_0$
$\frac{\partial s_3}{\partial U} =\sum_{k=0}^3 \frac{\partial (W^{3-k}a_k)}{\partial U}$
$\frac{\partial e_3}{\partial U} =\sum_{k=0}^3\frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial s_3}\frac{\partial (W^{3-k}a_k)}{\partial U}$
Computing the partial derivative with respect to $V$:
Since $V$ is related only to the output $o_t$:
$\frac{\partial e_3}{\partial V} = \frac{\partial e_3}{\partial o_3}\frac{\partial o_3}{\partial V}$
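In code, the $V$ gradient is therefore a single local computation per time step, with no propagation through earlier steps. A sketch assuming squared error and identity $g$ (both simplifying assumptions, not the text's general case):

```python
import numpy as np

def dV_step(o_t, y_t, s_t):
    """d e_t / d V = (o_t - y_t) s_t^T, assuming e_t = 0.5*||o_t - y_t||^2 and o_t = V s_t."""
    return np.outer(o_t - y_t, s_t)
```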
4) RNN shortcomings
From the derivation above: by the time we reach $t=100$, because $W$ has been raised to too high a power, the network may forget the information from time $t=0$. We call this the RNN vanishing gradient, though it is not vanishing in a literal sense: gradients are accumulated, so they cannot become exactly 0; rather, the gradient at some time step becomes too small, and the content of earlier time steps is forgotten.
To overcome the vanishing-gradient problem, the LSTM and GRU models were subsequently introduced. Because both have special ways of storing "memory", memories whose gradients were large previously are not immediately erased as in a plain RNN, so they can mitigate the vanishing-gradient problem to some extent.
Another simple trick, used against the exploding-gradient problem, is gradient clipping: when a computed gradient exceeds the threshold c or falls below the threshold −c, set the gradient at that moment to c or −c respectively; a sketch follows.
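A minimal sketch of element-wise gradient clipping as just described (note that some frameworks clip by the gradient's norm instead, which is a variant of the same idea):

```python
import numpy as np

def clip_gradient(grad, c=5.0):
    """Clip each entry of the gradient to the interval [-c, c]."""
    return np.clip(grad, -c, c)
```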
The figure below shows the RNN error surface:
The figure shows that the RNN error surface is either very steep or very flat. If you take no measures and a parameter update happens to land on a steep spot, the gradient becomes very large, so the parameter update is also very large, which easily leads to oscillation. If you use the gradient-clipping trick, then even if you are unlucky enough to hit a steep spot, the gradient will not explode, because it is capped at the threshold c.
2. LSTM (Long Short-Term Memory)
Because RNNs have the long-term dependency problem, they can suffer from vanishing and exploding gradients. As its name suggests, the LSTM is particularly suited to problems that require long-term dependencies. Compared with the RNN:
the LSTM's "memory cell (Cell)" has been redesigned;
information that should be recorded keeps being passed on, while information that should not be recorded is cut off.
The figure below shows the unrolled structure of the recurrent network:
the box labeled A represents the "memory cell".
The RNN "memory cell" is shown below: it is just a simple nonlinear mapping.
The LSTM "memory cell" is shown below: it adds three gates to control the "memory cell".
1) The memory cell
The cell state is like a conveyor belt: it runs directly along the entire chain, with only a few small linear interactions, so it is easy for information to flow along it unchanged.
How does the LSTM control the "cell state"? The LSTM can remove or add information to the "cell state" through gate structures. The LSTM has three main "gate" structures that control the "cell state": the forget gate, the information-addition gate, and the output gate.
2) Forget gate
The previous time step's output and the current time step's input are passed through a sigmoid, which outputs a probability value between 0 and 1.
This probability describes how much of each component can pass through.
If the value is 0, multiplying it with $C_{t-1}$ yields 0, meaning "let nothing through".
If the value is 1, multiplying it with $C_{t-1}$ still yields $C_{t-1}$, meaning "let everything through".
The "forget gate" decides what information to discard from the "cell state". For example, in a language model the cell state might contain gender information ("he" or "she"); when we see a new pronoun, we can consider forgetting the old data.
3) Information-addition gate
Decides what new information to put into the "cell state":
a sigmoid layer decides which values need updating;
a tanh layer creates a new candidate vector $\widetilde{C}_t$, mainly in preparation for the state update.
After the forget gate and the information-addition gate, the deletions and additions to the passed-on information are determined, so the "cell state" can be updated from $C_{t-1}$ to $C_t$:
multiply the old state by $f_t$, dropping the information we determined we do not want;
add the new candidate value $i_t*\widetilde{C}_t$ to obtain the final updated "cell state".
4) Output gate
The output gate produces the output based on the "cell state":
first, a sigmoid layer determines which part of the cell state will be output;
then the cell state is processed with tanh to obtain a value between -1 and 1, which is multiplied by the sigmoid gate's output, so that only the part we determined gets output.
5) LSTM forward propagation
$f_t = \sigma(W_f \cdot[h_{t-1}, x_t] + b_f)$, where we write $[h_{t-1}, x_t]$ as $x_f$
$i_t = \sigma(W_i \cdot[h_{t-1}, x_t] + b_i)$, where we write $[h_{t-1}, x_t]$ as $x_i$
$\widetilde{C}_t = \tanh(W_C \cdot [h_{t-1},x_t]+b_C)$, where we write $[h_{t-1}, x_t]$ as $x_C$
$C_t = f_t * C_{t-1} + i_t * \widetilde{C}_t$
$o_t=\sigma(W_o\cdot [h_{t-1}, x_t] + b_o)$, where we write $[h_{t-1}, x_t]$ as $x_o$
$h_t=o_t * \tanh(C_t)$
$\hat{y}_t=W_y \cdot h_t + b_y$
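A minimal NumPy sketch of one LSTM forward step, following the equations above with $x_f = x_i = x_C = x_o = [h_{t-1}, x_t]$; parameter names follow the text, shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo, Wy, by):
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)           # forget gate
    i_t = sigmoid(Wi @ z + bi)           # information-addition (input) gate
    C_tilde = np.tanh(Wc @ z + bc)       # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde   # cell-state update
    o_t = sigmoid(Wo @ z + bo)           # output gate
    h_t = o_t * np.tanh(C_t)             # hidden state
    y_t = Wy @ h_t + by                  # prediction y_hat_t
    return h_t, C_t, y_t
```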
6) LSTM backward propagation
Using squared error:
$E = \sum_{t=0}^T E_t$
$E_t = \frac{1}{2} (\hat{y}_t - y_t)^2$
$\frac{\partial E}{\partial W_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial W_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot h_t$
$\frac{\partial E}{\partial b_y} = \sum_{t=0}^T\frac{\partial E_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial b_y}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot 1$
Because $W_f$, $W_i$, $W_C$, $W_o$ are all related to $h_t$ or $C_t$, their derivative rules can all be written as chain rules through $h_t$ or $C_t$.
(1) First find the derivatives of $E$ with respect to $h_t$ and $C_t$
From the figure above, $h_t$ and $C_t$ each lie on two chains of influence, so each derivative contains two parts:
one is the derivative of the current time step's error;
the other is the derivative of all the errors accumulated from time $t+1$ to time $T$.
$\frac{\partial E}{\partial h_t} =\frac{\partial E_t}{\partial h_t} + \frac{\partial (\sum_{k=t+1}^TE_k)}{\partial h_t}$
$\frac{\partial E}{\partial C_t} =\frac{\partial E_t}{\partial C_t} + \frac{\partial (\sum_{k=t+1}^TE_k)}{\partial C_t}$
$\frac{\partial E_t}{\partial h_t} =\frac{\partial E_t}{\partial \hat{y}_t}\frac{\partial \hat{y}_t}{\partial h_t}=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T$
$\frac{\partial E_t}{\partial C_t}=\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial C_t}= \frac{\partial E_t}{\partial h_t} \cdot o_t \cdot (1-\tanh^2(C_t))=\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot o_t \cdot (1-\tanh^2(C_t))$
The following two cannot be computed yet, so for now give them names:
$\frac{\partial (\sum_{k=t+1}^TE_k)}{\partial h_t}=dh_{next}$
$\frac{\partial (\sum_{k=t+1}^TE_k)}{\partial C_t}=dC_{next}$
(2) Partial derivative with respect to $W_o$
$\frac{\partial E}{\partial W_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial W_o}$
$\frac{\partial h_t}{\partial o_t}=\tanh(C_t)$
$\frac{\partial o_t}{\partial W_o} = o_t \cdot (1-o_t) \cdot x_o^T$
$\frac{\partial E}{\partial W_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t) \cdot x_o^T$
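As a sketch, this final expression translates line-for-line into code for one time step's contribution; `dEt_dy` stands for $\partial E_t/\partial \hat{y}_t$ (an illustrative name), and the $dh_{next}$ term is omitted for brevity:

```python
import numpy as np

def dWo_step(dEt_dy, Wy, C_t, o_t, x_o):
    """One step's contribution to dE/dW_o."""
    dh = Wy.T @ dEt_dy            # dE_t/dh_t = dE_t/dy_t . Wy^T
    do = dh * np.tanh(C_t)        # dE_t/do_t = dE_t/dh_t * tanh(C_t)
    dz = do * o_t * (1.0 - o_t)   # back through the sigmoid gate
    return np.outer(dz, x_o)      # outer product with the concatenated input x_o
```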
(3) Partial derivative with respect to $b_o$
$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial b_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial b_o}$
$\frac{\partial h_t}{\partial o_t} = \tanh(C_t)$
$\frac{\partial o_t}{\partial b_o}=o_t(1-o_t)$
$\frac{\partial E}{\partial b_o} = \sum_{t=0}^T\frac{\partial E_t}{\partial \hat{y}_t}\cdot W_y^T \cdot \tanh(C_t) \cdot o_t \cdot (1-o_t)$
(4) Partial derivative with respect to $x_o$
$\frac{\partial E}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial x_o}=\sum_{t=0}^T\frac{\partial E_t}{\partial h_t}\frac{\partial h_t}{\partial o_t}\frac{\partial o_t}{\partial x_o}$
$\frac{\partial o_t}{\partial x_o}=o_t(1-o_t)\cdot W_o^T$
(5) Partial derivative with respect to $W_C$
$\frac{\partial E}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial W_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial W_C}$
$\frac{\partial C_t}{\partial \widetilde{C}_t}=i_t$
$\frac{\partial \widetilde{C}_t}{\partial W_C}=(1-\widetilde{C}_t^2)\cdot x_C^T$
(6) Partial derivative with respect to $b_C$
$\frac{\partial E}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial b_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial b_C}$
$\frac{\partial \widetilde{C}_t}{\partial b_C}=(1-\widetilde{C}_t^2)\cdot 1$
(7) Partial derivative with respect to $x_C$
$\frac{\partial E}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial x_C}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial \widetilde{C}_t}\frac{\partial \widetilde{C}_t}{\partial x_C}$
$\frac{\partial \widetilde{C}_t}{\partial x_C}=(1-\widetilde{C}_t^2)\cdot W_C^T$
(8) Partial derivatives with respect to $W_i$, $b_i$, $x_i$
$\frac{\partial E}{\partial W_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial W_i}$
$\frac{\partial E}{\partial b_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial b_i}$
$\frac{\partial E}{\partial x_i}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial i_t}\frac{\partial i_t}{\partial x_i}$
$\frac{\partial C_t}{\partial i_t}=\widetilde{C}_t$
$\frac{\partial i_t}{\partial W_i}=i_t\cdot (1-i_t) \cdot x_i^T$
$\frac{\partial i_t}{\partial b_i}= i_t\cdot (1-i_t) \cdot 1$
$\frac{\partial i_t}{\partial x_i}=i_t\cdot (1-i_t) \cdot W_i^T$
(9) Partial derivatives with respect to $W_f$, $b_f$, $x_f$
$\frac{\partial E}{\partial W_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial W_f}$
$\frac{\partial E}{\partial b_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial b_f}$
$\frac{\partial E}{\partial x_f}=\sum_{t=0}^T\frac{\partial E_t}{\partial C_t}\frac{\partial C_t}{\partial f_t}\frac{\partial f_t}{\partial x_f}$
$\frac{\partial C_t}{\partial f_t}=C_{t-1}$
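Pulling steps (5) through (9) together: with `dE_dC` standing for $\partial E/\partial C_t$ and the concatenated inputs $x_C$, $x_i$, $x_f$ as defined in the forward pass, the per-step contributions of the cell-path weights can be sketched as follows (an illustrative sketch, not the author's code):

```python
import numpy as np

def cell_path_grads_step(dE_dC, f_t, i_t, C_tilde, C_prev, x_c, x_i, x_f):
    """One step's contributions to the W_C, W_i and W_f gradients."""
    dWc = np.outer(dE_dC * i_t * (1.0 - C_tilde ** 2), x_c)   # via C~_t = tanh(.)
    dWi = np.outer(dE_dC * C_tilde * i_t * (1.0 - i_t), x_i)  # via i_t = sigmoid(.)
    dWf = np.outer(dE_dC * C_prev * f_t * (1.0 - f_t), x_f)   # via f_t = sigmoid(.)
    return dWc, dWi, dWf
```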