TensorFlow RNN Cell Source Code Analysis

Date: 2021-01-24


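This article introduces the RNN and several of its variants, together with the TensorFlow source code that implements them, and then uses short examples to show how the TensorFlow RNN classes are called.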

RNN

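RNN stands for Recurrent Neural Network. People rarely think about a problem from scratch. When reading, for example, our understanding of each word depends on what we have already read; we do not discard everything that came before and then try to understand one spot in isolation. Applied to deep learning, if we want a model to learn information that depends on preceding context, an RNN can do it: it contains a loop that lets it retain what it has learned earlier.

The structure of an RNN is as follows:

In the network structure above, the rectangular block A takes an input $x_t$ (the feature vector at time $t$) and produces a result $h_t$ (the state, or output, at time $t$). The loop in the network allows the state at one time step to be passed on to the next.

These loops can make RNNs look hard to understand, but an RNN can be viewed as an ordinary network copied many times and chained together, with each copy passing its output to the next. Unrolling the RNN over time steps gives the following:

So the most basic RNN cell takes $x_t$ as input, passes a hidden state on to the next cell, and also produces a result $h_t$. Its most basic structure is as follows:

The input $x_t$ and the hidden state are simply concatenated, passed through a linear transformation, and then through a tanh activation to produce the output. The hidden state and the output are the same value.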
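
The step described above can be sketched in a few lines of NumPy. This is only an illustration, not TensorFlow's implementation: the function name `rnn_step`, the random weights, and the tiny sizes are all assumptions made for the example.

```python
import numpy as np


def rnn_step(x, h_prev, weights, bias):
    """One step of a basic RNN cell: concat, linear transform, tanh."""
    concat = np.concatenate([x, h_prev], axis=1)
    h = np.tanh(concat @ weights + bias)
    return h, h  # the output and the new hidden state are the same array


rng = np.random.default_rng(0)
batch, input_dim, num_units = 2, 4, 3
weights = rng.normal(size=(input_dim + num_units, num_units))
bias = np.zeros(num_units)

# Unroll over three time steps, reusing the same weights at every step.
h = np.zeros((batch, num_units))
for t in range(3):
    x_t = rng.normal(size=(batch, input_dim))
    output, h = rnn_step(x_t, h, weights, bias)

print(output.shape, output is h)
```

The loop reuses one set of weights at every time step, which is what "copying the same network many times" amounts to when the RNN is unrolled.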

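Let's analyze how the RNN cell is implemented in TensorFlow.

TensorFlow implements RNN cells in python/ops/rnn_cell_impl.py. It first defines an RNNCell class that inherits from Layer. Three of its members matter most: state_size, output_size, and __call__(). state_size and output_size are declared with @property, so they are accessed as attributes rather than called as methods. The implementation is as follows: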

@property
def state_size(self):
    """size(s) of state(s) used by this cell.

    It can be represented by an Integer, a TensorShape or a tuple of Integers
    or TensorShapes.
    """
    raise NotImplementedError("Abstract method")

@property
def output_size(self):
    """Integer or TensorShape: size of outputs produced by this cell."""
    raise NotImplementedError("Abstract method")

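These represent the sizes of the cell's state and output respectively, and both are related to the number of units in the cell. Neither is implemented here, which means we must write a subclass of RNNCell that implements both.

As for __call__(), it is the method triggered when an instantiated object is itself called like a function. The implementation is as follows: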

def __call__(self, inputs, state, scope=None):
    if scope is not None:
        with vs.variable_scope(scope,
                               custom_getter=self._rnn_get_variable) as scope:
            return super(RNNCell, self).__call__(inputs, state, scope=scope)
    else:
        with vs.variable_scope(vs.get_variable_scope(),
                               custom_getter=self._rnn_get_variable):
            return super(RNNCell, self).__call__(inputs, state)

This simply calls __call__() on the parent class Layer, which in turn calls call(). The Layer class implements call() as follows:

def call(self, inputs, **kwargs):
    return inputs

The parent's call() is trivial, so to give a cell its real behavior we only need to implement call() in a subclass of RNNCell.
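
The division of labor between __call__(), call(), and the two size properties can be illustrated with a toy version in plain Python. The classes below are simplified stand-ins written for this example; in particular `EchoCell` is hypothetical, and the real Layer.__call__() also handles variable scopes and building, which is omitted here.

```python
class Layer:
    """Toy stand-in for the base layer: __call__ delegates to call()."""

    def __call__(self, inputs, *args, **kwargs):
        # Scope handling and variable creation are left out of this sketch.
        return self.call(inputs, *args, **kwargs)

    def call(self, inputs, **kwargs):
        return inputs  # identity by default


class RNNCell(Layer):
    """Abstract cell: subclasses must supply the sizes and call()."""

    @property
    def state_size(self):
        raise NotImplementedError("Abstract method")

    @property
    def output_size(self):
        raise NotImplementedError("Abstract method")


class EchoCell(RNNCell):
    """Hypothetical cell whose new state is simply input + old state."""

    def __init__(self, num_units):
        self._num_units = num_units

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def call(self, inputs, state):
        new_state = [x + s for x, s in zip(inputs, state)]
        return new_state, new_state


cell = EchoCell(num_units=3)
output, state = cell([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(cell.state_size, output)
```

Calling `cell(...)` goes through the inherited __call__() and lands in the subclass's call(), while reading `state_size` on the abstract base raises NotImplementedError.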

Next, let's look at the most basic RNN cell implementation, called BasicRNNCell. Its code is as follows:

class BasicRNNCell(RNNCell):
    """The most basic RNN cell.

    Args:
      num_units: int, The number of units in the RNN cell.
      activation: Nonlinearity to use. Default: `tanh`.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope. If not `True`, and the existing scope already has
        the given variables, an error is raised.
    """

    def __init__(self, num_units, activation=None, reuse=None):
        super(BasicRNNCell, self).__init__(_reuse=reuse)
        self._num_units = num_units
        self._activation = activation or math_ops.tanh
        self._linear = None

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def call(self, inputs, state):
        """Most basic RNN: output = new_state = act(W * input + U * state + B)."""
        if self._linear is None:
            self._linear = _Linear([inputs, state], self._num_units, True)

        output = self._activation(self._linear([inputs, state]))
        return output, output

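As we can see, the most important initialization argument is num_units, the number of units (neurons) in the cell. There is also an activation argument, the activation function to use, which defaults to tanh, and reuse, which controls whether variables in an existing scope are reused.

Both state_size and output_size return num_units, the number of units. The call() method then takes inputs and state, that is, the input x and the previous hidden state. On the first call it instantiates a _Linear object, a class that performs a linear transformation, passing in both tensors; it then calls that object directly, which computes the linear transformation w * [inputs, state] + b. The __call__() method of the _Linear class is implemented as follows: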

def __call__(self, args):
    if not self._is_sequence:
        args = [args]

    if len(args) == 1:
        res = math_ops.matmul(args[0], self._weights)
    else:
        res = math_ops.matmul(array_ops.concat(args, 1), self._weights)
    if self._build_bias:
        res = nn_ops.bias_add(res, self._biases)
    return res

Here [inputs, state] is passed as args to __call__(), so concat() and matmul() are executed, followed by bias_add(), which together implement the linear transformation.
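
The docstring writes the cell as act(W * input + U * state + B), while the code concatenates and multiplies once. A small NumPy check, using made-up sizes and random values chosen only for this example, suggests why the two are the same thing: W and U are just the top and bottom row blocks of the single weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, input_dim, num_units = 4, 5, 3

x = rng.normal(size=(batch, input_dim))      # inputs
h = rng.normal(size=(batch, num_units))      # previous state
weights = rng.normal(size=(input_dim + num_units, num_units))
bias = rng.normal(size=(num_units,))

# What the concat + matmul + bias_add path computes.
fused = np.concatenate([x, h], axis=1) @ weights + bias

# The same thing with the weight matrix split by rows into W and U.
W, U = weights[:input_dim], weights[input_dim:]
separate = x @ W + h @ U + bias

print(np.allclose(fused, separate))
```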

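Finally, back in BasicRNNCell.call(), the result of self._linear(...) is wrapped in self._activation(...), so the linear transformation is passed through a tanh activation to give the output.

The return value is output, output: the first is the output and the second is the hidden state, which has the same value.

Let's try it with an example: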

# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.nn.rnn_cell).
import tensorflow as tf

cell = tf.nn.rnn_cell.BasicRNNCell(num_units=128)
print(cell.state_size)
inputs = tf.placeholder(tf.float32, shape=[32, 100])
h0 = cell.zero_state(32, tf.float32)
output, h1 = cell(inputs=inputs, state=h0)
print(output, output.shape)
print(h1, h1.shape)

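Here we first create a BasicRNNCell with 128 units, then build a placeholder of shape [32, 100] as inputs, representing a batch_size of 32 and a feature dimension of 100. We then create the initial hidden state with zero_state() and call the cell directly, which ultimately invokes its call() method, giving output and h1. The printed result is: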

128
Tensor("basic_rnn_cell/Tanh:0", shape=(32, 128), dtype=float32) (32, 128)
Tensor("basic_rnn_cell/Tanh:0", shape=(32, 128), dtype=float32) (32, 128)

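When the input has dimension 100, passing it through a cell with 128 units yields an output of dimension 128, so the output shape becomes [32, 128]. The output and the hidden state are identical here.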

LSTM

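RNNs exist mainly because they can connect earlier information to the present task. For example, earlier information can help us understand the current content.

Sometimes we only need to look at fairly recent information to handle the current task. In a language model, for instance, we predict the next word from the preceding text. When we see "the clouds are in the ?" we need no further information to guess that the next word should be "sky". In such cases the gap between the relevant information and the thing to be predicted is small, and an RNN can easily make use of the past information:

But when we need to rely on information from much further back in the text, a plain RNN struggles. As the gap grows, it becomes hard for the RNN to connect the two:

LSTM can be used to address this problem.

LSTM, short for Long Short Term Memory network, is a variant of the RNN. Experiments show that it works on a wider range of problems and achieves very good results.

The structure of an LSTM cell is as follows:

The key to LSTMs is the cell state, the horizontal line running across the top of the diagram.

The cell state acts like a conveyor belt: the vector passes straight through the whole cell with only a few minor linear operations. This structure makes it easy for information to flow through the cell unchanged.

That horizontal line alone cannot add or remove information. Information is manipulated through structures called gates.

We can divide the gates into three: the Forget Gate, the Input Gate, and the Output Gate.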

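Forget Gate

First, the LSTM decides which information is allowed to continue through the cell. This is done by the sigmoid layer of the Forget Gate. Its inputs are $h_{t-1}$ and $x_t$, and its output is a vector of values between 0 and 1 giving the proportion of each component of $C_{t-1}$ that is let through: 0 means "let nothing through" and 1 means "let everything through".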

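Input Gate

The next step decides how much new information is added to the cell. A sigmoid layer called the Input Gate decides which values to update, and a New Input branch passes through tanh to produce a vector of candidate update values, $\tilde{C}_t$. In the following step these two parts are combined to update the cell state.

After the Forget Gate and Input Gate, we can update the incoming $C_{t-1}$ to $C_t$. We first multiply the old state $C_{t-1}$ by $f_t$, forgetting the information we do not want to keep, and then add $i_t * \tilde{C}_t$, the new content to be added. This completes the update of $C_{t-1}$.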

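Output Gate

Finally we decide what to output. The output depends mainly on the cell state $C_t$, but not on $C_t$ alone; it goes through a filtering step. First, a sigmoid layer decides which parts of $C_t$ will be output. Then $C_t$ is passed through a tanh activation and multiplied by the weights computed by the sigmoid, giving the final output.

In the end the cell produces three things. The output is what the top arrow points to, the final computed result. The hidden state has two parts, $C_t$ and the $h_t$ at the bottom, which we can combine into a single variable.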
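
The three gates described above correspond to the commonly used LSTM formulation, summarized here for reference. The weight and bias names ($W_f$, $b_f$, and so on) are notation assumed for this summary and do not appear in the original text; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```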

Next, let's look at the TensorFlow implementation of the LSTM cell.

The class is BasicLSTMCell, which inherits from RNNCell. Its initializer __init__() is implemented as follows:

def __init__(self, num_units, forget_bias=1.0,
             state_is_tuple=True, activation=None, reuse=None):
    super(BasicLSTMCell, self).__init__(_reuse=reuse)
    if not state_is_tuple:
        logging.warn("%s: Using a concatenated state is slower and will soon be "
                     "deprecated. Use state_is_tuple=True.", self)
    self._num_units = num_units
    self._forget_bias = forget_bias
    self._state_is_tuple = state_is_tuple
    self._activation = activation or math_ops.tanh
    self._linear = None

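The only required argument is still num_units, the number of units. forget_bias is the bias added to the Forget Gate, state_is_tuple indicates whether the state is represented as a tuple, activation is the activation function (tanh by default), and reuse controls whether variables in an existing scope are reused.

Next, the state_size and output_size properties are implemented as follows: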

@property
def state_size(self):
    return (LSTMStateTuple(self._num_units, self._num_units)
            if self._state_is_tuple else 2 * self._num_units)

@property
def output_size(self):
    return self._num_units

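state_size has changed here: because the state has to carry both $C_t$ and the hidden state, it contains two parts. If state_is_tuple is True, the state is represented as a tuple; otherwise its size is the single number 2 * num_units. The tuple form is the default. output_size is unchanged.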
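
The two state layouts can be compared with a small NumPy sketch. The namedtuple below is a stand-in defined for this example (it is assumed to mirror the c and h fields of TensorFlow's LSTMStateTuple), and the sizes are arbitrary.

```python
import collections
import numpy as np

# Stand-in for LSTMStateTuple: a namedtuple with fields c and h.
LSTMStateTuple = collections.namedtuple("LSTMStateTuple", ("c", "h"))

batch, num_units = 2, 3
c = np.zeros((batch, num_units))
h = np.ones((batch, num_units))

# Tuple form: the two parts stay separate, each [batch, num_units].
tuple_state = LSTMStateTuple(c, h)

# Concatenated form: one array of width 2 * num_units.
concat_state = np.concatenate([c, h], axis=1)

# Splitting along axis 1 recovers c and h from the concatenated form.
c_back, h_back = np.split(concat_state, 2, axis=1)

print(tuple_state.c.shape, concat_state.shape)
print(np.array_equal(c_back, c), np.array_equal(h_back, h))
```

The concatenated form has to be split again on every step, which is consistent with the warning in __init__() that it is slower than the tuple form.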

The call() method is implemented as follows:

def call(self, inputs, state):
    """Long short-term memory cell (LSTM).

    Args:
      inputs: `2-D` tensor with shape `[batch_size x input_size]`.
      state: An `LSTMStateTuple` of state tensors, each shaped
        `[batch_size x self.state_size]`, if `state_is_tuple` has been set to
        `True`. Otherwise, a `Tensor` shaped
        `[batch_size x 2 * self.state_size]`.

    Returns:
      A pair containing the new hidden state, and the new state (either a
        `LSTMStateTuple` or a concatenated state, depending on
        `state_is_tuple`).
    """
    sigmoid = math_ops.sigmoid
    # Parameters of gates are concatenated into one multiply for efficiency.
    if self._state_is_tuple:
        c, h = state
    else:
        c, h = array_ops.split(value=state, num_or_size_splits=2, axis=1)

    if self._linear is None:
        self._linear = _Linear([inputs, h], 4 * self._num_units, True)
    # i = input_gate, j = new_input, f = forget_gate, o = output_gate
    i, j, f, o = array_ops.split(
        value=self._linear([inputs, h]), num_or_size_splits=4, axis=1)

    new_c = (
        c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j))
    new_h = self._activation(new_c) * sigmoid(o)

    if self._state_is_tuple:
        new_state = LSTMStateTuple(new_c, new_h)
    else:
        new_state = array_ops.concat([new_c, new_h], 1)
    return new_h, new_state

First, to obtain c and h, they must be separated from state. If state is a tuple it can be unpacked directly; otherwise split() is called to divide it:

if self._state_is_tuple:
    c, h = state
else:
    c, h = array_ops.split(value=state, num_or_size_splits=2, axis=1)

Next come the gates:

i, j, f, o = array_ops.split(value=self._linear([inputs, h]), num_or_size_splits=4, axis=1)

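The inputs and h are fed through a single _Linear together and the result is split into four parts, i, j, f, and o, representing the Input Gate, New Input, Forget Gate, and Output Gate. At this point all four have been through the linear transformation: multiplied by the weights and offset by the bias.

Next, $C_{t-1}$ is updated to $C_t$ and the hidden-state output is computed, both following the LSTM formulas: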

new_c = (c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j))
new_h = self._activation(new_c) * sigmoid(o)

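Note the extra _forget_bias term added here. It is a constant bias added to the Forget Gate before the sigmoid (1.0 by default), so that the gate starts out mostly open and the cell forgets less at the beginning of training.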
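
The effect of adding a constant before the sigmoid is easy to see numerically. The sketch below assumes, purely for illustration, that the forget-gate pre-activation f is close to 0 right after initialization.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Forget-gate pre-activation right after initialization, assumed to be near 0.
f = 0.0

gate_without_bias = sigmoid(f)        # 0.5: half of the old cell state is kept
gate_with_bias = sigmoid(f + 1.0)     # about 0.73: more of it is kept

print(round(gate_without_bias, 3), round(gate_with_bias, 3))
```

With the bias, a larger fraction of the previous cell state is carried forward at each step.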

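Finally new_c and new_h are combined. If a tuple is required they are packed into a tuple; otherwise they are concatenated. The method returns new_h and new_state: the former is the cell's output and the latter is the hidden state: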

if self._state_is_tuple:
    new_state = LSTMStateTuple(new_c, new_h)
else:
    new_state = array_ops.concat([new_c, new_h], 1)
return new_h, new_state
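
Putting the pieces of call() together, one LSTM step can be sketched in NumPy. This is an illustrative re-creation rather than TensorFlow code: `lstm_step` is a name made up for this example, the weights are random, and the state is returned as a plain tuple.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, c, h, weights, bias, forget_bias=1.0):
    """One LSTM step in NumPy, following the order of operations in call()."""
    # A single matmul produces all four gate pre-activations at once.
    z = np.concatenate([x, h], axis=1) @ weights + bias
    i, j, f, o = np.split(z, 4, axis=1)
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    new_h = np.tanh(new_c) * sigmoid(o)
    return new_h, (new_c, new_h)


rng = np.random.default_rng(0)
batch, input_dim, num_units = 32, 100, 128
x = rng.normal(size=(batch, input_dim))
c0 = np.zeros((batch, num_units))
h0 = np.zeros((batch, num_units))
# Hypothetical small random weights; TensorFlow uses its own initializers.
weights = rng.normal(scale=0.1, size=(input_dim + num_units, 4 * num_units))
bias = np.zeros(4 * num_units)

output, (c1, h1) = lstm_step(x, c0, h0, weights, bias)
print(output.shape, c1.shape, h1.shape)
```

The weight matrix has 4 * num_units columns because the four gate blocks are computed in one multiplication and then split apart.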

Let's use another example to see how BasicLSTMCell is used:

# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.nn.rnn_cell).
import tensorflow as tf

cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=128)
print(cell.state_size)
inputs = tf.placeholder(tf.float32, shape=(32, 100))
h0 = cell.zero_state(32, tf.float32)
output, h1 = cell(inputs=inputs, state=h0)
print(h1)
print(h1.h, h1.h.shape)
print(h1.c, h1.c.shape)
print(output, output.shape)

Output:

LSTMStateTuple(c=128, h=128)
LSTMStateTuple(c=<tf.Tensor 'add_1:0' shape=(32, 128) dtype=float32>, h=<tf.Tensor 'mul_2:0' shape=(32, 128) dtype=float32>)
Tensor("mul_2:0", shape=(32, 128), dtype=float32) (32, 128)
Tensor("add_1:0", shape=(32, 128), dtype=float32) (32, 128)
Tensor("mul_2:0", shape=(32, 128), dtype=float32) (32, 128)

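All the shapes are [32, 128], and h1.h and output are the same tensor.

LSTM has many variants. One well-known variant was proposed by Gers & Schmidhuber (2000); it adds Peephole Connections to the original design, so that the Forget Gate can also be influenced by $C_{t-1}$.

Another variant couples the Forget Gate and the Input Gate, so that the cell either forgets old information and takes in new information, or keeps the old and takes in nothing new.

An even more commonly used variant is the GRU, proposed by Cho, et al. (2014). The same work also proposed the Seq2Seq model, paving the way for generation models.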

GRU

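GRU stands for Gated Recurrent Unit. A GRU has only two gates: the Reset Gate and the Update Gate. It also merges $C_t$ and the hidden state into one, so the overall structure is simpler than a standard LSTM, and it has since become very popular.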
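
For reference, the two gates and the merged state can be written as the equations below. The weight and bias names are notation assumed for this summary, and the form of the last line (which term is scaled by $z_t$ and which by $1 - z_t$) follows the TensorFlow GRUCell code discussed in this article; some references use the opposite convention.

```latex
\begin{aligned}
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t] + b_r\right) \\
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t] + b_z\right) \\
\tilde{h}_t &= \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h\right) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
```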

Next, let's look at TensorFlow's GRUCell implementation. The code is as follows:

class GRUCell(RNNCell):
    """Gated Recurrent Unit cell (cf. http://arxiv.org/abs/1406.1078).

    Args:
      num_units: int, The number of units in the GRU cell.
      activation: Nonlinearity to use. Default: `tanh`.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope. If not `True`, and the existing scope already has
        the given variables, an error is raised.
      kernel_initializer: (optional) The initializer to use for the weight and
        projection matrices.
      bias_initializer: (optional) The initializer to use for the bias.
    """

    def __init__(self,
                 num_units,
                 activation=None,
                 reuse=None,
                 kernel_initializer=None,
                 bias_initializer=None):
        super(GRUCell, self).__init__(_reuse=reuse)
        self._num_units = num_units
        self._activation = activation or math_ops.tanh
        self._kernel_initializer = kernel_initializer
        self._bias_initializer = bias_initializer
        self._gate_linear = None
        self._candidate_linear = None

    @property
    def state_size(self):
        return self._num_units

    @property
    def output_size(self):
        return self._num_units

    def call(self, inputs, state):
        """Gated recurrent unit (GRU) with nunits cells."""
        if self._gate_linear is None:
            bias_ones = self._bias_initializer
            if self._bias_initializer is None:
                bias_ones = init_ops.constant_initializer(1.0, dtype=inputs.dtype)
            with vs.variable_scope("gates"):  # Reset gate and update gate.
                self._gate_linear = _Linear(
                    [inputs, state],
                    2 * self._num_units,
                    True,
                    bias_initializer=bias_ones,
                    kernel_initializer=self._kernel_initializer)

        value = math_ops.sigmoid(self._gate_linear([inputs, state]))
        r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)

        r_state = r * state
        if self._candidate_linear is None:
            with vs.variable_scope("candidate"):
                self._candidate_linear = _Linear(
                    [inputs, r_state],
                    self._num_units,
                    True,
                    bias_initializer=self._bias_initializer,
                    kernel_initializer=self._kernel_initializer)
        c = self._activation(self._candidate_linear([inputs, r_state]))
        new_h = u * state + (1 - u) * c
        return new_h, new_h

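Both state_size and output_size return num_units, the number of units.

In call(), the Reset Gate $r_t$ and Update Gate $z_t$ are represented by the variables r and u. To compute them, $h_{t-1}$ (that is, state) and $x_t$ are first combined, passed through a linear transformation, and then through the sigmoid function: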

value = math_ops.sigmoid(self._gate_linear([inputs, state]))
r, u = array_ops.split(value=value, num_or_size_splits=2, axis=1)

Next we need $\tilde{h}_t$. First multiply $r_t$ by $h_{t-1}$, that is, state:

r_state = r * state

Then pass it through the linear function and apply the tanh activation:

c = self._activation(self._candidate_linear([inputs, r_state]))

Finally compute the hidden state and the output, which are the same:

new_h = u * state + (1 - u) * c
return new_h, new_h

This returns the output and the hidden state.
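
The same sequence of operations can be sketched in NumPy. As before, this is an illustration under stated assumptions: `gru_step` is a made-up name, the weights are random, the gate bias of 1.0 mirrors the default bias_ones in the code above, and the candidate bias is simply set to zero here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, gate_w, gate_b, cand_w, cand_b):
    """One GRU step in NumPy, following the order of operations in call()."""
    # Reset gate r and update gate u come from one fused linear layer.
    gates = sigmoid(np.concatenate([x, h], axis=1) @ gate_w + gate_b)
    r, u = np.split(gates, 2, axis=1)
    # The candidate state sees the previous state scaled by the reset gate.
    c = np.tanh(np.concatenate([x, r * h], axis=1) @ cand_w + cand_b)
    new_h = u * h + (1 - u) * c
    return new_h, new_h


rng = np.random.default_rng(0)
batch, input_dim, num_units = 32, 100, 128
x = rng.normal(size=(batch, input_dim))
h0 = np.zeros((batch, num_units))
# Hypothetical random weights, chosen only to make the sketch runnable.
gate_w = rng.normal(scale=0.1, size=(input_dim + num_units, 2 * num_units))
gate_b = np.ones(2 * num_units)
cand_w = rng.normal(scale=0.1, size=(input_dim + num_units, num_units))
cand_b = np.zeros(num_units)

output, h1 = gru_step(x, h0, gate_w, gate_b, cand_w, cand_b)
print(output.shape, output is h1)
```

Because u lies between 0 and 1, the new state is an element-wise blend of the previous state and the candidate.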

Let's try it with an example:

# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.nn.rnn_cell).
import tensorflow as tf

cell = tf.nn.rnn_cell.GRUCell(num_units=128)
print(cell.state_size)
inputs = tf.placeholder(tf.float32, shape=[32, 100])
h0 = cell.zero_state(32, tf.float32)
output, h1 = cell(inputs=inputs, state=h0)
print(output, output.shape)
print(h1, h1.shape)

Output:

128
Tensor("gru_cell/add:0", shape=(32, 128), dtype=float32) (32, 128)
Tensor("gru_cell/add:0", shape=(32, 128), dtype=float32) (32, 128)

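The result looks no different from BasicRNNCell, but the internal structure of GRUCell generally makes the model perform better, so GRUCell is commonly chosen in place of the plain BasicRNNCell.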
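
Although the output shapes match, the three cells differ in how many parameters they carry. The rough count below is derived only from the _Linear sizes that appear in the code shown in this article (one weight matrix over the concatenated [inputs, state] plus a bias), using the example's input dimension of 100 and 128 units; `linear_params` is a helper made up for this calculation.

```python
def linear_params(input_width, output_width):
    """Weights plus biases of one linear layer over the concatenated input."""
    return input_width * output_width + output_width


input_dim, num_units = 100, 128
concat_width = input_dim + num_units  # [inputs, state] concatenated

basic_rnn = linear_params(concat_width, num_units)
basic_lstm = linear_params(concat_width, 4 * num_units)
gru = (linear_params(concat_width, 2 * num_units)   # reset + update gates
       + linear_params(concat_width, num_units))    # candidate state

print(basic_rnn, basic_lstm, gru)
```

Under these assumptions the GRU has three times the parameters of the basic cell and the LSTM four times.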

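Conclusion

That covers the RNN and some of its variants, with an analysis of how the code works and examples of its use. Understanding this material is a great help when working with Dynamic RNN, multi-layer RNNs, and custom RNN cells, so it is worth mastering.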
