This post is based on a translation and interpretation of "Understanding the difficulty of training deep feedforward neural networks", together with summaries from several Zhihu and CSDN bloggers, and analyzes initialization methods for deep networks.
Key point: there is a paper-writing technique here (whether it always works remains to be verified). You have probably noticed that typical deep learning papers open with experiments, tell the reader the results are good, and then work backwards from the results to give unverifiable reasons that may or may not be correct. This paper, although fairly simple overall, has a very rigorous structure: it first analyzes the problems of the standard initialization method through experiments; then, from two objectives (keeping the variance of the activations and the variance of the gradients unchanged), it derives the properties the parameters should have and gives the concrete form of Xavier initialization; finally, it verifies experimentally that Xavier initialization indeed works well.
- Experiments
- Tell the reader the experimental results are good
- Reason backwards from the results to causes that cannot be verified and may or may not be correct; set up objectives
- Verify against the objectives that the conjecture indeed works well
Translation and interpretation of the paper
Premises of the analysis (the paper assumes the network is operating in the linear regime of the activation at initialization, with independently initialized weights and inputs of equal variance):
1. Forward propagation. In the paper's words: "From a forward-propagation point of view, to keep information flowing we would like that" the variance of the activations stays the same across layers. Once this condition is written down, the key part follows; together with the matching back-propagation condition on the gradient variance, it leads to the Xavier form (both conditions are reproduced below).
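The equation images from the original post are not reproduced here. For reference, the two conditions stated in the Glorot and Bengio paper (equal activation variance and equal gradient variance across layers) and the initialization they lead to are, in the paper's notation, where n_i is the width of layer i:

\forall (i, i'),\quad \mathrm{Var}\left[z^{i}\right] = \mathrm{Var}\left[z^{i'}\right]            % forward condition
\forall (i, i'),\quad \mathrm{Var}\left[\tfrac{\partial \mathrm{Cost}}{\partial s^{i}}\right] = \mathrm{Var}\left[\tfrac{\partial \mathrm{Cost}}{\partial s^{i'}}\right]   % backward condition
\mathrm{Var}\left[W^{i}\right] = \frac{2}{n_i + n_{i+1}}                                          % compromise between the two
W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; +\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]   % normalized initialization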
The gradient issue during training:
At this point we can no longer simply analyze things through the variance of the gradients, because our assumption conditions are no longer satisfied.
The rest of the paper, in my view, contains little that is essential. Here are the few summary points I consider necessary:
The softsign activation works quite well compared with the hyperbolic tangent (tanh).
The normalized initialization method works well (a short sketch of it is given below).
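As a reference, here is a minimal sketch of the normalized (uniform Xavier) initialization from the paper, W ~ U[-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})]. The function name initialize_parameters_normalized is mine; the layers_dims argument and the fixed seed follow the helper functions used later in this post.

import numpy as np

def initialize_parameters_normalized(layers_dims):
    """Normalized (uniform Xavier) initialization:
    W ~ U[-limit, +limit] with limit = sqrt(6 / (fan_in + fan_out))."""
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)                 # number of layers in the network
    for l in range(1, L):
        fan_in, fan_out = layers_dims[l - 1], layers_dims[l]
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        parameters['W' + str(l)] = np.random.uniform(-limit, limit, (fan_out, fan_in))
        parameters['b' + str(l)] = np.zeros((fan_out, 1))
    return parameters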
In linear regression and logistic regression we usually initialize the parameters to 0, and the model still works well. In a neural network, however, initializing W to 0 does not work. The reason is that if W is initialized to 0, every neuron in a layer learns the same thing (their outputs are identical), and during backpropagation the neurons within a layer also stay identical, because their gradients are the same. The following code demonstrates initializing W to 0:
def initialize_parameters_zeros(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  Wl -- weight matrix of shape (layers_dims[l], layers_dims[l-1])
                  bl -- bias vector of shape (layers_dims[l], 1)
    """
    parameters = {}
    np.random.seed(3)                    # seed kept for consistency with the other initializers
    L = len(layers_dims)                 # number of layers in the network
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
We can look at how the cost function changes (the figure from the original post is not reproduced here); a minimal check of the symmetry argument follows.
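Since the figure is not reproduced, here is a minimal check of the symmetry argument above, reusing the initialize_parameters_zeros function defined just before. The tiny network, the made-up data, and the one-step manual backprop are illustrative assumptions, not part of the original post.

import numpy as np

np.random.seed(0)

# made-up tiny problem: 3 inputs, 4 hidden units, 1 output, 5 samples
X = np.random.randn(3, 5)
Y = (np.random.rand(1, 5) > 0.5).astype(float)

params = initialize_parameters_zeros([3, 4, 1])
W1, b1 = params['W1'], params['b1']
W2, b2 = params['W2'], params['b2']

# forward pass: tanh hidden layer, sigmoid output
A1 = np.tanh(W1 @ X + b1)
A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

# backward pass for the cross-entropy loss
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / X.shape[1]
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
dW1 = dZ1 @ X.T / X.shape[1]

# every row of dW1 is identical (here all zero), so the hidden units
# stay identical after the update and never learn different features
print(dW1)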
What is commonly used at present is random initialization, i.e., W is initialized randomly. The code for random initialization is as follows:
def initialize_parameters_random(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  Wl -- weight matrix of shape (layers_dims[l], layers_dims[l-1])
                  bl -- bias vector of shape (layers_dims[l], 1)
    """
    np.random.seed(3)                    # this seed makes sure your "random" numbers will be the same as ours
    parameters = {}
    L = len(layers_dims)                 # integer representing the number of layers
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
We multiply by 0.01 because we want W initialized to relatively small values: if X is large and W is also relatively large, Z = WX + b becomes very large, and with a sigmoid activation the output saturates at 1 or 0, which then causes a series of problems (for example, the argument of the log in the cost function becomes 0, which is troublesome). A small demonstration follows.
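A minimal sketch of the saturation problem described above; the sigmoid helper and the example values are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# a moderately large pre-activation already saturates the sigmoid
z = np.array([-40.0, -20.0, 0.0, 20.0, 40.0])
a = sigmoid(z)
print(a)                        # roughly [0, 0, 0.5, ~1, 1.0]

# the cross-entropy cost needs log(a) or log(1 - a), which blows up at 0
with np.errstate(divide='ignore'):
    print(-np.log(1.0 - a))     # inf where a has saturated to exactly 1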
After random initialization, the original post's figure (not reproduced here) shows how the cost function changes with the number of iterations. The next experiment looks at the distribution of each layer's activation outputs when W is drawn from N(0, 0.01^2) and the activation is tanh:
import numpy as np
import matplotlib.pyplot as plt

def initialize_parameters(layer_dims):
    """
    :param layer_dims: list, the number of units (dimension) of each layer
    :return: dictionary storing the parameters W1, W2, ..., WL, b1, ..., bL
    """
    np.random.seed(3)
    L = len(layer_dims)                  # the number of layers in the network
    parameters = {}
    for l in range(1, L):
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

def forward_propagation():
    data = np.random.randn(1000, 100000)
    # layer_sizes = [100 - 10 * i for i in range(0, 5)]
    layer_sizes = [1000, 800, 500, 300, 200, 100, 10]
    num_layers = len(layer_sizes)
    parameters = initialize_parameters(layer_sizes)
    A = data
    for l in range(1, num_layers):
        A_pre = A
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        z = np.dot(W, A_pre) + b         # compute z = wx + b
        A = np.tanh(z)
        # plot the distribution of this layer's activations
        plt.subplot(2, 3, l)
        plt.hist(A.flatten(), facecolor='g')
        plt.xlim([-1, 1])
        plt.yticks([])
    plt.show()

forward_propagation()
3. Xavier initialization. Xavier initialization is another initialization method proposed by Glorot et al. to address the problem with random initialization. Their idea is quite simple: make the input and output of each layer follow the same distribution as far as possible, which avoids the activation outputs of later layers trending toward 0. (The paper's derivation, reproduced earlier, gives Var[W] = 2 / (n_in + n_out); the code sketch below uses the simpler Gaussian variant with variance 1 / n_in.) Their initialization method is:
def initialize_parameters_xavier(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  Wl -- weight matrix of shape (layers_dims[l], layers_dims[l-1])
                  bl -- bias vector of shape (layers_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)                 # integer representing the number of layers
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * np.sqrt(1 / layers_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
Let's look at the distribution of each layer's activation outputs after Xavier initialization (figure not reproduced here):
With ReLU activations, however, Xavier initialization again lets the outputs of deeper layers shrink toward 0. To solve this problem, an initialization method designed for ReLU was proposed, usually called He initialization. The initialization is:
def initialize_parameters_he(layers_dims):
    """
    Arguments:
    layers_dims -- python array (list) containing the size of each layer.

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                  Wl -- weight matrix of shape (layers_dims[l], layers_dims[l-1])
                  bl -- bias vector of shape (layers_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)                 # integer representing the number of layers
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * np.sqrt(2 / layers_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
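A brief note on the scaling choice: ReLU zeroes out roughly half of its inputs, so the weight variance is doubled (2 / fan_in instead of 1 / fan_in) to keep the output variance roughly constant from layer to layer; this is the argument He et al. give for ReLU networks.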
Let's look at the distribution of the activation outputs after He initialization, when the hidden layers use ReLU (figure not reproduced here); a sketch that re-runs the experiment follows.
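Since the figure is not reproduced, here is a minimal variant of the earlier forward_propagation demo using ReLU with He scaling. The layer sizes are the same assumed ones as before; the smaller batch size and the function name forward_propagation_relu_he are my own choices for illustration.

import numpy as np
import matplotlib.pyplot as plt

def forward_propagation_relu_he():
    np.random.seed(3)
    data = np.random.randn(1000, 10000)           # smaller batch than before, just for speed
    layer_sizes = [1000, 800, 500, 300, 200, 100, 10]
    A = data
    for l in range(1, len(layer_sizes)):
        fan_in = layer_sizes[l - 1]
        W = np.random.randn(layer_sizes[l], fan_in) * np.sqrt(2 / fan_in)   # He scaling
        b = np.zeros((layer_sizes[l], 1))
        A = np.maximum(0, np.dot(W, A) + b)       # ReLU activation
        # histogram of this layer's activations
        plt.subplot(2, 3, l)
        plt.hist(A.flatten(), facecolor='g')
        plt.yticks([])
    plt.show()

forward_propagation_relu_he()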
1. Activation functions: tanh and softsign work better than sigmoid (a small comparison sketch follows).
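softsign is not defined earlier in the post; for reference, softsign(x) = x / (1 + |x|), which saturates much more slowly than tanh or sigmoid. The short plotting script below is only an illustrative sketch:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 500)
sigmoid = 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh(x)
softsign = x / (1.0 + np.abs(x))        # approaches +/-1 polynomially, not exponentially

plt.plot(x, sigmoid, label='sigmoid')
plt.plot(x, tanh, label='tanh')
plt.plot(x, softsign, label='softsign')
plt.legend()
plt.show()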
References
Xavier Glorot, Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.