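The first post in this series, "UFLDL Deep Learning Notes (1): Basics and Sparse Autoencoders", discussed a sparse autoencoder network whose activation function is the \(sigmoid\) function. This post covers the UFLDL "linear decoder". The only difference is that the output layer drops the \(sigmoid\) and uses the computed value \(z\) directly as its output. The reason for a linear output is to avoid having to rescale the input range: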
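The sigmoid activation function outputs values in the range [0,1], so when \(f(z^{(3)})\) uses this activation, the inputs have to be constrained or scaled to lie in [0,1] as well. Some datasets, such as MNIST, are easily scaled to [0,1], but for others this requirement on the input values is hard to satisfy. For example, inputs that have gone through PCA whitening no longer lie in [0,1], and it is not clear what the best way is to scale such data into a fixed range.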
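The range argument can be illustrated with a small sketch. The target values below are hypothetical stand-ins for whitened inputs, not data from this post; the point is only that a sigmoid output unit cannot reach targets outside (0, 1), while a linear unit can.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical target values, standing in for whitened inputs outside [0, 1].
target = [-2.5; -0.3; 0.0; 1.7];

% Sigmoid output unit: every output stays strictly inside (0, 1).
z = linspace(-10, 10, 201);
sigOut = sigmoid(z);

% Lower bound on the sigmoid unit's squared error: target clipped to [0, 1].
sigErr = sum((min(max(target, 0), 1) - target).^2);

% Linear output unit: a = z, so choosing z = target reproduces the target exactly.
linErr = sum((target - target).^2);

fprintf('sigmoid outputs lie in [%.5f, %.5f]\n', min(sigOut), max(sigOut));
fprintf('squared error bound: sigmoid %.2f, linear %.2f\n', sigErr, linErr);
```

However the weights are trained, the sigmoid unit's reconstruction error on such targets cannot fall below the clipped bound, which is why the output layer is made linear.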
Since the activation function of the output layer has changed, the residual and partial-derivative formulas have to be derived again.
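The network with a linear output still has three layers, \(n_l=3\). The autoencoder output is linear, \(a_i^{(n_l)} = z_i^{(n_l)}\), so \(f'(z_i^{(n_l)})=1\), and the output-layer residual is: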
\[\begin{align} \delta_i^{(3)} &= -(y_i-a_i^{(n_l)}) \cdot f'(z_i^{(n_l)}) \\ &= -(y_i-a_i^{(n_l)}) \end{align}\]
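For reference, this residual follows in one step from the per-sample squared-error cost. The sketch below assumes the same cost as in the sparse autoencoder post, \(J = \frac 1 2 \sum_i (y_i - a_i^{(3)})^2\); the sigmoid line is included only for comparison:

\[\begin{align} \delta_i^{(3)} = \frac{\partial J}{\partial z_i^{(3)}} &= -(y_i - a_i^{(3)}) \cdot f'(z_i^{(3)}) \\ \text{sigmoid output:} \quad \delta_i^{(3)} &= -(y_i - a_i^{(3)}) \cdot a_i^{(3)}(1 - a_i^{(3)}) \\ \text{linear output:} \quad \delta_i^{(3)} &= -(y_i - a_i^{(3)}) \end{align}\]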
Backpropagation then gives the residual of the hidden layer:
\[\begin{align} \delta^{(2)} &= \left({W^{(2)}}^T \delta^{(3)}\right) .* f'(z^{(2)}) \\ &= \left({W^{(2)}}^T \delta^{(3)}\right) .* \left(a^{(2)} .* (1-a^{(2)})\right) \end{align}\]
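The expression above leaves out the sparsity penalty. In the cost function listed later in this post (the `residualPenalty` variable), the KL-divergence term adds one more term inside the parentheses. Written in the same notation, with \(\rho\) the target activation (`sparsityParam`), \(\hat\rho\) the mean hidden activation over the samples and \(\beta\) the penalty weight, this reads roughly as:

\[ \delta^{(2)} = \left({W^{(2)}}^T \delta^{(3)} + \beta\left(-\frac{\rho}{\hat\rho} + \frac{1-\rho}{1-\hat\rho}\right)\right) .* \left(a^{(2)} .* (1-a^{(2)})\right) \]

The sparsity term depends only on the hidden unit, so it is the same for every sample.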
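From the relationship between gradients and residuals, with the \(m\) samples stacked as the columns of \(\delta^{(3)}\) and \(a^{(2)}\), and \(\mathbf{1}_m\) denoting a column vector of \(m\) ones (weight decay left out for brevity), we get:
\[\begin{align} \frac{\partial J}{\partial W^{(2)}} &= \frac 1 m \delta^{(3)} {a^{(2)}}^T \\ \frac{\partial J}{\partial b^{(2)}} &= \frac 1 m \delta^{(3)} \mathbf{1}_m \end{align}\]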
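In the same way, the first-layer gradients follow from \(\delta^{(2)}\) and the input \(a^{(1)}=x\). The input layer applies no activation function and has no parameters feeding into it, so no residual \(\delta^{(1)}\) is needed. With the \(m\) samples stacked as columns and \(\mathbf{1}_m\) a column vector of \(m\) ones:
\[\begin{align} \frac{\partial J}{\partial W^{(1)}} &= \frac 1 m \delta^{(2)} {a^{(1)}}^T \\ \frac{\partial J}{\partial b^{(1)}} &= \frac 1 m \delta^{(2)} \mathbf{1}_m \end{align}\]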
This gives the gradients of the linear-decoder autoencoder's cost function with respect to the network parameters \(W^{(1)}, b^{(1)}; W^{(2)}, b^{(2)}\).
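A quick way to gain confidence in these formulas is a numerical gradient check. The following is a minimal sketch on a tiny, made-up problem (the sizes, the deterministic data and all helper names are assumptions for illustration). It uses only the squared-error term, without weight decay or sparsity, and compares the analytic gradient with central differences.

```matlab
v = 3; h = 2; m = 5;                        % toy sizes: visible units, hidden units, samples
x = cos(reshape(1:v*m, v, m));              % deterministic stand-in for whitened data
theta = 0.1 * sin((1:(2*h*v + h + v))');    % packed as [W1(:); W2(:); b1; b2]

sigmoid = @(z) 1 ./ (1 + exp(-z));
% Hypothetical unpacking helpers, using the same layout as theta above
getW1 = @(t) reshape(t(1:h*v), h, v);
getW2 = @(t) reshape(t(h*v+1:2*h*v), v, h);
getb1 = @(t) t(2*h*v+1:2*h*v+h);
getb2 = @(t) t(2*h*v+h+1:end);
hid  = @(t) sigmoid(getW1(t)*x + getb1(t)*ones(1,m));
out  = @(t) getW2(t)*hid(t) + getb2(t)*ones(1,m);   % linear output layer
cost = @(t) sum(sum((out(t) - x).^2)) / (2*m);      % squared error only

% Analytic gradient from the residual formulas
a2 = hid(theta); a3 = out(theta);
delta3 = -(x - a3);
delta2 = (getW2(theta)' * delta3) .* (a2 .* (1 - a2));
W2grad = delta3 * a2' / m;
W1grad = delta2 * x' / m;
b2grad = sum(delta3, 2) / m;
b1grad = sum(delta2, 2) / m;
grad = [W1grad(:); W2grad(:); b1grad(:); b2grad(:)];

% Central-difference numerical gradient
epsilon = 1e-4;
numgrad = zeros(size(theta));
for k = 1:numel(theta)
    e = zeros(size(theta)); e(k) = epsilon;
    numgrad(k) = (cost(theta + e) - cost(theta - e)) / (2*epsilon);
end

relErr = norm(numgrad - grad) / norm(numgrad + grad);
fprintf('relative difference: %g\n', relErr);
```

A very small relative difference suggests the analytic gradient agrees with the cost. The same kind of check can be applied to the full cost with weight decay and sparsity before running the optimizer.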
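As the steps above show, the only difference from the sparse autoencoder lies in the form of the gradient formulas. The overall procedure, the weight-decay penalty and the sparsity constraint are reused unchanged from the sparse autoencoder. The one module that has to be added is the cost and gradient function sparseAutoencoderLinearCost.m; see https://github.com/codgeek/deeplearning for details.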
```matlab
function [cost,grad] = sparseAutoencoderLinearCost(theta, visibleSize, hiddenSize, ...
                                                   lambda, sparsityParam, beta, data)
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example.

% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

%% ---------- YOUR CODE HERE --------------------------------------
% forward propagation
[~, m] = size(data);                    % data is visibleSize×N_samples, m = N_samples
a2 = sigmoid(W1*data + b1*ones(1,m));   % hidden-layer activations: hiddenSize×N_samples
a3 = W2*a2 + b2*ones(1,m);              % linear decoder outputs z directly: visibleSize×N_samples
diff = a3 - data;
penalty = mean(a2, 2);                  % average hidden-layer activation: hiddenSize×1
% sparsity penalty factor that enters the residual delta2
residualPenalty = (-sparsityParam./penalty + (1 - sparsityParam)./(1 - penalty)).*beta;
cost = sum(sum((diff.*diff)))./(2*m) + ...
    (sum(sum(W1.*W1)) + sum(sum(W2.*W2))).*lambda./2 + ...
    beta.*sum(KLdivergence(sparsityParam, penalty));

% back propagation
delta3 = -(data-a3);                    % linear decoder: visibleSize×N_samples
% hiddenSize×N_samples. Note: W2'*delta3, not W1'*delta3
delta2 = (W2'*delta3 + residualPenalty*ones(1, m)).*(a2.*(1-a2));

% grad J(l) = delta(l+1,i)*a(l,j). The matrix product hiddenSize×N_samples * N_samples×visibleSize
% sums the gradient over all samples, so the mean is taken below by dividing by N_samples.
W2grad = (a2*(delta3'))';
W1grad = (data*(delta2'))';             % matrix product visibleSize×N_samples * N_samples×hiddenSize
b1grad = sum(delta2, 2);
b2grad = sum(delta3, 2);

% mean value across N_samples, plus weight decay for the weights
W1grad = W1grad./m + lambda.*W1;
W2grad = W2grad./m + lambda.*W2;
b1grad = b1grad./m;
b2grad = b2grad./m;                     % visibleSize×1

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

function value = KLdivergence(pmean, p)
    value = pmean.*log(pmean./p) + (1 - pmean).*log((1 - pmean)./(1 - p));
end
```
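The data comes from the STL-10 dataset. Note that we use downsampled images, each an 8×8 color image. The raw data also has to go through ZCA whitening. Thanks to MATLAB's rich library functions, each step, such as the SVD and the whitening itself, takes only a single line of code.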
```matlab
% Apply ZCA whitening
sigma = patches * patches' / numPatches;
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';
patches = ZCAWhite * patches;
```
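To see what the whitening step does, the sketch below applies the same formula to synthetic data and checks that the covariance of the result is close to the identity. The mixing matrix, the sample count and the value of `epsilon` are arbitrary assumptions, and the explicit mean subtraction is added here because the formula treats `sigma` as a covariance matrix.

```matlab
numPatches = 2000;
A = [2 0 0 0; 1 1 0 0; 0 0.5 3 0; 0 0 1 0.5];   % arbitrary mixing matrix to correlate the rows
patches = A * randn(4, numPatches);              % synthetic stand-in for image patches

% Zero-mean each dimension first (an assumption of this sketch)
patches = patches - mean(patches, 2) * ones(1, numPatches);

epsilon = 1e-5;                                  % small regulariser, hypothetical value
sigma = patches * patches' / numPatches;
[u, s, ~] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';
whitened = ZCAWhite * patches;

% After whitening, the covariance should be close to the identity matrix.
covWhite = whitened * whitened' / numPatches;
maxDev = max(max(abs(covWhite - eye(4))));
fprintf('max deviation from identity: %g\n', maxDev);
```

Because the whitening matrix has the form `u * D * u'`, it is symmetric. The whitened values are not confined to [0,1], which is the situation the linear decoder is meant to handle.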
STL-10 original images downsampled to 8×8-pixel images
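We use the same parameters as the exercise instructions. The STL-10 data consists of 8×8-pixel color images, so the input layer has 192 units (8×8×3), the hidden layer is set to 400 nodes, and the output layer again has 192 nodes. Running the main script linearDecoderExercise.m learns color image features, as shown in the figure above. This post only extracts features from the data and does not go on to classification. The features are left for the convolutional neural network in a later post.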
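The layer sizes also fix the length of the parameter vector `theta` that the cost function unpacks. The sketch below works through the arithmetic and a pack/unpack round trip. The variable names `patchDim` and `imageChannels` and the placeholder parameter values are assumptions for illustration.

```matlab
patchDim = 8; imageChannels = 3;                  % 8x8 colour patches, as described in the text
visibleSize = patchDim * patchDim * imageChannels;
hiddenSize = 400;

% theta holds W1, W2, b1, b2 in that order
numParams = 2 * hiddenSize * visibleSize + hiddenSize + visibleSize;

% Round trip with placeholder values: pack, then unpack b2 the way the cost function does
W1 = zeros(hiddenSize, visibleSize);
W2 = ones(visibleSize, hiddenSize);
b1 = 2 * ones(hiddenSize, 1);
b2 = 3 * ones(visibleSize, 1);
theta = [W1(:); W2(:); b1(:); b2(:)];
b2back = theta(2*hiddenSize*visibleSize + hiddenSize + 1:end);

fprintf('visibleSize = %d, parameters = %d\n', visibleSize, numel(theta));
```

With 192 visible and 400 hidden units this comes to 2 × 76800 + 400 + 192 = 154192 parameters, which is the length of the vector the optimizer works on.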