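The previous posts all revolved around neural networks, whose hallmark is a nonlinear activation function. Earlier research also produced sparse coding, which uses a linear activation. The method tries to learn a feature set for the data directly: using the basis vectors associated with that feature set, the learned features are mapped from feature space back to sample space, so the sample data can be reconstructed from the features.
Denote the data set, the basis vectors, and the feature set by \(x\), \(A\), and \(s\) respectively. The cost function is constructed as follows: the reconstruction error is penalized with the squared L2 norm, and the sparsity penalty uses the L1 norm. The original tutorial does not average the error term over the data set; in practice it is divided by the number of examples \(m\).
\[J(A,s)= \frac 1m\|As-x\|_2^2+\lambda\|s\|_1\]
Next, the original tutorial explains that this sparsity constraint can be sidestepped: scaling \(s\) down by a constant and scaling \(A\) up by the same constant leaves the reconstruction error unchanged while lowering the sparsity penalty. To prevent this, the magnitude of \(A\) is penalized as well, and the cost function becomes
\[J(A,s)= \frac 1m\|As-x\|_2^2+\lambda\|s\|_1+\gamma \|A\|_2^2\]
The cost function still has the problem that the L1 norm is not differentiable at 0. This is solved with a smooth approximation: define a constant smoothing parameter \(\epsilon\) and change the cost function to
\[J(A,s)= \frac 1m\|As-x\|_2^2+\lambda \sum_k \sqrt{s_k^2+\epsilon} +\gamma \|A\|_2^2\]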
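A small numeric sketch of what the smoothing does, written in Python rather than the exercise's MATLAB so that it runs on its own. The function names `smooth_abs` and `smooth_abs_grad` and the value of `eps` are illustrative choices, not taken from the original post.

```python
import math

def smooth_abs(s, eps):
    """Smoothed |s|: sqrt(s^2 + eps), differentiable everywhere for eps > 0."""
    return math.sqrt(s * s + eps)

def smooth_abs_grad(s, eps):
    """Derivative of sqrt(s^2 + eps) with respect to s."""
    return s / math.sqrt(s * s + eps)

eps = 1e-2  # illustrative smoothing constant
for s in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(f"s={s:+.1f}  |s|={abs(s):.3f}  "
          f"smooth={smooth_abs(s, eps):.3f}  grad={smooth_abs_grad(s, eps):+.3f}")
```

Away from zero the smoothed value stays close to \(|s|\); at zero the derivative is well defined and equal to 0, which is what a gradient-based optimizer needs.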
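This cost function is not jointly convex in \(A\) and \(s\), but it is convex in either variable once the other is fixed, so it can be minimized by alternately fixing \(A\) and \(s\). In theory, the feature set obtained by optimizing the expression above is about the same as the one learned by a sparse autoencoder.
Human vision has a notable property: neurons in area V1 of the cortex detect edges at particular orientations, and these neurons are (physically) organized into hypercolumns, in which adjacent neurons detect edges at similar orientations. If one neuron detects horizontal edges, its neighbors detect edges slightly off horizontal. To give the algorithm this topographic property, meaning that the activations of adjacent features vary continuously and smoothly, the penalty is changed to take neighboring feature values into account. With 2×2 groups, the term \(\sqrt{s_{1,1}^2+\epsilon}\) is replaced by \(\sqrt{s_{1,1}^2+s_{1,2}^2+s_{2,1}^2 +s_{2,2}^2+ \epsilon}\). The cost function for topographic sparse coding is then
\[J(A,s)= \frac 1m\|As-x\|_2^2+\lambda \sum_{\text{all groups } G} \sqrt{ \sum_{s \in G}s^2+\epsilon} +\gamma \|A\|_2^2\]
Going one step further, a grouping matrix \(G\) encodes the neighborhood rule: \(G_{r,c}=1\) means that group \(r\) contains feature \(c\). The objective is rewritten as
\[J(A,s)= \frac 1m\|As-x\|_2^2+\lambda \sum_{r,i} \sqrt{ \left[G(s \odot s)\right]_{r,i}+\epsilon} +\gamma \|A\|_2^2\]
Here \(s \odot s\) is the elementwise square of \(s\), and the sum runs over every group \(r\) and every example \(i\).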
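To make the grouping matrix concrete, here is one possible way to build it, sketched in Python with NumPy rather than MATLAB. It assumes the features sit on a square grid and that neighborhoods wrap around the grid borders; the helper names `make_group_matrix` and `topographic_penalty` are hypothetical and not part of the exercise code.

```python
import numpy as np

def make_group_matrix(grid_dim, pool_dim):
    """Return G with G[r, c] = 1 if feature c belongs to group r.

    Features are laid out row-major on a grid_dim x grid_dim grid. Group r is
    the pool_dim x pool_dim window whose top-left corner is feature r, wrapping
    around the grid borders (an assumption of this sketch).
    """
    num_features = grid_dim * grid_dim
    G = np.zeros((num_features, num_features))
    for row in range(grid_dim):
        for col in range(grid_dim):
            r = row * grid_dim + col
            for dr in range(pool_dim):
                for dc in range(pool_dim):
                    c = ((row + dr) % grid_dim) * grid_dim + (col + dc) % grid_dim
                    G[r, c] = 1.0
    return G

def topographic_penalty(G, s, eps):
    """Sum over groups and examples of sqrt(G (s*s) + eps); s has shape (F, m)."""
    return np.sqrt(G @ (s * s) + eps).sum()

G = make_group_matrix(grid_dim=4, pool_dim=2)
s = np.zeros((16, 1))
s[0, 0] = 3.0                    # a single active feature
print(G.sum(axis=1))             # number of features in each group
print(topographic_penalty(G, s, eps=0.0))
```

With `pool_dim=1` the matrix is the identity and the penalty reduces to the plain smoothed L1 term, which is also the fallback the exercise code uses when no `groupMatrix` is passed.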
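The symbols above are quite abstract. A vectorized implementation has to make sure that every row and column operation really carries out the intended per-element rule. The original tutorial does not cover this part, and it took me some effort to derive.
Following the alternating strategy for \(A\) and \(s\) described above, we first derive the partial derivatives of the cost with respect to \(A\) and \(s\). Let \(A=[W_{j,f}]_{visibleSize \times featureSize}\) and \(s=[S_{f,i}]_{featureSize\times m}\), and write \(V=visibleSize\), \(F=featureSize\).
The L1 (sparsity) term does not depend on \(A\), so its partial derivative with respect to \(A\) is 0.
\[\frac {\partial J(A,s)} {\partial W_{j,f}} =\frac 1 m \sum _{i=1}^m 2[W_{j,1}S_{1,i}+W_{j,2}S_{2,i}+\dots+W_{j,F}S_{F,i} -x_{j,i}]S_{f,i}+ 2\gamma W_{j,f}\]
Collecting the entries into matrix form gives
\[\frac {\partial J(A,s)} {\partial A} = \frac 2 m (As-x)s^T +2\gamma A \]
This expression is linear in \(A\), so setting it to zero gives \(A(ss^T+m\gamma I)=xs^T\), and the minimizer can be written in closed form. With \(s\) fixed, the \(A\) that minimizes the cost is
\[\min_A J(A,s) \Leftrightarrow A = xs^T\left(ss^T+m \gamma I\right)^{-1}\]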
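As a sanity check, the closed-form solution can be verified numerically: plugging it back into the gradient with respect to \(A\) should give zero up to round-off. The sketch below uses Python with NumPy on random toy data; the sizes, the value of `gamma`, and the helper name `grad_A` are assumptions made for illustration. The sparsity term is left out because it is constant in \(A\).

```python
import numpy as np

rng = np.random.default_rng(0)
V, F, m = 6, 4, 20           # visibleSize, featureSize, number of examples (toy sizes)
gamma = 0.1                  # illustrative weight-decay value
x = rng.standard_normal((V, m))
s = rng.standard_normal((F, m))

def grad_A(A, s, x, gamma):
    """Gradient of (1/m)||As - x||^2 + gamma ||A||^2 with respect to A."""
    m = x.shape[1]
    return (2.0 / m) * (A @ s - x) @ s.T + 2.0 * gamma * A

# Closed form: A (s s^T + m*gamma*I) = x s^T. The matrix in parentheses is
# symmetric, so solve the transposed system instead of forming an inverse.
M = s @ s.T + m * gamma * np.eye(F)
A_star = np.linalg.solve(M, s @ x.T).T

print(np.abs(grad_A(A_star, s, x, gamma)).max())  # expected to be at round-off level
```

Solving the linear system plays the same role as the `/` (right division) in the MATLAB update line used later in the post.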
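Expanding the cost function and differentiating with respect to \(s\),
\[\begin{aligned} \frac {\partial J(A,s)} {\partial S_{f,i}} &= \frac 1 m \sum _{j=1}^V 2[W_{j,1}S_{1,i}+W_{j,2}S_{2,i}+\dots+W_{j,F}S_{F,i} -x_{j,i}]W_{j,f}+ \lambda \frac {\partial} {\partial S_{f,i}} \sum_{l=1}^F \sum_{k=1}^m \sqrt {\left[G(s \odot s)\right]_{l,k}+\epsilon } \\ &= \frac 1 m \sum _{j=1}^V 2[W_{j,1}S_{1,i}+W_{j,2}S_{2,i}+\dots+W_{j,F}S_{F,i} -x_{j,i}]W_{j,f} + \lambda S_{f,i}\sum_{l=1}^F{\frac {g_{l,f}} {S\_smooth_{l,i}}} \end{aligned}\]
where \(G=[g_{l,f}]_{F \times F}\), \(g_{l,f}=1\) means that group \(l\) contains feature \(f\), and \(\odot\) is the elementwise product. \(S\_smooth=\sqrt{G(s \odot s)+\epsilon}\) (elementwise square root) is the feature matrix after smoothing over each feature's neighborhood, as topographic coding requires.
Rewriting in matrix form, the two sums each become a matrix product:
\[ \frac {\partial J(A,s)} {\partial s} = \frac 2 m A^T(As-x) + \lambda\, s \odot \left(G^T (1 ./ S\_smooth)\right)\]
The gradient with respect to \(s\) does not yield a closed-form minimizer. The optimization has to be done iteratively, for example with gradient descent or a similar method.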
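Because the index bookkeeping in this gradient is easy to get wrong (in particular \(G\) versus \(G^T\)), a finite-difference check is a cheap safeguard. The following is a sketch in Python with NumPy, not the exercise's own gradient checker; the toy sizes, the cyclic two-feature grouping, and the names `feature_cost` and `numerical_grad` are assumptions made for illustration.

```python
import numpy as np

def feature_cost(A, s, x, G, lam, gamma, eps):
    """Cost and gradient with respect to s, following the matrix formulas above."""
    m = x.shape[1]
    err = A @ s - x
    smooth = np.sqrt(G @ (s * s) + eps)
    cost = (err * err).sum() / m + lam * smooth.sum() + gamma * (A * A).sum()
    grad = (2.0 / m) * A.T @ err + lam * s * (G.T @ (1.0 / smooth))
    return cost, grad

def numerical_grad(f, s, h=1e-6):
    """Central-difference gradient of the scalar function f at s."""
    g = np.zeros_like(s)
    for idx in np.ndindex(s.shape):
        old = s[idx]
        s[idx] = old + h
        up = f(s)
        s[idx] = old - h
        down = f(s)
        s[idx] = old              # restore the entry before moving on
        g[idx] = (up - down) / (2 * h)
    return g

rng = np.random.default_rng(1)
V, F, m = 5, 4, 3
A = rng.standard_normal((V, F))
s = rng.standard_normal((F, m))
x = rng.standard_normal((V, m))
# Toy grouping: group r holds features r and r+1 (cyclic). It is deliberately
# asymmetric so that confusing G with its transpose would show up in the check.
G = np.eye(F) + np.roll(np.eye(F), 1, axis=1)
lam, gamma, eps = 0.5, 0.1, 1e-2

cost, grad = feature_cost(A, s, x, G, lam, gamma, eps)
num = numerical_grad(lambda t: feature_cost(A, t, x, G, lam, gamma, eps)[0], s)
print(np.abs(grad - num).max())   # small if the analytic gradient is right
```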
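We now have all the matrix expressions needed to write the code.
In this exercise the main file is sparseCodingExercise.m, and the cost and gradient with respect to \(A\) and \(s\) are computed in sparseCodingWeightCost.m and sparseCodingFeatureCost.m respectively. The formula parts of both files are filled in according to the matrix derivation above; the full code is at https://github.com/codgeek/deeplearning.
The alternating optimization, fixing \(A\) and \(s\) in turn, is implemented in sparseCodingExercise.m. A few points need attention there, otherwise it is very hard to train anything useful.
- At the start of each iteration the feature matrix is initialized from the current weights: featureMatrix = weightMatrix'*batchPatches;
- sampleIMAGES.m must not call the normalization routine normalizeData. This differs from the sparse autoencoder exercise.
- With the features fixed, the weights are updated with the closed-form solution derived above: weightMatrix = (batchPatches*(featureMatrix'))/(featureMatrix*(featureMatrix')+gamma*batchNumPatches*eye(numFeatures));
The code for the two cost and gradient functions follows.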

    function [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
    %sparseCodingWeightCost - given the features in featureMatrix,
    %                         computes the cost and gradient with respect to
    %                         the weights, given in weightMatrix
    % parameters
    %   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
    %                   vector.
    %   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
    %                   for the cth example
    %   visibleSize   - number of pixels in the patches
    %   numFeatures   - number of features
    %   patches       - patches
    %   gamma         - weight decay parameter (on weightMatrix)
    %   lambda        - L1 sparsity weight (on featureMatrix)
    %   epsilon       - L1 sparsity epsilon
    %   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
    %                   features included in the rth group. groupMatrix(r, c)
    %                   is 1 if the cth feature is in the rth group and 0
    %                   otherwise.
        if exist('groupMatrix', 'var')
            assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
        else
            groupMatrix = eye(numFeatures);
        end

        numExamples = size(patches, 2);
        weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
        featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

        % -------------------- YOUR CODE HERE --------------------
        % Instructions:
        %   Write code to compute the cost and gradient with respect to the
        %   weights given in weightMatrix.
        % -------------------- YOUR CODE HERE --------------------
        linearError = weightMatrix * featureMatrix - patches;
        % The error term is the squared L2 norm, so no square root is taken.
        normError = sum(sum(linearError .* linearError)) ./ numExamples;
        normWeight = sum(sum(weightMatrix .* weightMatrix));

        topoFeature = groupMatrix * (featureMatrix .* featureMatrix);
        smoothFeature = sqrt(topoFeature + epsilon);
        % The L1 norm is sum(|x|); with smoothing it becomes
        % sum(sqrt(x.^2 + epsilon)). It is easy to write
        % sqrt(sum(x.^2 + epsilon)) by mistake, which is the L2 norm instead.
        costFeature = sum(sum(smoothFeature));

        cost = normError + lambda .* costFeature + gamma .* normWeight;
        grad = 2 ./ numExamples .* (linearError * featureMatrix') + (2 * gamma) .* weightMatrix;
        grad = grad(:);
    end


    function [cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
    %sparseCodingFeatureCost - given the weights in weightMatrix,
    %                          computes the cost and gradient with respect to
    %                          the features, given in featureMatrix
    % parameters
    %   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
    %                   vector.
    %   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
    %                   for the cth example
    %   visibleSize   - number of pixels in the patches
    %   numFeatures   - number of features
    %   patches       - patches
    %   gamma         - weight decay parameter (on weightMatrix)
    %   lambda        - L1 sparsity weight (on featureMatrix)
    %   epsilon       - L1 sparsity epsilon
    %   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
    %                   features included in the rth group. groupMatrix(r, c)
    %                   is 1 if the cth feature is in the rth group and 0
    %                   otherwise.
        if exist('groupMatrix', 'var')
            assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
        else
            groupMatrix = eye(numFeatures);
        end

        numExamples = size(patches, 2);
        weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
        featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

        linearError = weightMatrix * featureMatrix - patches;
        normError = sum(sum(linearError .* linearError)) ./ numExamples;
        normWeight = sum(sum(weightMatrix .* weightMatrix));

        topoFeature = groupMatrix * (featureMatrix .* featureMatrix);
        smoothFeature = sqrt(topoFeature + epsilon);
        % The L1 norm is sum(|x|); with smoothing it becomes
        % sum(sqrt(x.^2 + epsilon)). It is easy to write
        % sqrt(sum(x.^2 + epsilon)) by mistake, which is the L2 norm instead.
        costFeature = sum(sum(smoothFeature));

        cost = normError + lambda .* costFeature + gamma .* normWeight;
        % The (f,i) term is not the only one with a nonzero derivative: every
        % group whose row of groupMatrix is nonzero in column f contributes to
        % the derivative with respect to s(f,i).
        grad = 2 ./ numExamples .* (weightMatrix' * linearError) + lambda .* featureMatrix .* ((groupMatrix') * (1 ./ smoothFeature));
        grad = grad(:);
    end

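For orientation, the alternating scheme that uses these two pieces can be sketched end to end. This is a simplified Python/NumPy illustration, not a port of sparseCodingExercise.m: it assumes an identity grouping matrix (no topography), random toy data, and plain gradient descent with a conservative fixed step in place of the lbfgs or cg optimizers. The function names and all parameter values are made up for the example.

```python
import numpy as np

def cost(A, s, x, lam, gamma, eps):
    """Smoothed sparse-coding cost with identity grouping."""
    m = x.shape[1]
    err = A @ s - x
    return (err * err).sum() / m + lam * np.sqrt(s * s + eps).sum() + gamma * (A * A).sum()

def update_features(A, s, x, lam, eps, steps=50):
    """A fixed: gradient descent on s with step 1/L, where L bounds the
    Lipschitz constant of the gradient, so the cost cannot increase."""
    m = x.shape[1]
    L = (2.0 / m) * np.linalg.norm(A, 2) ** 2 + lam / np.sqrt(eps)
    for _ in range(steps):
        grad = (2.0 / m) * A.T @ (A @ s - x) + lam * s / np.sqrt(s * s + eps)
        s = s - grad / L
    return s

def update_weights(s, x, gamma):
    """s fixed: closed-form minimizer A = x s^T (s s^T + m*gamma*I)^(-1)."""
    F, m = s.shape
    M = s @ s.T + m * gamma * np.eye(F)
    return np.linalg.solve(M, s @ x.T).T

rng = np.random.default_rng(0)
V, F, m = 8, 12, 40                  # toy sizes, F > V (over-complete)
lam, gamma, eps = 0.1, 0.01, 1e-2    # illustrative hyperparameters
x = rng.standard_normal((V, m))
A = rng.standard_normal((V, F))
s = A.T @ x                          # same idea as featureMatrix = weightMatrix'*batchPatches

history = [cost(A, s, x, lam, gamma, eps)]
for _ in range(10):
    s = update_features(A, s, x, lam, eps)   # step 1: optimize s with A fixed
    A = update_weights(s, x, gamma)          # step 2: solve for A with s fixed
    history.append(cost(A, s, x, lam, gamma, eps))
print(history[0], history[-1])
```

Each half-step minimizes or decreases a convex subproblem, so the recorded cost should not go up from one outer iteration to the next, even though the joint problem is not convex.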
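The data are the same images used in the sparse autoencoder post. The feature layer has 121 nodes, the input layer is an 8×8 patch (64 nodes), the topographic neighborhood is a 3×3 square, and training runs for 200 iterations.
Each activation responds most strongly when the input equals its corresponding feature, so each column of \(A\) (64 values) is reshaped into an 8×8 image patch, which visualizes that feature. There is one per hidden unit, 121 in total. The results are shown below. With the current parameters and the same number of iterations, the features learned with cg are visibly sharper.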
lbfgs | cg |
---|---|
![]() | ![]() |
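To see finer results, I increased the number of nodes in both the feature layer and the input layer: the feature layer uses 256 nodes, and input patches of 14×14 and 15×15 were tried. The topographic neighborhood has to grow accordingly, so a 5×5 square is used. The optimizer is cg. The sharpness of the features and the completeness of the topographic structure are now indistinguishable from the results in the tutorial's example, and the edge features are arranged in an orderly way. When the input is increased to 16×16, however, training degrades and the edge features start to blur. The reason is understandable: the feature layer (256 nodes) is no longer larger than the input layer (16×16 = 256 nodes), so the over-complete basis condition no longer holds and the results get worse.
14×14 input, 5×5 topography | 15×15 input, 5×5 topography |
---|---|
![]() | ![]() |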
Results with more input nodes:
16×16 input, 3×3 topography | 16×16 input, 5×5 topography |
---|---|
![]() | ![]() |