Sparse Coding

时间 2019-11-11

标签 sparse coding 繁體版

原文原文链接

Sparse Coding

Sparse coding is a class of unsupervised methods for learning sets of over-complete bases to represent data efficiently. —— 过完备的基，无监督 The aim of sparse coding is to find a set of basis vectors $\mathbf{\phi}_i$ such that we can represent an input vector $\mathbf{x}$ as a linear combination of these basis vectors:算法

$\begin{align} \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} \end{align}$

While techniques such as Principal Component Analysis (PCA) allow us to learn a complete set of basis vectors efficiently, we wish to learn an
over-complete
set of basis vectors to represent input vectors
$\mathbf{x}\in\mathbb{R}^n$
(i.e. such that
k
>
n
).
The advantage of having an over-complete basis is that our basis vectors are better able to capture structures and patterns inherent in the input data.
However, with an over-complete basis,
the coefficients a_i are no longer uniquely determined
by the input vector
$\mathbf{x}$
. Therefore, in sparse coding, we introduce the additional criterion of sparsity t

o resolve the degeneracy introduced by over-completeness.

过完备的基能发掘出数据的内在结构和模型，可是其系数表示将不惟一app

Here, we define sparsity as having few non-zero components or having few components not close to zero. The requirement that our coefficients a_i be sparse means that given a input vector, we would like as few of our coefficients to be far from zero as possible. The choice of sparsity as a desired characteristic of our representation of the input data can be motivated by the observation that most sensory data such as natural images may be described as the superposition of a small number of atomic elements such as surfaces or edges. 原子图像的叠加 Other justifications such as comparisons to the properties of the primary visual cortex have also been advanced.less

We define the sparse coding cost function on a set of m input vectors aside

$\begin{align} \text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \end{align}$

where S(.) is a sparsity cost function which penalizes a_i for being far from zero.函数

稀疏惩罚项学习

We can interpret the first term of the sparse coding objective as a reconstruction term which tries to force the algorithm to provide a good representation of $\mathbf{x}$ and the second term as a sparsity penalty which forces our representation of $\mathbf{x}$ to be sparse. The constant λ is a scaling constant to determine the relative importance of these two contributions.测试

Although the most direct measure of sparsity is the "L₀" norm ( $S(a_i) = \mathbf{1}(|a_i|>0)$ ), it is non-differentiable （不可微） and difficult to optimize in general. In practice, common choices for the sparsity cost S(.) are the L₁ penalty $S(a_i)=\left|a_i\right|_1$ and the log penalty $S(a_i)=\log(1+a_i^2)$ .优化

In addition, it is also possible to make the sparsity penalty arbitrarily small by scaling down a_i and scaling $\mathbf{\phi}_i$ up by some large constant. To prevent this from happening, we will constrain $\left|\left|\mathbf{\phi}\right|\right|^2$ to be less than some constant C. The full sparse coding cost function including our constraint on $\mathbf{\phi}$ isui

$\begin{array}{rc} \text{minimize}_{a^{(j)}_i,\mathbf{\phi}_{i}} & \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \\ \text{subject to} & \left|\left|\mathbf{\phi}_i\right|\right|^2 \leq C, \forall i = 1,...,k \\ \end{array}$

Probabilistic Interpretation

到目前为止，咱们所考虑的稀疏编码，是为了寻找到一个稀疏的、超完备基向量集，来覆盖咱们的输入数据空间。如今换一种方式，咱们能够从几率的角度出发，将稀疏编码算法看成一种“生成模型”。this

咱们将天然图像建模问题当作是一种线性叠加，叠加元素包括 k 个独立的源特征 $\mathbf{\phi}_i$ 以及加性噪声 ν ：

$\begin{align} \mathbf{x} = \sum_{i=1}^k a_i \mathbf{\phi}_{i} + \nu(\mathbf{x}) \end{align}$

咱们的目标是找到一组特征基向量 $\mathbf{\phi}$ ，它使得图像的分布函数 $P(\mathbf{x}\mid\mathbf{\phi})$ 尽量地近似于输入数据的经验分布函数 $P^*(\mathbf{x})$ 。一种实现方式是，最小化 $P^*(\mathbf{x})$ 与 $P(\mathbf{x}\mid\mathbf{\phi})$ 之间的 KL 散度，此 KL 散度表示以下：

$\begin{align} D(P^*(\mathbf{x})||P(\mathbf{x}\mid\mathbf{\phi})) = \int P^*(\mathbf{x}) \log \left(\frac{P^*(\mathbf{x})}{P(\mathbf{x}\mid\mathbf{\phi})}\right)d\mathbf{x} \end{align}$

由于不管咱们如何选择 $\mathbf{\phi}$ ，经验分布函数 $P^*(\mathbf{x})$ 都是常量，也就是说咱们只须要最大化对数似然函数 $P(\mathbf{x}\mid\mathbf{\phi})$ 。假设 ν 是具备方差σ² 的高斯白噪音，则有下式：

$\begin{align} P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) = \frac{1}{Z} \exp\left(- \frac{(\mathbf{x}-\sum^{k}_{i=1} a_i \mathbf{\phi}_{i})^2}{2\sigma^2}\right) \end{align}$

为了肯定分布 $P(\mathbf{x}\mid\mathbf{\phi})$ ，咱们须要指定先验分布 $P(\mathbf{a})$ 。假定咱们的特征变量是独立的，咱们就能够将先验几率分解为：

$\begin{align} P(\mathbf{a}) = \prod_{i=1}^{k} P(a_i) \end{align}$

此时，咱们将“稀疏”假设加入进来——假设任何一幅图像都是由相对较少的一些源特征组合起来的。所以，咱们但愿 a_i 的几率分布在零值附近是凸起的，并且峰值很高。一个方便的参数化先验分布就是：

$\begin{align} P(a_i) = \frac{1}{Z}\exp(-\beta S(a_i)) \end{align}$

这里 S(a_i) 是决定先验分布的形状的函数。

当定义了 $P(\mathbf{x} \mid \mathbf{a} , \mathbf{\phi})$ 和 $P(\mathbf{a})$ 后，咱们就能够写出在由 $\mathbf{\phi}$ 定义的模型之下的数据 $\mathbf{x}$ 的几率分布：

$\begin{align} P(\mathbf{x} \mid \mathbf{\phi}) = \int P(\mathbf{x} \mid \mathbf{a}, \mathbf{\phi}) P(\mathbf{a}) d\mathbf{a} \end{align}$

那么，咱们的问题就简化为寻找：

$\begin{align} \mathbf{\phi}^*=\text{argmax}_{\mathbf{\phi}} < \log(P(\mathbf{x} \mid \mathbf{\phi})) > \end{align}$

这里 < . > 表示的是输入数据的指望值。

不幸的是，经过对 $\mathbf{a}$ 的积分计算 $P(\mathbf{x} \mid \mathbf{\phi})$ 一般是难以实现的。虽然如此，咱们注意到若是 $P(\mathbf{x} \mid \mathbf{\phi})$ 的分布（对于相应的 $\mathbf{a}$ ）足够陡峭的话，咱们就能够用 $P(\mathbf{x} \mid \mathbf{\phi})$ 的最大值来估算以上积分。估算方法以下：

$\begin{align} \mathbf{\phi}^{*'}=\text{argmax}_{\mathbf{\phi}} < \max_{\mathbf{a}} \log(P(\mathbf{x} \mid \mathbf{\phi})) > \end{align}$

跟以前同样，咱们能够经过减少 a_i 或增大 $\mathbf{\phi}$ 来增长几率的估算值（由于 P(a_i) 在零值附近陡升）。所以咱们要对特征向量 $\mathbf{\phi}$ 加一个限制以防止这种状况发生。

最后，咱们能够定义一种线性生成模型的能量函数，从而将原先的代价函数从新表述为：

$\begin{array}{rl} E\left( \mathbf{x} , \mathbf{a} \mid \mathbf{\phi} \right) & := -\log \left( P(\mathbf{x}\mid \mathbf{\phi},\mathbf{a}\right)P(\mathbf{a})) \\ &= \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \end{array}$

其中 λ = 2σ²β ，而且关系不大的常量已被隐藏起来。由于最大化对数似然函数等同于最小化能量函数，咱们就能够将原先的优化问题从新表述为：

$\begin{align} \mathbf{\phi}^{*},\mathbf{a}^{*}=\text{argmin}_{\mathbf{\phi},\mathbf{a}} \sum_{j=1}^{m} \left|\left| \mathbf{x}^{(j)} - \sum_{i=1}^k a^{(j)}_i \mathbf{\phi}_{i}\right|\right|^{2} + \lambda \sum_{i=1}^{k}S(a^{(j)}_i) \end{align}$

使用几率理论来分析，咱们能够发现，选择 L₁ 惩罚和 $\log(1+a_i^2)$ 惩罚做为函数 S(.) ，分别对应于使用了拉普拉斯几率 $P(a_i) \propto \exp\left(-\beta|a_i|\right)$ 和柯西先验几率 $P(a_i) \propto \frac{\beta}{1+a_i^2}$ 。

Learning

使用稀疏编码算法学习基向量集的方法，是由两个独立的优化过程组合起来的。第一个是逐个使用训练样本 $\mathbf{x}$ 来优化系数 a_i ，第二个是一次性处理多个样本对基向量 $\mathbf{\phi}$ 进行优化。

若是使用 L₁ 范式做为稀疏惩罚函数，对 $a^{(j)}_i$ 的学习过程就简化为求解由 L₁ 范式正则化的最小二乘法问题，这个问题函数在域 $a^{(j)}_i$ 内为凸，已经有不少技术方法来解决这个问题（诸如CVX之类的凸优化软件能够用来解决L1正则化的最小二乘法问题）。若是 S(.) 是可微的，好比是对数惩罚函数，则能够采用基于梯度算法的方法，如共轭梯度法。

用 L₂ 范式约束来学习基向量，一样能够简化为一个带有二次约束的最小二乘问题，其问题函数在域 $\mathbf{\phi}$ 内也为凸。标准的凸优化软件（如CVX）或其它迭代方法就能够用来求解 $\mathbf{\phi}$ ，虽然已经有了更有效的方法，好比求解拉格朗日对偶函数（Lagrange dual）。

根据前面的的描述，稀疏编码是有一个明显的局限性的，这就是即便已经学习获得一组基向量，若是为了对新的数据样本进行“编码”，咱们必须再次执行优化过程来获得所需的系数。这个显著的“实时”消耗意味着，即便是在测试中,实现稀疏编码也须要高昂的计算成本，尤为是与典型的前馈结构算法相比。