Optimization Algorithms and Feature Scaling

Feature Scaling
Purpose
Because the values in raw data vary over very different ranges, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers compute the distance between two points as a Euclidean distance. If one feature has a wide range of values, the distance will be dominated by that particular feature. The ranges of all features should therefore be normalized so that each feature contributes roughly proportionately to the final distance.
Another reason for applying feature scaling is that gradient descent converges much faster with feature scaling than without it.
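To illustrate both points, here is a minimal sketch of the two most common rescalings, min-max normalization and z-score standardization. The toy matrix `X` and the helper names are invented for the example:

```python
import numpy as np

# Toy design matrix: rows are samples, columns are features.
# The second feature has a much wider range and would dominate
# Euclidean distances if left unscaled.
X = np.array([[1.0, 200.0],
              [2.0, 800.0],
              [3.0, 500.0]])

def min_max_scale(X):
    """Rescale each feature (column) to the [0, 1] range."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

print(min_max_scale(X))   # all features now lie in [0, 1]
print(standardize(X))     # all features now have comparable spread
```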
In statistics and applications of statistics, normalization can have a range of meanings.[1] In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment. In the case of normalization of scores in educational assessment, there may be an intention to align distributions to a normal distribution. A different approach to normalization of probability distributions is quantile normalization, where the quantiles of the different measures are brought into alignment.

In another usage in statistics, normalization refers to the creation of shifted and scaled versions of statistics, where the intention is that these normalized values allow the comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences, as in an anomaly time series. Some types of normalization involve only a rescaling, to arrive at values relative to some size variable. In terms of levels of measurement, such ratios only make sense for ratio measurements (where ratios of measurements are meaningful), not interval measurements (where only distances are meaningful, but not ratios).

In theoretical statistics, parametric normalization can often lead to pivotal quantities – functions whose sampling distribution does not depend on the parameters – and to ancillary statistics – pivotal quantities that can be computed from observations, without knowing parameters.
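As a concrete sketch of the quantile normalization mentioned above: every column of a matrix is forced onto a shared reference distribution, taken here as the mean of the sorted columns. The rank-based implementation below is an illustrative plain-NumPy version (ties are broken arbitrarily by `argsort`), not a canonical library routine:

```python
import numpy as np

def quantile_normalize(X):
    """Force every column of X onto the same empirical distribution.

    The reference distribution is the mean of the sorted columns;
    each value is then replaced by the reference value at its
    within-column rank, so all columns share identical quantiles.
    """
    sorted_cols = np.sort(X, axis=0)           # sort each column
    reference = sorted_cols.mean(axis=1)       # row means = target quantiles
    ranks = X.argsort(axis=0).argsort(axis=0)  # rank of each entry in its column
    return reference[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
print(quantile_normalize(X))  # each column now has the same sorted values
```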

  • AdaGrad
The learning rate is divided, per parameter, by the square root of the accumulated sum of squared historical gradients, so each parameter ends up with its own learning rate.

1. In short, once a global learning rate is set, each step divides it, per parameter, by the square root of that parameter's accumulated sum of squared historical gradients, so every parameter's effective learning rate is different (see the sketch after this list).

2. The effect: progress is faster along flatter directions of the parameter space (because the direction is flat, its accumulated sum of squared gradients is small, so its learning rate is decayed less).

3. The drawback: the learning rate can shrink too early and too aggressively.

4. It works well on some models.
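A minimal sketch of the update the list describes: the accumulator G sums squared gradients, G ← G + g², and each parameter steps by θ ← θ − η·g / (√G + ε). The hyperparameter values (`lr`, `eps`) are illustrative, not prescribed by the text:

```python
import numpy as np

def adagrad_update(theta, grad, G, lr=0.01, eps=1e-8):
    """One AdaGrad step.

    G accumulates squared gradients seen so far, so the effective
    learning rate lr / (sqrt(G) + eps) shrinks independently for
    each parameter: flat directions (small accumulated G) keep
    larger steps, exactly as point 2 above describes.
    """
    G += grad ** 2
    theta -= lr * grad / (np.sqrt(G) + eps)
    return theta, G

# Illustrative usage on a quadratic bowl f(x) = 0.5 * x @ A @ x
A = np.diag([10.0, 0.1])       # one steep and one flat direction
theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)
for _ in range(100):
    grad = A @ theta           # gradient of the quadratic
    theta, G = adagrad_update(theta, grad, G)
print(theta)                   # both coordinates shrink toward 0
```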
Karpathy compared the performance of these methods on MNIST, and concluded:
adagrad is more stable than sgd and momentum, in the sense that it needs little tuning. Well-tuned sgd and momentum-family methods, however, beat adagrad in both convergence speed and precision; with careful tuning, Nesterov generally outperforms momentum, which outperforms sgd. adagrad, for its part, needs little tuning and delivers consistently stable performance.

[Figure: performance comparison of the optimizers on MNIST]
