I forget which experts' posts I originally collected this material from.
One-hot encoding, also called one-bit-effective encoding, uses an N-bit state register to encode N states: each state gets its own register bit, and at any given time exactly one bit is set.
For example:
Natural (binary) state codes: 000, 001, 010, 011, 100, 101
One-hot codes: 000001, 000010, 000100, 001000, 010000, 100000
For a feature with m possible values, one-hot encoding turns it into m binary features (e.g. a grade feature whose possible values are good / medium / poor becomes 100, 010, 001). These features are mutually exclusive and only one is active at a time, so the encoded data becomes very sparse.
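As a minimal hand-rolled sketch of the grade example just mentioned (rendered here as good / medium / poor; the dictionary-based code is only an illustration, not taken from the original post):

grades = ['good', 'medium', 'poor']   # all m = 3 possible values of the feature
# build one binary indicator per possible value
one_hot = {v: [1 if i == j else 0 for j in range(len(grades))]
           for i, v in enumerate(grades)}

print(one_hot['good'])    # [1, 0, 0]
print(one_hot['medium'])  # [0, 1, 0]
print(one_hot['poor'])    # [0, 0, 1]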
The benefits of doing this are discussed in the Q&A section further below.
Example: a Scikit-learn example
from sklearn.preprocessing import OneHotEncoder

def main():
    print('program start')
    enc = OneHotEncoder()
    # fit on four samples, each with three categorical features
    enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    # transform() returns a sparse matrix, hence toarray()
    array = enc.transform([[0, 1, 3]]).toarray()
    print(array)  # [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
    print('program end')

if __name__ == '__main__':
    main()
Note: fit() saw 4 samples with 3 features each, and transform() was applied to 1 sample with 3 features. The first feature has two values (0: 10, 1: 01), the second has three (0: 100, 1: 010, 2: 001), and the third has four (0: 1000, 1: 0100, 2: 0010, 3: 0001). Therefore [0, 1, 3] is transformed into [1, 0, 0, 1, 0, 0, 0, 0, 1].
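To double-check which categories the encoder learned for each feature, a small sketch like the following works (assuming a reasonably recent scikit-learn, >= 0.20, where the fitted encoder exposes categories_):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
print(enc.categories_)
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
# feature 1: 2 values -> 2 columns, feature 2: 3 values -> 3 columns,
# feature 3: 4 values -> 4 columns, 9 columns in total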
In real machine-learning tasks, many feature values are not continuous; for example, gender takes the values 'male' and 'female'. Such features usually need to be digitized first,
as in the following example:
Gender: ['male', 'female']
Region: ['Europe', 'US', 'Asia']
Browser: ['Firefox', 'Chrome', 'Safari', 'Internet Explorer']
For a sample such as ['male', 'US', 'Internet Explorer'], the most direct way to digitize these categorical values is a "serialized" (ordinal) encoding: [0, 1, 3]; likewise ['female', 'Asia', 'Chrome'] becomes [1, 2, 1]. However, features encoded this way cannot be fed directly into a machine-learning algorithm,
because classifiers assume by default that data is continuous and ordered, whereas under this representation the numbers carry no order at all; they are assigned arbitrarily.
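For reference, a minimal sketch of this "serialized" (ordinal) encoding using scikit-learn's OrdinalEncoder; the explicit category order below is an assumption chosen so that the output reproduces [0, 1, 3] and [1, 2, 1]:

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(categories=[
    ['male', 'female'],
    ['Europe', 'US', 'Asia'],
    ['Firefox', 'Chrome', 'Safari', 'Internet Explorer'],
])
X = [['male', 'US', 'Internet Explorer'],
     ['female', 'Asia', 'Chrome']]
print(enc.fit_transform(X))
# [[0. 1. 3.]
#  [1. 2. 1.]]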
(Why must the data be continuous and ordered? See the Q&A below.)
The one-hot encoding approach
For the problem above, gender is 2-dimensional, region is 3-dimensional, and likewise browser is 4-dimensional. Encoding the sample ['male', 'US', 'Internet Explorer'] with one-hot gives 'male' → [1, 0], 'US' → [0, 1, 0], 'Internet Explorer' → [0, 0, 0, 1], so the complete digitized feature vector is [1, 0, 0, 1, 0, 0, 0, 0, 1]. Note ⚠️ that a consequence of this encoding is that the data becomes very sparse.
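The same encoding can be sketched with scikit-learn's OneHotEncoder (>= 0.20, which accepts string features); passing categories explicitly is an assumption made here to pin the column order to the listing order used above:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(categories=[
    ['male', 'female'],
    ['Europe', 'US', 'Asia'],
    ['Firefox', 'Chrome', 'Safari', 'Internet Explorer'],
])
print(enc.fit_transform([['male', 'US', 'Internet Explorer']]).toarray())
# [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]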
1. Why do we binarize categorical features?
We binarize the categorical input so that it can be thought of as a vector in Euclidean space (we call this embedding the vector in the Euclidean space). In other words, one-hot encoding maps the values of a discrete feature into Euclidean space, with each value of the feature corresponding to a point in that space.
2. Why do we embed the feature vectors in the Euclidean space?
Because many algorithms for classification/regression/clustering etc. require computing distances between features or similarities between features, and many definitions of distance and similarity are defined over features in Euclidean space. So we would like our features to lie in the Euclidean space as well. That is, discrete features are mapped into Euclidean space via one-hot encoding because distance and similarity computations are central to regression, classification, clustering and other machine-learning algorithms, and the commonly used distance and similarity measures (cosine similarity included) are defined in Euclidean space.
3. Why does embedding the feature vector in Euclidean space require us to binarize categorical features?
Let us take an example of a dataset with just one feature (say job_type as per your example) and let us say it takes three values 1,2,3.
Now, let us take three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What is the Euclidean distance between x_1 and x_2, x_2 and x_3, and x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This says the distance between job type 1 and job type 2 is smaller than between job type 1 and job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? In many cases of categorical features, we cannot properly define a distance between the different values the feature takes. In such cases, isn't it fair to assume that all categorical values are equally far away from each other?
Now, let us see what happens when we binarize the same feature vectors. Then x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). Now, what are the distances between them? They are all sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical feature are equally far away from each other.
In short, one-hot encoding a discrete feature does make distance computation more sensible. Take a discrete feature representing job type with three possible values. Without one-hot encoding the representations are x_1 = (1), x_2 = (2), x_3 = (3), and the pairwise distances are d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that really mean jobs x_1 and x_3 are less similar than x_1 and x_2? Clearly the distances computed from this representation are not meaningful. With one-hot encoding we get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2): every pair of jobs is equally far apart, which is more reasonable.
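A small NumPy sketch that reproduces the distance comparison above (the pairwise() helper is ad hoc, not from the original answer):

import numpy as np

raw = np.array([[1.0], [2.0], [3.0]])   # x_1 = (1), x_2 = (2), x_3 = (3)
onehot = np.eye(3)                      # x_1 = (1,0,0), x_2 = (0,1,0), x_3 = (0,0,1)

def pairwise(X):
    # Euclidean distance between every pair of rows
    return {(i + 1, j + 1): float(np.linalg.norm(X[i] - X[j]))
            for i in range(len(X)) for j in range(i + 1, len(X))}

print(pairwise(raw))     # {(1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.0}
print(pairwise(onehot))  # every pair is sqrt(2) ≈ 1.414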
4. About the original question
Note that our reason for binarizing categorical features is independent of the number of values the categorical feature takes, so yes, even if the categorical feature takes 1000 values, we would still prefer to binarize.
5. Are there cases when we can avoid binarization? (i.e., cases where one-hot encoding is unnecessary)
Yes. As we figured out earlier, the reason we binarize is because we want some meaningful distance relationship between the different values. As long as there is some meaningful distance relationship, we can avoid binarizing the categorical feature. For example, if you are building a classifier to classify a webpage as important entity page (a page important to a particular entity) or not and let us say that you have the rank of the webpage in the search result for that entity as a feature, then 1] note that the rank feature is categorical, 2] rank 1 and rank 2 are clearly closer to each other than rank 1 and rank 3, so the rank feature defines a meaningful distance relationship and so, in this case, we don't have to binarize the categorical rank feature.
More generally, if you can cluster the categorical values into disjoint subsets such that the subsets have a meaningful distance relationship amongst them, then you don't have to binarize fully; instead you can split only over these clusters. For example, if there is a categorical feature with 1000 values, but you can split these 1000 values into 2 groups of 400 and 600 (say), and within each group the values have a meaningful distance relationship, then instead of fully binarizing you can just add 2 features, one for each cluster, and that should be fine.
In short: the point of one-hot encoding a discrete feature is to make distance computation reasonable, but if the feature is discrete and distances can already be computed sensibly without one-hot encoding, then the encoding is unnecessary. For example, if a discrete feature has 1000 values that can be split into two groups of 400 and 600, and distance is well defined both between the two groups and within each group, there is no need for one-hot encoding.
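One possible reading of the "two features, one per cluster" idea is sketched below; the group boundary at 400 and the choice to use the within-group position as the feature value are assumptions made purely for illustration:

def two_group_features(value, split=400):
    # group A: values 0..split-1, group B: values split..999
    in_a = value < split
    feat_a = value if in_a else 0                 # within-group position in A, 0 otherwise
    feat_b = (value - split) if not in_a else 0   # within-group position in B, 0 otherwise
    # a fuller version might also add explicit group-membership indicators
    return [feat_a, feat_b]

print(two_group_features(10))   # [10, 0]  -> group A, position 10
print(two_group_features(650))  # [0, 250] -> group B, position 250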
After one-hot encoding a discrete feature, each dimension of the encoded feature can be regarded as a continuous feature, so it can be normalized exactly like continuous features, e.g. scaled to [-1, 1] or standardized to zero mean and unit variance.
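A short sketch of normalizing one-hot columns like any other continuous feature, using scikit-learn's MinMaxScaler and StandardScaler (the toy matrix is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # four samples of a one-hot-encoded feature

print(MinMaxScaler(feature_range=(-1, 1)).fit_transform(X))  # each column scaled to [-1, 1]
print(StandardScaler().fit_transform(X))                     # each column to zero mean, unit variance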
In some cases feature normalization is not needed:
It depends on your ML algorithm. Some methods require almost no effort to normalize features, or handle both continuous and discrete features natively, such as tree-based methods: C4.5, CART, random forest, bagging or boosting. But most parametric models (generalized linear models, neural networks, SVMs, etc.) or methods using distance metrics (KNN, kernels, etc.) require careful work to achieve good results. Standard approaches include binarizing all features, standardizing all continuous features to zero mean and unit variance, etc.
In short: tree-based methods (random forest, bagging, boosting, etc.) do not require feature normalization, while parametric models and distance-based models do.
First, recall that one-hot encoding uses an N-bit state register to encode N states.
e.g. high / medium / low are not separable as a single value → after being encoded with three bits (100, 010, 001) they become separable and act as mutually independent events.
→ This is similar to an SVM, where features that are not linearly separable become separable after being projected into a higher-dimensional space. However, GBDT does not handle high-dimensional sparse matrices well, and even on low-dimensional sparse matrices it is not necessarily better than an SVM.
For a decision tree, one-hot encoding essentially increases the depth of the tree.
A tree model dynamically generates something like a "one-hot + feature crossing" mechanism during training:
1. One or more features are ultimately mapped to a leaf node, which serves as the encoding; one-hot encoding can be viewed as (in the example above) three independent events.
2. A decision tree has no notion of the magnitude of a feature, only of which part of the feature's distribution a value falls in.
One-hot encoding can help with linear separability, but for tree models it is not necessarily better than label encoding.
A drawback of one-hot after dimensionality reduction:
features that could be crossed before the reduction may no longer be crossable afterwards.
Training process of a tree model:
the number of nodes on the path from the root to a leaf corresponds to the number of times features are crossed, so a tree model performs feature crossing on its own.
e.g. "Is it long?" { no → ("Is it yellow?" yes → pomelo, no → apple), yes → banana }. The "round" branch crossed with "yellow" corresponds to shape (round, long) × color (yellow, red), i.e. a one-hot sample of degree 4.
Using the tree model's leaf nodes as the result of feature crossing avoids unnecessary crossing operations, or shrinks the candidate set of dimensions and degrees.
e.g. degree-2 crossing → a feature vector of dimension 8, whereas the tree → just 3 leaf nodes.
A tree model consumes far less computation and fewer resources than "one-hot + high-degree Cartesian products + lasso".
This is why a linear model can be stacked on top of a tree model:
an n*m input matrix → after training, the decision tree tells which leaf each sample falls into → output the leaf index → an n*1 matrix → one-hot encode it → an n*o matrix (o is the number of leaf nodes) → train a linear model on it (see the sketch below).
Typical usage: GBDT + RF.
Advantage: saves the time and space needed for explicit feature crossing.
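Below is a hedged sketch of the pipeline just described, using GBDT leaf indices, one-hot encoding, and logistic regression as the stacked linear model; the dataset and hyperparameters are made up, and gbdt.apply() is used to obtain the leaf index of each sample in each tree:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]                    # leaf index of each sample in each of the 10 trees
X_leaves = OneHotEncoder().fit_transform(leaves)   # n * o sparse matrix, o = total number of leaves

lr = LogisticRegression(max_iter=1000).fit(X_leaves, y)   # linear model on the leaf encoding
print(X_leaves.shape, lr.score(X_leaves, y))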
If a model is trained only on the one-hot features themselves, the features are treated as independent of one another.
One way to understand existing models is as G(l(tensor)), where:
l(·) is the model used at each node, and
G(·) is the topology connecting the nodes.
Neural network: l(·) is a logistic regression model,
G(·) is full connectivity.
Decision tree: l(·) is LR,
G(·) is a tree-shaped connectivity.
Possible innovations: let l(·) be NB, SVM, a single-layer NN, etc.,
and let G(·) be other ways of passing information between nodes.