Advanced Feature Engineering II

Date: 2019-12-07

Tags: advanced feature engineering


These are notes from the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.

Statistics and distance based features

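This part focuses on advanced feature engineering: computing various statistics of one feature grouped by another, and features obtained by analyzing the neighborhood of a given point.

groupby and nearest neighbor methods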

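Example: here is some data from a CTR task

We can hypothesize that the ad with the lowest price on a page will attract most of the attention, and that the other ads on the page will not be very attractive. Features related to this hypothesis are easy to compute. For every ad, we can add the lowest and highest price for each user and page. The position of the lowest-priced ad can also be used in this case.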

Code implementation
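A minimal sketch of these group statistics in pandas. The column names (`user_id`, `page_id`, `ad_price`, `ad_position`) and the toy data are assumptions made for illustration, not the course's dataset.

```python
import pandas as pd

# Toy CTR log; all column names here are hypothetical.
df = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2],
    "page_id":     [10, 10, 10, 20, 20],
    "ad_price":    [5.0, 3.0, 8.0, 4.0, 6.0],
    "ad_position": [0, 1, 2, 0, 1],
})

keys = ["user_id", "page_id"]
grouped = df.groupby(keys)["ad_price"]

# Lowest and highest price within each (user, page) group, broadcast to every row.
df["min_price"] = grouped.transform("min")
df["max_price"] = grouped.transform("max")

# Position of the cheapest ad in each group, merged back onto every row.
cheapest = df.loc[grouped.idxmin().to_numpy(), keys + ["ad_position"]]
cheapest = cheapest.rename(columns={"ad_position": "min_price_position"})
df = df.merge(cheapest, on=keys, how="left")

print(df)
```

`transform` keeps the original row count, so the new columns can be used directly as features.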

More features
How many pages user visited
Standard deviation of prices
Most visited page
Many, many more

What if there is no feature that can be used with groupby like this? In that case we can use nearest neighbors.

Neighbors

Explicit group is not needed
More flexible
Much harder to implement

Examples

Number of houses in 500m, 1000m,..
Average price per square meter in 500m, 1000m,..
Number of schools/supermarkets/parking lots in 500m, 1000m,..
Distance to closest subway station
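The neighborhood features listed above could be computed roughly as follows with scikit-learn's `NearestNeighbors`. The coordinates (treated as planar meters), prices, and station locations are made-up values for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical house coordinates in meters and price per square meter.
houses = np.array([[0.0, 0.0], [300.0, 0.0], [0.0, 350.0],
                   [900.0, 0.0], [3000.0, 3000.0]])
price_per_m2 = np.array([100.0, 120.0, 80.0, 150.0, 60.0])
stations = np.array([[100.0, 0.0], [2900.0, 3000.0]])

house_index = NearestNeighbors().fit(houses)

features = {}
for radius in (500.0, 1000.0):
    # Indices of all houses within `radius` of each house (the house itself included).
    neighbors = house_index.radius_neighbors(houses, radius=radius,
                                             return_distance=False)
    counts, mean_prices = [], []
    for i, idx in enumerate(neighbors):
        idx = idx[idx != i]  # drop the house itself
        counts.append(len(idx))
        mean_prices.append(price_per_m2[idx].mean() if len(idx) else np.nan)
    features[f"n_houses_{int(radius)}m"] = np.array(counts)
    features[f"mean_price_{int(radius)}m"] = np.array(mean_prices)

# Distance from each house to the closest subway station.
station_index = NearestNeighbors(n_neighbors=1).fit(stations)
dist, _ = station_index.kneighbors(houses)
features["dist_to_station"] = dist[:, 0]
```

Houses with no neighbor inside the radius get `NaN` for the mean price, which most tree-based libraries can handle or which can be imputed.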

The instructor used this approach in the Springleaf competition.

KNN features in springleaf

Mean encode all the variables
For every point, find 2000 nearest neighbors using Bray-Curtis metric
$$\frac{\sum{|u_i - v_i|}}{\sum{|u_i + v_i|}}$$
Calculate various features from those 2000 neighbors

Evaluate

Mean target of nearest 5, 10, 15, 500, 2000 neighbors
Mean distance to 10 closest neighbors
Mean distance to 10 closest neighbors with target 1
Mean distance to 10 closest neighbors with target 0
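A scaled-down sketch of such KNN features, assuming random data in place of the mean-encoded Springleaf variables and much smaller neighbor counts. The helper `knn_features` is a hypothetical name; `metric="braycurtis"` is the Bray-Curtis distance as provided through scikit-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-in for mean-encoded features (non-negative) with a binary target.
X_train = rng.random((200, 5))
y_train = rng.integers(0, 2, size=200)
X_query = rng.random((20, 5))

def knn_features(X_fit, y_fit, X_new, ks=(5, 10, 15), k_dist=10):
    """Mean neighbor target for several k, plus mean distances to the closest neighbors."""
    k_max = max(max(ks), k_dist)
    nn_all = NearestNeighbors(n_neighbors=k_max, metric="braycurtis").fit(X_fit)
    dist, idx = nn_all.kneighbors(X_new)  # sorted by increasing distance

    feats = [y_fit[idx[:, :k]].mean(axis=1) for k in ks]
    feats.append(dist[:, :k_dist].mean(axis=1))

    # Mean distance to the closest neighbors restricted to one class at a time.
    for cls in (0, 1):
        nn_cls = NearestNeighbors(n_neighbors=k_dist, metric="braycurtis")
        nn_cls.fit(X_fit[y_fit == cls])
        dist_cls, _ = nn_cls.kneighbors(X_new)
        feats.append(dist_cls.mean(axis=1))
    return np.column_stack(feats)

F = knn_features(X_train, y_train, X_query)
print(F.shape)
```

When the same features are needed for the training rows themselves, the neighbors should be searched out-of-fold (or at least excluding the point itself), otherwise the mean-target columns leak the label.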

Matrix factorizations for feature extraction

Example of feature fusion

Notes about Matrix Factorization

Can be applied to only some of the columns
Can provide additional diversity
Good for ensembles
It is a lossy transformation. Its efficiency depends on:
Particular task
Number of latent factors
- Usually 5-100

Implementation

Several MF methods are available in sklearn
SVD and PCA
Standard tools for Matrix Factorization
TruncatedSVD
Works with sparse matrices
Non-negative Matrix Factorization (NMF)
Ensures that all latent factors are non-negative
Good for counts-like data
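As a rough illustration of the two sklearn tools above, the sketch below factorizes a synthetic sparse count matrix; the data and the choice of 5 latent factors are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic count-like data stored as a sparse matrix (e.g. bag-of-words counts).
counts = csr_matrix(rng.poisson(1.0, size=(100, 30)).astype(float))

# TruncatedSVD works directly on sparse input and does not center the data.
svd = TruncatedSVD(n_components=5, random_state=0)
svd_feats = svd.fit_transform(counts)

# NMF keeps every latent factor non-negative.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
nmf_feats = nmf.fit_transform(counts)

print(svd_feats.shape, nmf_feats.shape)
```

The resulting dense columns can be appended to the original features or used on their own as a compact representation.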

NMF for tree-based methods

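Non-negative matrix factorization (NMF) transforms the data in a way that makes it more suitable for decision trees.

As can be seen, NMF transforms the data so that it forms lines parallel to the axes.

Factorization

The same tricks that are used with linear models can be applied when factorizing matrices.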
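One such trick, assumed here as an example because the notes do not name a specific one, is to apply a `log(x + 1)` transform to count data before factorizing it. The data is synthetic.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(50, 8)).astype(float)  # synthetic count data

# Factorize the raw counts.
nmf_raw = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
feats_raw = nmf_raw.fit_transform(X)

# Factorize the log-transformed counts; log1p keeps the values non-negative.
nmf_log = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
feats_log = nmf_log.fit_transform(np.log1p(X))
```

The two sets of latent factors generally differ, so both can be tried as features and compared on validation data.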

Conclusion

Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
It can be applied to transform categorical features into real-valued ones
Many of the tricks suitable for linear models can be useful for MF

Feature interactions

All combinations of feature values

Example: banner selection

Suppose we are building a model that predicts the best ad banner to display on a website.

...	category_ad	category_site	...	is_clicked
...	auto_part	game_news	...	0
...	music_tickets	music_news	..	1
...	mobile_phones	auto_blog	...	0

Combining the category of the ad banner itself with the category of the site where the banner will be shown yields a very strong feature.

...	ad_site	...	is_clicked
...	auto_part \| game_news	...	0
...	music_tickets \| music_news	..	1
...	mobile_phones \| auto_blog	...	0

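We build the combined feature ad_site from these two features.

From a technical point of view, there are two ways to build such an interaction.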

Example of interactions

Method 1

Method 2
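A sketch of two common ways to build such an interaction in pandas, which may or may not match the two methods meant here: (A) concatenate the category strings and one-hot encode the result, or (B) one-hot encode each feature and multiply the indicator columns pairwise. The data mirrors the banner example.

```python
import pandas as pd

df = pd.DataFrame({
    "category_ad":   ["auto_part", "music_tickets", "mobile_phones"],
    "category_site": ["game_news", "music_news", "auto_blog"],
})

# Option A: concatenate the two categories, then one-hot encode the new feature.
df["ad_site"] = df["category_ad"] + " | " + df["category_site"]
onehot_a = pd.get_dummies(df["ad_site"])

# Option B: one-hot encode each feature, then multiply every pair of columns.
ad = pd.get_dummies(df["category_ad"]).astype(int)
site = pd.get_dummies(df["category_site"]).astype(int)
onehot_b = pd.DataFrame({
    f"{a} | {s}": ad[a] * site[s] for a in ad.columns for s in site.columns
})

print(onehot_a.shape, onehot_b.shape)
```

Option A only creates columns for combinations that occur in the data, while option B creates a column for every possible pair, most of which may be all zeros.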

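A similar idea can also be applied to numeric variables.

In fact, this is not limited to multiplication; other operations can be used as well: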

Multiplication
Sum
Diff
Division
..
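For two numeric columns, the operations listed above can be sketched as follows; the column names `f1` and `f2` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"f1": [1.0, 2.0, 4.0], "f2": [2.0, 0.0, 8.0]})

df["f1_mul_f2"] = df["f1"] * df["f2"]
df["f1_sum_f2"] = df["f1"] + df["f2"]
df["f1_diff_f2"] = df["f1"] - df["f2"]
# Division by zero yields inf in pandas; replace it with NaN so it is treated as missing.
df["f1_div_f2"] = (df["f1"] / df["f2"]).replace([np.inf, -np.inf], np.nan)

print(df)
```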

Practical Notes

We have a lot of possible interactions: N*N for N features.
a. Even more if several types of interaction are used
Need to reduce their number
a. Dimensionality reduction
b. Feature selection

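This approach generates a large number of features, which can be reduced with feature selection or dimensionality reduction. The course illustrates this with feature selection.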
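One way this selection could look, assuming pairwise products as the interactions and random forest importances as the selection criterion; the synthetic data is built so that only `f1*f2` matters.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
# Synthetic target that depends only on the sign of f1 * f2.
y = ((X["f1"] * X["f2"]) > 0).astype(int)

# Generate all pairwise products as candidate interaction features.
inter = pd.DataFrame({
    f"{a}*{b}": X[a] * X[b] for a, b in combinations(X.columns, 2)
})

# Fit a forest on the candidates and rank them by importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(inter, y)
importance = pd.Series(forest.feature_importances_, index=inter.columns)
importance = importance.sort_values(ascending=False)

selected = importance.head(3).index.tolist()  # keep only the top few
print(importance)
```

The cutoff of three features is arbitrary; in practice the number kept would be tuned on validation data.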

Interactions' order

We looked at 2nd order interactions.
Such an approach can be generalized to higher orders.
It is hard to do generation and selection automatically.
Manual building of high-order interactions is some kind of art.

Extract features from DT

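Consider a decision tree. Let us map each leaf to a binary feature. The index of the leaf an object falls into can be used as the value of a new categorical feature. If we use an ensemble of trees instead of a single tree, for example a random forest, this operation can be applied to each of its trees. This is a powerful way to extract high-order interactions.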

How to use it

In sklearn:

tree_model.apply()

In xgboost:

booster.predict(pred_leaf=True)
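A minimal sklearn sketch of this idea, assuming a small random forest on synthetic data: `apply` returns the leaf index per tree, and each tree's leaf index is then one-hot encoded as a categorical feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # synthetic target

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# One leaf index per sample and per tree: shape (n_samples, n_estimators).
leaves = forest.apply(X)

# Treat each tree's leaf index as a categorical feature and one-hot encode it.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)  # sparse matrix

print(leaves.shape, leaf_features.shape)
```

The sparse leaf indicators can then be fed to a linear model, which is one common way to combine trees with linear methods.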

Conclusion

We looked at ways to build an interaction of categorical attributes
Extended this approach to real-valued features
Learned how to extract features via decision trees

t-SNE

t-SNE is used for exploratory data analysis. It can also be viewed as a method for obtaining features from data.

Practical Notes

Result heavily depends on hyperparameters (perplexity)
Good practice is to use several projections with different perplexities (5-100)
Due to its stochastic nature, tSNE gives different projections even for the same data and hyperparameters
Train and test should be projected together
tSNE runs for a long time with a large number of features
It is common to do dimensionality reduction before projection
An implementation of tSNE can be found in the sklearn library
But personally I prefer the stand-alone Python package tsne because it is faster
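The notes above can be combined into a short sketch using the sklearn implementation (not the stand-alone `tsne` package). The data is random, and the PCA size and perplexity values are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 20))
X_test = rng.normal(size=(30, 20))

# Project train and test together so both live in the same embedding.
X_all = np.vstack([X_train, X_test])

# Reduce dimensionality first to speed up t-SNE.
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X_all)

# Several projections with different perplexities, concatenated as features.
projections = []
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    projections.append(tsne.fit_transform(X_reduced))

tsne_all = np.hstack(projections)
tsne_train = tsne_all[:len(X_train)]
tsne_test = tsne_all[len(X_train):]
print(tsne_train.shape, tsne_test.shape)
```

Fixing `random_state` makes a single run repeatable, but a different seed or perplexity will still give a different projection.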

Conclusion

tSNE is a great tool for visualization
It can be used as a feature as well
Be careful with interpretation of results
Try different perplexities
