Advanced Feature Engineering II

These are notes from the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.

Statistics and distance based features

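This part focuses on two kinds of advanced feature engineering: statistics of one feature computed over groups defined by another feature, and features derived from analyzing the neighborhood of a given point.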

groupby and nearest neighbor methods

Example: here is some data from a CTR task.

statistic_ctr_data.png

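We can assume that the ad with the lowest price on a page will attract most of the attention, and that the other ads on the page will not be very attractive. Features related to this assumption are easy to compute: for each user and web page pair, we can add the minimum and maximum ad price. The position of the ad with the lowest price can also be used in this case.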

statistic_ctr_data2.png

Code implementation
statistic_ctr_data_code.png
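
Since the original code is only available as an image, here is a minimal pandas sketch of the same idea. The column names (user_id, page_id, ad_price, ad_position) and the toy values are assumptions made for illustration, not the course data.

```python
import pandas as pd

# Hypothetical CTR log; column names and values are illustrative only.
df = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2],
    "page_id":     [10, 10, 10, 11, 11],
    "ad_price":    [5.0, 3.0, 9.0, 4.0, 4.5],
    "ad_position": [1, 2, 3, 1, 2],
})

group = df.groupby(["user_id", "page_id"])["ad_price"]

# Broadcast the group statistics back to every row of the group.
df["min_price"] = group.transform("min")
df["max_price"] = group.transform("max")

# Position of the cheapest ad within each (user, page) group.
cheapest = df.loc[group.idxmin(), ["user_id", "page_id", "ad_position"]]
cheapest = cheapest.rename(columns={"ad_position": "min_price_position"})
df = df.merge(cheapest, on=["user_id", "page_id"], how="left")
```

Other aggregations such as "std" or "count" can be passed to transform in the same way.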

  • More features
    • How many pages user visited
    • Standard deviation of prices
    • Most visited page
    • Many, many more

What if there is no feature that can be used in a groupby like this? In that case we can use nearest neighbors.

Neighbors

  • Explicit group is not needed
  • More flexible
  • Much harder to implement

Examples

  • Number of houses in 500m, 1000m,..
  • Average price per square meter in 500m, 1000m,..
  • Number of schools/supermarkets/parking lots in 500m, 1000m,..
  • Distance to closest subway station
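
The radius-based examples above could be sketched with sklearn's NearestNeighbors as follows. The coordinates are assumed to be already projected to a plane and measured in meters, and both arrays are made-up toy data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical house coordinates (meters) and prices per square meter.
coords = np.array([[0, 0], [300, 0], [0, 350], [900, 0], [5000, 5000]], dtype=float)
price_per_m2 = np.array([100.0, 120.0, 80.0, 150.0, 60.0])

nn = NearestNeighbors().fit(coords)

def radius_features(radius):
    # Indices of all points within `radius` of each house (the house itself included).
    _, idx = nn.radius_neighbors(coords, radius=radius)
    counts = np.empty(len(coords))
    mean_price = np.full(len(coords), np.nan)
    for i, neighbors in enumerate(idx):
        neighbors = neighbors[neighbors != i]  # drop the house itself
        counts[i] = len(neighbors)
        if len(neighbors):
            mean_price[i] = price_per_m2[neighbors].mean()
    return counts, mean_price

count_500, price_500 = radius_features(500)
count_1000, price_1000 = radius_features(1000)
```

Houses with no neighbors inside the radius keep NaN as their mean price, which leaves the choice of imputation to the model or a later step.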

The lecturer used this approach in the Springleaf competition.

KNN features in Springleaf

  • Mean encode all the variables
  • For every point, find 2000 nearest neighbors using Bray-Curtis metric
    $$\frac{\sum{|u_i - v_i|}}{\sum{|u_i + v_i|}}$$
  • Calculate various features from those 2000 neighbors

Evaluate

  • Mean target of nearest 5, 10, 15, 500, 2000 neighbors
  • Mean distance to 10 closest neighbors
  • Mean distance to 10 closest neighbors with target 1
  • Mean distance to 10 closest neighbors with target 0
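
A rough sketch of these KNN features, using sklearn's NearestNeighbors with the Bray-Curtis metric. The data is random, the neighbor counts are scaled down from 2000 to fit the toy size, and the function name knn_features is hypothetical. It also assumes there are no duplicate rows, so each point's nearest neighbor is itself. In a real competition, target-based features like these would typically be computed out of fold to limit leakage.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical data: non-negative features (e.g. after mean encoding) and a binary target.
X = rng.random((200, 5))
y = rng.integers(0, 2, size=200)

def knn_features(X, y, ks=(5, 10, 15), k_dist=10):
    k_max = max(max(ks), k_dist)
    nn = NearestNeighbors(n_neighbors=k_max + 1, metric="braycurtis", algorithm="brute").fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]  # column 0 is the query point itself

    # Mean target of the k nearest neighbors.
    feats = {f"mean_target_{k}": y[idx[:, :k]].mean(axis=1) for k in ks}
    # Mean distance to the k_dist closest neighbors.
    feats[f"mean_dist_{k_dist}"] = dist[:, :k_dist].mean(axis=1)

    # Mean distance to the k_dist closest neighbors of each class.
    for cls in (0, 1):
        sub = NearestNeighbors(n_neighbors=k_dist + 1, metric="braycurtis", algorithm="brute").fit(X[y == cls])
        d, _ = sub.kneighbors(X)
        # Points of class `cls` find themselves at distance 0; skip that column for them.
        own = (y == cls)
        d = np.where(own[:, None], d[:, 1:], d[:, :k_dist])
        feats[f"mean_dist_{k_dist}_target_{cls}"] = d.mean(axis=1)
    return feats

feats = knn_features(X, y)
```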

Matrix factorizations for feature extraction

  • Example of feature fusion
    fusion.png

Notes about Matrix Factorization

  • Can be applied to only some columns
  • Can provide additional diversity
  • Good for ensembles
  • It is a lossy transformation. Its efficiency depends on:
    • Particular task
    • Number of latent factors
      • Usually 5-100

Implementation

  • Several MF methods can be found in sklearn
  • SVD and PCA
    • Standard tools for Matrix Factorization
  • TruncatedSVD
    • Works with sparse matrices
  • Non-negative Matrix Factorization (NMF)
    • Ensures that all latent factors are non-negative
    • Good for counts-like data
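
As an illustration of the sklearn tools listed above, the following sketch extracts 5 latent factors from a made-up count matrix with TruncatedSVD and NMF. The data and the parameter values are assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical count-like data, e.g. a small bag-of-words matrix.
counts = rng.poisson(1.0, size=(100, 30)).astype(float)
X_sparse = sparse.csr_matrix(counts)

# TruncatedSVD accepts sparse input directly.
svd = TruncatedSVD(n_components=5, random_state=0)
X_svd = svd.fit_transform(X_sparse)

# NMF requires non-negative input and yields non-negative latent factors.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
X_nmf = nmf.fit_transform(counts)
```

The resulting columns of X_svd and X_nmf can be appended to the original features or fed to a separate model in an ensemble.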

NMF for tree-based methods

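Non-negative matrix factorization (NMF) transforms the data in a way that makes it more suitable for decision trees.
NMF.png

As the figure shows, NMF transforms the data so that it forms lines parallel to the axes.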

Factorization

Tricks that work for linear models can also be applied when factorizing matrices.
NMF_note.png

Conclusion

  • Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
  • It can be applied for transforming categorical features into real-valued ones
  • Many of the tricks suitable for linear models can be useful for MF

Feature interactions

All combinations of feature values

Suppose we are building a model that predicts the best ad banner to display on a website.

... category_ad category_site ... is_clicked
... auto_part game_news ... 0
... music_tickets music_news .. 1
... mobile_phones auto_blog ... 0

Combining the category of the ad banner itself with the category of the site where the banner will be shown gives a very strong feature.

... ad_site ... is_clicked
... auto_part | game_news ... 0
... music_tickets | music_news .. 1
... mobile_phones | auto_blog ... 0

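The combined feature ad_site is built from these two features.

From a technical point of view, there are two ways to build such an interaction.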

  • Example of interactions

Method 1
interaction1.png

Method 2
interaction2.png
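
The two figures are not reproduced in these notes, so the sketch below is one common reading of the two approaches rather than a transcription of the images: either concatenate the category strings and one-hot encode the result, or one-hot encode each feature and multiply every pair of columns. Both give the same indicators for the combinations that occur in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "category_ad":   ["auto_part", "music_tickets", "mobile_phones"],
    "category_site": ["game_news", "music_news", "auto_blog"],
})

# Approach A: concatenate the two categories, then one-hot encode the result.
df["ad_site"] = df["category_ad"] + " | " + df["category_site"]
onehot_a = pd.get_dummies(df["ad_site"])

# Approach B: one-hot encode each feature, then multiply every pair of columns.
ad = pd.get_dummies(df["category_ad"]).astype(int)
site = pd.get_dummies(df["category_site"]).astype(int)
onehot_b = pd.DataFrame({
    f"{a} | {s}": ad[a] * site[s] for a in ad.columns for s in site.columns
})
```

Approach B also produces columns for combinations that never occur (all zeros here), which is one reason the number of interaction features grows quickly.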

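  • A similar idea can also be applied to numeric variables
    interge_interaction.png

In fact, this is not limited to multiplication; other operations can be used as well.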

  • Multiplication
  • Sum
  • Diff
  • Division
  • ..
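
For numeric features, the operations listed above might look like this in pandas. The feature names f1 and f2 and their values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features f1 and f2.
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [4.0, 0.0, 2.0]})

df["f1_mul_f2"] = df["f1"] * df["f2"]
df["f1_sum_f2"] = df["f1"] + df["f2"]
df["f1_diff_f2"] = df["f1"] - df["f2"]
# Division by zero gives inf; replace it with NaN so it can be handled explicitly later.
df["f1_div_f2"] = (df["f1"] / df["f2"]).replace([np.inf, -np.inf], np.nan)
```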

Practical Notes

  • We have a lot of possible interactions: N*N for N features.
    • a. Even more if several types of interactions are used
  • Need to reduce their number
    • a. Dimensionality reduction
    • b. Feature selection

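This approach generates a large number of features, which can be reduced with feature selection or dimensionality reduction. Feature selection is illustrated below.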
sele.png
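
One possible way to do this selection, shown here as an assumption rather than as the method in the figure, is to generate all pairwise products, fit a random forest on them, and keep the interactions with the highest importance. The data is synthetic, with a target built from the product of f0 and f1.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data: the target depends on the sign of f0 * f1.
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
y = (X["f0"] * X["f1"] > 0).astype(int)

# Generate all pairwise products.
inter = pd.DataFrame({f"{a}*{b}": X[a] * X[b] for a, b in combinations(X.columns, 2)})

# Rank interactions by random forest importance and keep the top few.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(inter, y)
importance = pd.Series(rf.feature_importances_, index=inter.columns).sort_values(ascending=False)
selected = importance.head(3).index.tolist()
```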

Interactions' order

  • We looked at 2nd order interactions.
  • Such an approach can be generalized for higher orders.
  • It is hard to do generation and selection automatically.
  • Manual building of high-order interactions is some kind of art.

Extract features from DT

tree_interaction.png

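Consider a decision tree. Let's map each leaf to a binary feature. The index of the leaf an object falls into can be used as the value of a new categorical feature. If we use an ensemble of trees instead of a single tree, for example a random forest, this operation can be applied to each tree in the ensemble. This is a powerful way to extract high-order interactions.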

  • How to use it

In sklearn:

tree_model.apply()

In xgboost:

booster.predict(pred_leaf=True)
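
A small sketch of the sklearn variant: apply() on a fitted random forest returns the leaf index per tree, and those indices are then one-hot encoded. The dataset is synthetic and the forest settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

# apply() returns, for each sample, the index of the leaf it reaches in every tree.
leaves = forest.apply(X)  # shape: (n_samples, n_estimators)

# Treat each tree's leaf index as a categorical feature and one-hot encode it.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)  # sparse matrix
```

Each row of leaf_features has exactly one active column per tree, so it can be passed directly to a linear model.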

Conclusion

  • We looked at ways to build an interaction of categorical attributes
  • Extended this approach to real-valued features
  • Learned how to extract features via decision trees

t-SNE

t-SNE is used for exploratory data analysis. It can also be viewed as a method for obtaining features from data.

Practical Notes

  • Result heavily depends on hyperparameters (perplexity)
    • Good practice is to use several projections with different perplexities (5-100)
  • Due to its stochastic nature, tSNE provides different projections even for the same data/hyperparams
    • Train and test should be projected together
  • tSNE runs for a long time with a big number of features
    • It is common to do dimensionality reduction before projection.
  • Implementation of tSNE can be found in sklearn library.
    • But personally I prefer the stand-alone Python package tsne due to its faster speed.
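
The notes above could be put together as follows with sklearn's TSNE (used here instead of the stand-alone tsne package to keep the example self-contained). The matrices are random placeholders, and the PCA size and perplexity values are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical train and test matrices with the same columns.
X_train = rng.normal(size=(150, 20))
X_test = rng.normal(size=(50, 20))

# Project train and test together, after a PCA step to cut the dimensionality.
X_all = np.vstack([X_train, X_test])
X_reduced = PCA(n_components=10, random_state=0).fit_transform(X_all)

tsne_features = {}
for perplexity in (5, 30, 50):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X_reduced)
    # Split the joint embedding back into train and test parts.
    tsne_features[perplexity] = (emb[:len(X_train)], emb[len(X_train):])
```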

Conclusion

  • tSNE is a great tool for visualization
  • It can be used as a feature as well
  • Be careful with interpretation of results
  • Try different perplexities
