Notes from the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers.
Statistics and distance based features
This section focuses on advanced feature engineering: statistics of one feature computed within groups defined by another feature, and features obtained by analyzing the neighborhood of a given point.
Groupby and nearest neighbor methods
Example: here is some data from a CTR (click-through rate) task.

We can hypothesize that the ad with the lowest price on the page will attract most of the attention, while the other ads on the page will be less attractive. It is easy to compute features based on this hypothesis: for every user and page, we can add the minimum and maximum ad price. The position of the ad with the lowest price can also be used.

Code implementation
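A minimal sketch of these groupby statistics, assuming a hypothetical CTR table with `user_id`, `page_id`, `ad_price`, and `ad_position` columns (the lecture does not fix a schema):

```python
import pandas as pd

# Hypothetical CTR data: one row per ad impression.
df = pd.DataFrame({
    "user_id":     [1, 1, 1, 2, 2],
    "page_id":     [10, 10, 10, 20, 20],
    "ad_price":    [5.0, 2.0, 9.0, 3.0, 7.0],
    "ad_position": [1, 2, 3, 1, 2],
})

gb = df.groupby(["user_id", "page_id"])
# Lowest and highest ad price shown to this user on this page.
df["min_price"] = gb["ad_price"].transform("min")
df["max_price"] = gb["ad_price"].transform("max")
# Position of the cheapest ad on the page.
df["cheapest_pos"] = gb["ad_price"].transform(
    lambda s: df.loc[s.idxmin(), "ad_position"])
```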
- More features
- How many pages user visited
- Standard deviation of prices
- Most visited page
- Many, many more
What if there is no feature to group by like this? We can use nearest neighbors instead.
Neighbors
- Explicit group is not needed
- More flexible
- Much harder to implement
Examples
- Number of houses within 500m, 1000m, ...
- Average price per square meter within 500m, 1000m, ...
- Number of schools/supermarkets/parking lots within 500m, 1000m, ...
- Distance to closest subway station
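The radius-based examples above can be sketched with sklearn's `NearestNeighbors`; the housing coordinates and prices below are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy housing data: (x, y) coordinates in meters and price per square meter.
coords = np.array([[0, 0], [100, 0], [300, 300], [4000, 0]], dtype=float)
price_per_m2 = np.array([1000.0, 1200.0, 800.0, 500.0])

nn = NearestNeighbors().fit(coords)
# For each house, indices of all houses within a 500 m radius
# (the query point itself is returned too and must be dropped).
_, idx = nn.radius_neighbors(coords, radius=500.0)

n_houses_500m, avg_price_500m = [], []
for i, neigh in enumerate(idx):
    neigh = neigh[neigh != i]          # drop the point itself
    n_houses_500m.append(len(neigh))
    avg_price_500m.append(
        price_per_m2[neigh].mean() if len(neigh) else np.nan)
```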
The instructor used this approach in the Springleaf competition.
KNN features in Springleaf
- Mean encode all the variables
- For every point, find 2000 nearest neighbors using the Bray-Curtis metric
$$\frac{\sum{|u_i - v_i|}}{\sum{|u_i + v_i|}}$$
- Calculate various features from those 2000 neighbors
Evaluate
- Mean target of nearest 5, 10, 15, 500, 2000 neighbors
- Mean distance to 10 closest neighbors
- Mean distance to 10 closest neighbors with target 1
- Mean distance to 10 closest neighbors with target 0
- Example of feature fusion
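A sketch of such KNN features on synthetic data, using sklearn's `NearestNeighbors` with the Bray-Curtis metric (the lecture used 2000 neighbors; k=10 keeps the toy example small):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for mean-encoded features, all in [0, 1]
# (Bray-Curtis expects non-negative inputs).
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = (rng.random(100) > 0.5).astype(int)

k = 10
nn = NearestNeighbors(n_neighbors=k + 1, metric="braycurtis").fit(X)
dist, idx = nn.kneighbors(X)
dist, idx = dist[:, 1:], idx[:, 1:]   # drop each point itself

mean_target_k = y[idx].mean(axis=1)   # mean target of the k neighbors
mean_dist_k = dist.mean(axis=1)       # mean distance to the k neighbors
# Mean distance to neighbors with target 1 (NaN if none among the k).
mask1 = y[idx] == 1
with np.errstate(invalid="ignore", divide="ignore"):
    mean_dist_t1 = np.where(mask1.any(axis=1),
                            (dist * mask1).sum(axis=1) / mask1.sum(axis=1),
                            np.nan)
```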

Notes about Matrix Factorization
- Can be applied only to some columns
- Can provide additional diversity
- Good for ensembles
- It is a lossy transformation. Its efficiency depends on:
- Particular task
- Number of latent factors
Implementation
- Several MF methods can be found in sklearn
- SVD and PCA
  - Standard tools for Matrix Factorization
- TruncatedSVD
  - Works with sparse matrices
- Non-negative Matrix Factorization (NMF)
  - Ensures that all latent factors are non-negative
  - Good for count-like data
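A minimal sketch of the sklearn tools listed above on a toy sparse count matrix (the data is made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD, NMF

# Toy sparse count matrix, e.g. user x item click counts.
X = csr_matrix(np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 0],
    [1, 0, 4, 1],
    [0, 1, 0, 2],
], dtype=float))

# TruncatedSVD works directly on sparse matrices (unlike PCA).
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X)

# NMF keeps the latent factors non-negative: good for count-like data.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
X_nmf = nmf.fit_transform(X)
```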
NMF for tree-based methods
Non-negative matrix factorization (NMF) transforms data in a way that makes it more suitable for decision trees.
As can be seen, the NMF transform arranges the data along lines parallel to the axes.
The tricks used with linear models can also be applied when factorizing the matrix.

Conclusion
- Matrix Factorization is a very general approach for dimensionality reduction and feature extraction
- It can be applied to transform categorical features into real-valued ones
- Many tricks suitable for linear models are also useful for MF
Feature interactions
All combinations of feature values
Suppose we are building a model to predict the best ad banner to display on a website.
| ... | ad | site | ... | target |
| --- | --- | --- | --- | --- |
| ... | auto_part | game_news | ... | 0 |
| ... | music_tickets | music_news | ... | 1 |
| ... | mobile_phones | auto_blog | ... | 0 |
Combining the category of the banner itself with the category of the site where the banner will be shown forms a very strong feature.
| ... | ad_site | ... | target |
| --- | --- | --- | --- |
| ... | auto_part \| game_news | ... | 0 |
| ... | music_tickets \| music_news | ... | 1 |
| ... | mobile_phones \| auto_blog | ... | 0 |
The combination of these two features forms a new feature, ad_site.
From a technical point of view, there are two ways to build such an interaction.
Method 1

Method 2

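Both methods can be sketched in pandas; the column names and the `|` separator mirror the table above and are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "ad":   ["auto_part", "music_tickets", "mobile_phones"],
    "site": ["game_news", "music_news", "auto_blog"],
})

# Method 1: concatenate the two values, then one-hot encode the result.
df["ad_site"] = df["ad"] + "|" + df["site"]
ohe1 = pd.get_dummies(df["ad_site"], dtype=int)

# Method 2: one-hot encode each feature, then multiply all column pairs.
a = pd.get_dummies(df["ad"], dtype=int)
s = pd.get_dummies(df["site"], dtype=int)
ohe2 = pd.concat(
    {f"{ca}|{cs}": a[ca] * s[cs] for ca in a.columns for cs in s.columns},
    axis=1)
```

Note the difference: method 1 produces columns only for the combinations that actually occur, while method 2 produces a column for every possible pair.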
- A similar idea can also be applied to numeric variables

In fact, this is not limited to multiplication; other operations can be used as well:
- Multiplication
- Sum
- Diff
- Division
- ..
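For numeric features these interactions are simple column-wise operations, e.g. (hypothetical features f1, f2):

```python
import pandas as pd

df = pd.DataFrame({"f1": [1.0, 2.0, 4.0], "f2": [2.0, 2.0, 8.0]})
df["f1_mul_f2"]  = df["f1"] * df["f2"]   # multiplication
df["f1_sum_f2"]  = df["f1"] + df["f2"]   # sum
df["f1_diff_f2"] = df["f1"] - df["f2"]   # difference
df["f1_div_f2"]  = df["f1"] / df["f2"]   # division
```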
Practical Notes
- We have a lot of possible interactions: N*N for N features.
  - a. Even more if we use several types of interactions
- Need to reduce their number
  - a. Dimensionality reduction
  - b. Feature selection
This approach generates a huge number of features, so feature selection or dimensionality reduction can be used to reduce them. Below, feature selection is used as an example.
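One possible sketch of such selection: generate all pairwise products, then keep only the interactions a random forest considers most important (synthetic data; the feature names f0..f5 are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]

# Generate all pairwise products...
inter, inter_names = [], []
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        inter.append(X[:, i] * X[:, j])
        inter_names.append(f"{names[i]}*{names[j]}")
X_all = np.column_stack([X] + inter)
all_names = names + inter_names

# ...then keep only the features the forest finds most important.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_all, y)
top = np.argsort(rf.feature_importances_)[::-1][:10]
selected = [all_names[k] for k in top]
```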

Interactions' order
- We looked at 2nd order interactions.
- Such approach can be generalized for higher orders.
- It is hard to do generation and selection automatically.
- Manual building of high-order interactions is some kind of art.

Consider a decision tree, and map each of its leaves to a binary feature. The index of the leaf an object falls into can be used as the value of a new categorical feature. If instead of a single tree we use an ensemble, e.g. a random forest, this operation can be applied to every tree in it. This is a powerful way to extract high-order interactions.
In sklearn: `tree_model.apply()`
In xgboost: `booster.predict(pred_leaf=True)`
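A small sketch of extracting leaf indices from a random forest with the sklearn call above (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=10, max_depth=3,
                            random_state=0).fit(X, y)

# One column per tree; each entry is the index of the leaf this sample
# falls into, usable as a new categorical feature (e.g. one-hot encoded).
leaves = rf.apply(X)
```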
Conclusion
- We looked at ways to build an interaction of categorical attributes
- Extended this approach to real-valued features
- Learned how to extract features via decision trees
t-SNE
Used for exploratory data analysis; it can also be viewed as a method for extracting features from data.
Practical Notes
- Results heavily depend on hyperparameters (perplexity)
- Good practice is to use several projections with different perplexities (5-100)
- Due to its stochastic nature, tSNE provides different projections even for the same data/hyperparameters
- Train and test should be projected together
- tSNE runs for a long time when there is a large number of features
- It is common to do dimensionality reduction before projection
- An implementation of tSNE can be found in the sklearn library
- Personally, I prefer the stand-alone python package tsne due to its faster speed
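A minimal sketch of the recommended workflow (reduce dimensionality first, then project) using sklearn's implementation; a subset of the digits dataset keeps it quick:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# Reduce dimensionality before tSNE; in a competition, train and test
# rows would be stacked and projected together.
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)[:500]
emb = TSNE(n_components=2, perplexity=30,
           random_state=0).fit_transform(X_pca)
```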
Conclusion
- tSNE is a great tool for visualization
- It can be used as features as well
- Be careful with interpretation of results
- Try different perplexities
Matrix factorization: