Original article: https://www.kdnuggets.com/2017/06/practical-importance-feature-selection.html
Feature selection is useful on a variety of fronts: it is the best weapon against the Curse of Dimensionality; it can reduce overall training times; and it is a powerful defense against overfitting, increasing generalizability.
By Matthew Mayo, KDnuggets.
If you wanted to classify animals, for example, based on a plethora of relevant collected data, you would quickly find that all sorts of potential data attributes, or features, were relatively unhelpful for classification. For example, given that most living creatures have precisely 1 heart, this particular feature would not be beneficial, from a learning perspective. On the other hand, an attribute denoting whether or not a given animal is hoofed would likely be a powerful predictor.
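To make this concrete: a feature that takes the same value for every animal, like the heart count above, carries no information a learner can use. Below is a minimal sketch, assuming scikit-learn is available, that drops such zero-variance features with VarianceThreshold; the tiny "animals" matrix and its two columns are hypothetical illustrations, not data from the article.

# Drop uninformative, constant features with scikit-learn's VarianceThreshold.
# The toy data below (columns: number_of_hearts, is_hoofed) is hypothetical.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1, 1],   # every animal has exactly 1 heart; is_hoofed varies
    [1, 0],
    [1, 1],
    [1, 0],
])

selector = VarianceThreshold(threshold=0.0)  # remove features with zero variance
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # [False  True] -- only is_hoofed survives
print(X_reduced.shape)         # (4, 1)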
Further, using all of these irrelevant attributes, mixed in with the powerful predictors, may actually have a negative effect on the resulting model. This is to say nothing of the increased training times that may come along with the inclusion of useless attributes, or the overfitting which may occur on the training data.
Feature selection is the process of narrowing down a subset of features, or attributes, to be used in the predictive modeling process. Feature selection is useful on a variety of fronts: it is the best weapon against the Curse of Dimensionality; it can reduce overall training times; and it is a powerful defense against overfitting, increasing model generalizability.
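As a concrete sketch of what narrowing down a feature subset can look like in practice, here is univariate selection with scikit-learn's SelectKBest on the iris data; the ANOVA F-test scorer and the choice of k=2 are illustrative assumptions, not prescriptions from the article.

# Score each iris feature with the ANOVA F-test and keep the top k.
# f_classif and k=2 are illustrative choices, not the only options.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

for name, score, kept in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name}: F = {score:.1f}, kept = {kept}")

On iris, the two petal measurements score far higher than the sepal ones, which anticipates the example Zimbres gives below.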
Something I read recently -- written so eloquently and concisely by data scientist Rubens Zimbres -- alludes to the importance of feature selection from a practical standpoint:
After some experiences using stacked neural nets, parallel neural nets, asymmetric configs, simple neural nets, multiple layers, dropouts, activation functions, etc., there is one conclusion: There's NOTHING like a good Feature Selection.
Having had some professional contact with Rubens Zimbres in the past, I reached out to him for some elaboration. He provided the following:
Feature selection should be one of the main concerns for a Data Scientist. Accuracy and generalization power can be leveraged by a correct feature selection, based on correlation, skewness, t-test, ANOVA, entropy and information gain.
Many times a correct feature selection allows you to develop simpler and faster Machine Learning models. Consider the picture below (Support Vector Machine classification of the IRIS dataset): on the left side, a wrong variable selection is presented. The linear kernel cannot handle the classification task properly, nor can the radial basis function kernel. On the right side, petal width and petal length were selected as features, and even the linear kernel is quite accurate. A correct variable selection, a good algorithm choice and hyperparameter tuning are the keys to success. The picture below was made with Python.
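The contrast Zimbres describes is easy to recreate. Below is a minimal sketch, under the assumption of a simple 70/30 train/test split (an illustrative choice), that fits a linear-kernel SVM twice: once on the two sepal features (a poor selection) and once on petal width and length (the good selection from the quote). Exact accuracies depend on the split seed.

# Compare a linear-kernel SVM on a poor feature subset (sepal measurements)
# versus a good one (petal measurements), as in the example above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target)

for label, cols in [("sepal features", [0, 1]), ("petal features", [2, 3])]:
    clf = SVC(kernel="linear").fit(X_train[:, cols], y_train)
    print(f"{label}: test accuracy = {clf.score(X_test[:, cols], y_test):.2f}")

On a typical split, the petal-feature model reaches well above 90% accuracy while the sepal-feature model trails it noticeably, which is exactly the left/right contrast the original figure illustrates.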
In a time when ample processing power can tempt us to think that feature selection may not be as relevant as it once was, it's important to remember that this only accounts for one of the numerous benefits of informed feature selection -- decreased training times. As Zimbres notes above, with a simple concrete example, feature selection can quite literally mean the difference between valid, generalizable models and a big waste of time.