Feature selection

Date: 2019-11-26

Tags: feature selection


1. Removing features with low variance

sklearn.feature_selection.VarianceThreshold(threshold=0.0)
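
As a minimal sketch of what the default threshold does, the toy matrix below (made up for illustration, not taken from the original post) has one constant column. With threshold=0.0 only that zero-variance column should be dropped.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): column 0 is constant, columns 1 and 2 vary.
X = np.array([[0, 2, 0],
              [0, 1, 4],
              [0, 1, 1]])

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-column variance seen during fit
print(selector.get_support())  # boolean mask of the columns that were kept
print(X_reduced.shape)
```

Raising threshold removes columns whose variance is at or below that value, so the right setting depends on the scale of each feature.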

2. Univariate feature selection

sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)

Selects the k highest-scoring features and removes the rest.

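sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)

Selects the given percentage of highest-scoring features.

For score_func, f_regression (univariate linear regression tests) is used for regression, while chi2 (chi-squared test), f_classif (ANOVA F-value) and similar functions are used for classification.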
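
The two selectors can be compared side by side. The sketch below assumes the iris dataset and the chi2 score function, neither of which the original post prescribes at this point. It keeps the two best features with SelectKBest and the top 50 percent with SelectPercentile.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2

# Iris has 4 non-negative features, so chi2 is applicable.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi2 score.
X_kbest = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Keep the top 50 percent of the features by chi2 score.
X_pct = SelectPercentile(score_func=chi2, percentile=50).fit_transform(X, y)

print(X.shape, X_kbest.shape, X_pct.shape)
```

With four features, both settings retain two columns, so the choice between them is mostly whether a fixed count or a fraction is more convenient.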

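sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05)

Selects the features whose p-values fall below alpha, based on a false positive rate test.

Univariate feature selector with a configurable strategy:

sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)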
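
To illustrate the configurable strategy, the sketch below (again assuming the iris dataset) sets mode='k_best' so that GenericUnivariateSelect behaves like SelectKBest. The mode parameter also accepts 'percentile', 'fpr', 'fdr' and 'fwe', and param is interpreted according to the chosen mode.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (GenericUnivariateSelect, SelectKBest,
                                       f_classif)

X, y = load_iris(return_X_y=True)

# mode='k_best' with param=2 keeps the 2 highest-scoring features.
generic = GenericUnivariateSelect(score_func=f_classif, mode='k_best', param=2)
X_generic = generic.fit_transform(X, y)

# The dedicated selector, configured the same way, for comparison.
X_kbest = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

print(X_generic.shape)
print((X_generic == X_kbest).all())
```

Because the strategy is an ordinary parameter, it can be tuned in a hyper-parameter search like any other setting.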

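3. Recursive feature elimination (RFE)

The main idea of recursive feature elimination is to repeatedly build a model (for example an SVM or a regression model), pick the best (or worst) feature (for instance according to the coefficients), set that feature aside, and then repeat the process on the remaining features until every feature has been visited. The order in which features are eliminated gives their ranking. RFE is therefore a greedy algorithm for finding the best feature subset.
The stability of RFE depends heavily on the model used in each iteration. For example, if RFE is built on ordinary regression, which is unstable without regularization, then RFE is unstable too; if it is built on Ridge, whose regularized regression is stable, then RFE is stable.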

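sklearn.feature_selection.RFECV(estimator, step=1, cv=None, scoring=None, estimator_params=None, verbose=0)

RFECV runs RFE inside a cross-validation loop to choose the number of features; the example that follows uses the plain RFE class.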

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
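
The example above uses plain RFE with a fixed target of one feature. A minimal sketch of the cross-validated variant, RFECV, follows. The synthetic dataset and the choice of cv=5 are assumptions made for illustration, not part of the original post.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic data (hypothetical): 10 features, of which 3 are informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Eliminate one feature per step and score each subset size with 5-fold CV.
selector = RFECV(estimator=SVC(kernel="linear"), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)  # number of features chosen by cross-validation
print(selector.support_)     # boolean mask of the selected features
print(selector.ranking_)     # rank 1 marks the selected features
```

The estimator must expose coef_ or feature_importances_, which is why a linear kernel is used here.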

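4. Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used with any estimator (machine learning algorithm) that exposes a coef_ or feature_importances_ attribute after fitting.
Features whose corresponding coef_ or feature_importances_ values fall below the threshold parameter are considered unimportant and removed. Besides giving the threshold as a number, you can pass a string to have a built-in heuristic find a threshold. The available strings are "mean", "median", and float multiples of these, such as "0.1*mean".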

sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False)
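
The string thresholds can be tried directly. The sketch below assumes a random forest on the iris dataset (an illustrative choice, not from the original post) and reports the numeric threshold each string resolves to, together with the number of features kept.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

results = {}
for threshold in ("mean", "median", "0.1*mean"):
    # A fixed random_state makes the three fits identical, so only the
    # threshold differs between iterations.
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    sfm = SelectFromModel(forest, threshold=threshold).fit(X, y)
    # threshold_ is the numeric value the string was resolved to.
    results[threshold] = (sfm.threshold_, sfm.transform(X).shape[1])
    print(threshold, results[threshold])
```

Since forest importances sum to one, "mean" resolves to 1 / n_features here, and "0.1*mean" is a tenth of that, so it keeps at least as many features.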

Used together with Lasso to select the two best features from the Boston dataset.

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the boston dataset.
# Note: load_boston was removed in scikit-learn 1.2, so this example
# requires an older release.
boston = load_boston()
X, y = boston['data'], boston['target']

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()

# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
n_features = sfm.transform(X).shape[1]

# Reset the threshold till the number of features equals two.
# Note that the attribute can be set directly instead of repeatedly
# fitting the metatransformer.
while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]

# Plot the selected two features from X.
plt.title(
    "Features selected from Boston using SelectFromModel with "
    "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature number 1")
plt.ylabel("Feature number 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()

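4.1. L1-based feature selection

L1 regularization adds the l1 norm of the coefficient vector w to the loss function as a penalty term. Because every non-zero coefficient contributes to the penalty, the coefficients of weak features are driven to exactly 0. Models learned with L1 regularization therefore tend to be sparse (many entries of w are 0), which makes L1 regularization a good feature selection method.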

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)  # (150, 4)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)  # (150, 3)
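
For LinearSVC the parameter C controls the sparsity: the smaller C is, the stronger the L1 penalty and the fewer features survive. The sketch below loops over a few C values, which were chosen arbitrarily for illustration, and prints how many iris features are kept for each.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

kept = {}
for C in (0.01, 0.1, 1.0):
    # penalty="l1" requires dual=False; max_iter is raised to help convergence.
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    # Features whose coefficients are (near) zero are dropped.
    kept[C] = SelectFromModel(lsvc, prefit=True).transform(X).shape[1]
    print(C, kept[C])
```

For Lasso the analogous knob is alpha, where a larger alpha selects fewer features.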

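4.2. Randomized sparse models

A limitation of L1-based sparse models is that, when faced with a group of mutually correlated features, they tend to select only one of them. To mitigate this, randomization techniques can be used: the sparse model is re-estimated many times while perturbing the design matrix or sub-sampling the data, and one counts how many times a given regressor is selected.

RandomizedLasso implements this strategy for regression settings, using the Lasso. Note that RandomizedLasso and RandomizedLogisticRegression were deprecated in scikit-learn 0.19 and removed in 0.21, so the signatures here apply to older releases only.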

sklearn.linear_model.RandomizedLasso(alpha='aic', scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))

RandomizedLogisticRegression uses logistic regression and is suited to classification tasks.

sklearn.linear_model.RandomizedLogisticRegression(C=1, scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, tol=0.001, fit_intercept=True, verbose=False, normalize=True, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))
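
The resampling idea can be sketched by hand with a plain Lasso. This is not the RandomizedLasso implementation: it only sub-samples rows and omits the random feature rescaling controlled by the scaling parameter. The variable names sample_fraction, n_resampling and selection_threshold are borrowed from the signature above, and the dataset and alpha=1.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
# Synthetic data (hypothetical): 20 features, of which 5 are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

n_resampling = 100          # number of random sub-samples
sample_fraction = 0.75      # fraction of rows drawn for each fit
selection_threshold = 0.25  # minimum selection frequency to keep a feature

n_samples, n_features = X.shape
counts = np.zeros(n_features)
for _ in range(n_resampling):
    # Draw a random subset of rows without replacement and refit the Lasso.
    rows = rng.choice(n_samples, int(sample_fraction * n_samples),
                      replace=False)
    lasso = Lasso(alpha=1.0).fit(X[rows], y[rows])
    # Count the features that received a non-zero coefficient in this fit.
    counts += (lasso.coef_ != 0)

selection_frequency = counts / n_resampling
selected = np.flatnonzero(selection_frequency >= selection_threshold)
print(selection_frequency)
print(selected)
```

Features that are selected in most resamples are the stable ones; features picked only occasionally are likely interchangeable with correlated neighbours or noise.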

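4.3. Tree-based feature selection
Tree-based estimators (see the sklearn.tree module and the forests of trees in the sklearn.ensemble module) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the sklearn.feature_selection.SelectFromModel meta-transformer):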

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)  # (150, 4)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
print(clf.feature_importances_)
# array([ 0.04..., 0.05..., 0.4..., 0.4...])
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
print(X_new.shape)  # (150, 2)

5. Feature selection as part of a pipeline
Feature selection is usually used as a preprocessing step before the actual learning. The recommended way to do this in scikit-learn is to use sklearn.pipeline.Pipeline.

sklearn.pipeline.Pipeline(steps)
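
Pipeline of transforms with a final estimator.
Sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement the fit and transform methods. The final estimator only needs to implement fit.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. To that end, it lets you set the parameters of the individual steps by name, with the step name and the parameter name separated by '__', as in the following example: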

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# generate some data to play with
# (make_classification is imported directly because the samples_generator
# module is no longer available in recent scikit-learn releases)
X, y = make_classification(
    n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
# You can set the parameters using the names issued
# For instance, fit using a k of 10 in the SelectKBest
# and a parameter 'C' of the svm
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
# Pipeline(steps=[...])
prediction = anova_svm.predict(X)
anova_svm.score(X, y)
# 0.77...
# getting the selected features chosen by anova_filter
anova_svm.named_steps['anova'].get_support()
# array([ True, True, True, False, False, True, False, True, True, True,
#        False, False, True, False, True, False, False, False, False,
#        True], dtype=bool)
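
The '__' naming convention is also what allows the steps to be cross-validated together. The sketch below searches over anova__k and svc__C with GridSearchCV. The grid values, cv=5 and the use of f_classif as the score function are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Same synthetic data setup as the pipeline example: 20 features by default.
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)

pipe = Pipeline([('anova', SelectKBest(f_classif)),
                 ('svc', SVC(kernel='linear'))])

# Keys follow the <step name>__<parameter name> pattern.
param_grid = {'anova__k': [5, 10, 20], 'svc__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the selector sits inside the pipeline, it is refit on each training fold, so the held-out fold never influences which features are chosen.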
A short syntax example:

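from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# penalty="l1" is only supported with dual=False
clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])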
clf.fit(X, y)

Author: 面向将来的历史. Source: CSDN. Original: https://blog.csdn.net/a1368783069/article/details/52048349