Machine Learning Notes (2): Ensemble Learning, Background for Random Forests

Date: 2019-12-29

Tags: machine learning, notes, ensemble, random forest, background

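Every machine learning algorithm can be seen as one way of looking at the data.

It is much like how we look at a problem or an opinion: every perspective has something reasonable about it and something one-sided. The same holds for machine learning. To understand the data better, we should therefore examine it from as many perspectives as possible. Then, for new test data, whether the task is classification or regression, we have a better chance of predicting accurately.

This process is exactly what is called an ensemble.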

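Ensemble learning

Ensemble learning in machine learning means picking several algorithms, training a model with each of them on the same training data, and then combining their predictions by voting. The minority yields to the majority: the result given by most of the models becomes the final decision. That is the core idea of ensemble learning.

1. Voting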

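from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Three different algorithms, combined by majority (hard) voting
voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC()),
    ('dt_clf', DecisionTreeClassifier(random_state=666))],
                             voting='hard')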
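The snippet above only constructs the classifier. As a minimal sketch of how it could be trained and scored, the following uses a synthetic dataset from make_moons and a train/test split; both are assumptions made for illustration and are not part of the original notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic two-class dataset stands in for real training data
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC()),
    ('dt_clf', DecisionTreeClassifier(random_state=666))],
    voting='hard')
voting_clf.fit(X_train, y_train)

# Fraction of test samples on which the majority vote is correct
accuracy = voting_clf.score(X_test, y_test)
print(accuracy)
```

The exact accuracy depends on the data and the random seeds, so no particular value should be expected.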

VotingClassifier

class sklearn.ensemble.VotingClassifier(estimators, voting='hard', weights=None, n_jobs=1, flatten_transform=None)

The voting parameter deserves an explanation:

As an example, suppose three models have each computed a probability for every class on the same binary classification problem:

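Model 1: A 99%, B 1%
Model 2: A 49%, B 51%
Model 3: A 49%, B 51%

With a plain vote, the sample is classified as B, two votes to one. This is called hard voting.

That result is clearly questionable, though: model 1 is very sure the class is A (99%), while models 2 and 3 can barely tell A from B (49% vs. 51%). In this situation, classifying the sample as A is more reasonable.

This motivates soft voting, which votes by probability, averaging the predicted probabilities of each class:

$$p(A)=\frac{0.99+0.49+0.49}{3}\approx 0.657,\qquad p(B)=\frac{0.01+0.51+0.51}{3}\approx 0.343$$

Since p(A) > p(B), the sample should be classified as A.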
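The two voting rules can be checked on the three-model example with a few lines of plain Python. This is only a sketch of the arithmetic; the helper names hard_vote and soft_vote are made up for illustration and are not sklearn functions.

```python
# Predicted class probabilities from the three models in the example
probas = [
    {"A": 0.99, "B": 0.01},  # model 1
    {"A": 0.49, "B": 0.51},  # model 2
    {"A": 0.49, "B": 0.51},  # model 3
]

def hard_vote(probas):
    """Each model votes for its most probable class; the majority wins."""
    votes = [max(p, key=p.get) for p in probas]
    return max(set(votes), key=votes.count)

def soft_vote(probas):
    """Average the probabilities per class and pick the highest average."""
    classes = probas[0].keys()
    avg = {c: sum(p[c] for p in probas) / len(probas) for c in classes}
    return max(avg, key=avg.get), avg

hard_label = hard_vote(probas)
soft_label, avg = soft_vote(probas)
print(hard_label)  # B
print(soft_label, round(avg["A"], 3), round(avg["B"], 3))  # A 0.657 0.343
```

In sklearn, soft voting is selected with voting='soft', and every estimator must then provide predict_proba. SVC only does so when it is created with probability=True.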

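2. Bagging

From the point of view of voting, even though there are many machine learning algorithms, there are still not enough of them. So we want to create as many sub-models as possible and combine their opinions. At the same time, the sub-models must differ from one another, otherwise voting is pointless.

We want as many sub-models as possible.
The sub-models must differ from one another.

So how do we make sure the sub-models differ?

A simple approach is to let each sub-model train on only a part of the training set. This raises another question: each sub-model has seen only part of the information in the training data, so won't its accuracy be low? Yes, the accuracy of a single sub-model does drop, but that does not matter.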

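Suppose each sub-model has an accuracy of p = 51%.

Then, assuming the m sub-models make their errors independently of one another and the final answer is the strict majority vote, the accuracy of the whole system is:

$$P=\sum_{i=\lfloor m/2\rfloor+1}^{m}C_m^i\,p^i(1-p)^{m-i}$$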

import numpy as np
from scipy.special import comb

def f(x, n):
    # Probability that at least x of n sub-models (each 51% accurate) are correct
    r = 0
    for i in range(x, n+1):
        r += comb(n,i)*np.power(0.51,i)*np.power(0.49,n-i)
    return r

f(2,3) = 0.5149980000000001
f(251,500) = 0.6564399889597903

The code above shows that when each sub-model is 51% accurate, a system of 3 sub-models reaches an accuracy of about 51.5%, and a system of 500 sub-models reaches about 65.6%.
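To see how the result changes with the accuracy of the sub-models, the same sum can be written with p as a parameter. This is a sketch that uses only the standard library. The function name majority_accuracy and the 60% figure are assumptions chosen for illustration, and the calculation still assumes the sub-models err independently.

```python
from math import comb

def majority_accuracy(p, m):
    """Probability that more than half of m independent sub-models,
    each correct with probability p, give the right answer."""
    first_majority = m // 2 + 1  # smallest count that is a strict majority
    return sum(comb(m, i) * p**i * (1 - p)**(m - i)
               for i in range(first_majority, m + 1))

# Reproduces the two cases computed above (3 and 500 sub-models at 51%)
for m in (3, 500):
    print(m, majority_accuracy(0.51, m))

# Hypothetical case: sub-models that are 60% accurate
print(majority_accuracy(0.60, 500))
```

In practice, sub-models trained on overlapping data are correlated, so the real gain tends to be smaller than this idealized calculation suggests.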

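How do we take a part of the training data? In other words, how do we sample?

Sampling with replacement: bagging (more commonly used).
Sampling without replacement: pasting.

We call sampling with replacement bagging, and sampling without replacement pasting.

Sampling with replacement lets us train more models. Take a training set of 500 samples with 100 samples drawn per sub-model: if no sample may be reused, sampling without replacement can train at most 5 sub-models, whereas sampling with replacement can train thousands. In addition, because pasting allows so few sub-models, the way the 500 samples happen to be split into 5 groups of 100 matters and may strongly affect the final result. With bagging, training thousands of sub-models averages out this randomness to some extent.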
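The difference between the two sampling schemes can be sketched with the standard library alone. The sizes (500 samples, 100 per sub-model) follow the example above. The seed and the choice of 1000 bagging subsets are arbitrary assumptions, and pasting is modeled here in the strict sense that no sample is reused across subsets.

```python
import random

rng = random.Random(666)          # fixed seed so the sketch is repeatable
indices = list(range(500))        # 500 training samples, identified by index
subset_size = 100

# Pasting (strict sense): no sample is ever reused, so the data can
# only be cut into 500 / 100 = 5 disjoint subsets.
shuffled = indices[:]
rng.shuffle(shuffled)
pasting_subsets = [shuffled[i:i + subset_size]
                   for i in range(0, len(shuffled), subset_size)]

# Bagging: every subset is drawn with replacement from the full data,
# so we can draw as many subsets as we like.
n_models = 1000
bagging_subsets = [rng.choices(indices, k=subset_size)
                   for _ in range(n_models)]

print(len(pasting_subsets), len(bagging_subsets))  # 5 1000
```

Each list of indices would then be used to select the rows on which one sub-model is trained.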

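Out-of-bag (OOB)

One consequence of sampling with replacement is that, in a finite number of draws, some samples may never be selected. When n samples are drawn with replacement from a training set of size n, about 37% of the samples are never drawn.

For the mathematical proof, see:

Where the 37% comes from

We can use the samples that were never drawn as a validation set. In sklearn, oob_score_ is the score computed on these out-of-bag samples.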
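The 37% figure can be checked with a short calculation. A given sample is missed by one draw with probability 1 - 1/n, so it is missed by all n independent draws with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. The sketch below compares this value with a single simulated bootstrap sample; n = 500 and the seed are arbitrary assumptions.

```python
import math
import random

n = 500

# Chance that one particular sample is missed by all n draws
p_never_drawn = (1 - 1 / n) ** n
print(p_never_drawn, math.exp(-1))

# One simulated bootstrap sample: draw n indices with replacement
rng = random.Random(666)
drawn = set(rng.choices(range(n), k=n))
oob_fraction = 1 - len(drawn) / n   # share of samples never drawn
print(oob_fraction)
```

The simulated fraction varies from run to run around the theoretical value.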

Bagging in sklearn

class sklearn.ensemble.BaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)

base_estimator : object or None, optional (default=None)

The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.

n_estimators : int, optional (default=10)

The number of base estimators in the ensemble.

max_samples : int or float, optional (default=1.0)

The number of samples to draw from X to train each base estimator.

If int, then draw max_samples samples.

If float, then draw max_samples * X.shape[0] samples.

max_features : int or float, optional (default=1.0)

The number of features to draw from X to train each base estimator.

If int, then draw max_features features.

If float, then draw max_features * X.shape[1] features.

bootstrap : boolean, optional (default=True)

Whether samples are drawn with replacement.

bootstrap_features : boolean, optional (default=False)

Whether features are drawn with replacement.

oob_score : bool, optional (default=False)

Whether to use out-of-bag samples to estimate the generalization error.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble. See the Glossary.

New in version 0.17: warm_start constructor parameter.

n_jobs : int or None, optional (default=None)

The number of jobs to run in parallel for both fit and predict. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

verbose : int, optional (default=0)

Controls the verbosity when fitting and predicting.

max_samples: the number of samples drawn for each sub-model

bootstrap: True means sampling with replacement

oob_score: whether to use the out-of-bag samples for validation

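from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 decision trees, each trained on 100 samples drawn with replacement
bagging_clf = BaggingClassifier(DecisionTreeClassifier(),
                               n_estimators=500, max_samples=100,
                               bootstrap=True, oob_score=True,
                               n_jobs=-1)
bagging_clf.fit(X, y)  # X, y: training features and labels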
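The call above assumes that X and y already exist. A self-contained sketch that also reads the out-of-bag score might look as follows. The make_moons dataset and the seeds are assumptions for illustration. The base estimator is passed positionally because the keyword was renamed from base_estimator to estimator in newer scikit-learn releases.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: a synthetic two-class dataset stands in for X, y
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

bagging_clf = BaggingClassifier(DecisionTreeClassifier(),
                               n_estimators=500, max_samples=100,
                               bootstrap=True, oob_score=True,
                               random_state=666)
bagging_clf.fit(X, y)

# Accuracy estimated on the samples each tree did not see during training
print(bagging_clf.oob_score_)
```

Because the out-of-bag samples act as a validation set, no separate train/test split is needed to get this estimate.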

Random Forest

With the ensemble concepts above in place, random forests are easy to understand. A random forest is simply a model obtained by building an ensemble out of many decision trees.
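As a brief preview, sklearn offers this kind of ensemble as RandomForestClassifier. The sketch below only shows that the fitted model really is a collection of decision trees and that it supports the same out-of-bag scoring. The dataset and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

# Assumption: a synthetic two-class dataset stands in for real data
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

rf_clf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=666)
rf_clf.fit(X, y)

print(len(rf_clf.estimators_))  # 500 fitted decision trees
print(rf_clf.oob_score_)        # out-of-bag accuracy estimate
```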

Later articles will cover random forests in more detail.

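Articles in the Machine Learning Notes series

Machine Learning Notes (1): Decision Trees

Machine Learning Notes (2): Ensemble Learning, Background for Random Forests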
