#Random Forest
The previous post covered decision trees; this one moves on to an ensemble of trees: the random forest.
#①Aggregation Model
The random forest is still an aggregation model. We have seen two aggregation models before: bagging and the decision tree. Bagging aggregates uniformly while learning: it draws bootstrap datasets D1, D2, ... from the data, trains a model on each, and averages (or votes over) their outputs. The decision tree also aggregates while learning, but conditionally: it splits the data D itself with conditional branches.
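To make the bagging side concrete, here is a minimal sketch of the two ingredients (the names `bootstrap_sample` and `bagging_predict` are mine, chosen for illustration):

```python
import random

def bootstrap_sample(data):
    """Draw len(data) points with replacement: one bootstrap dataset D1."""
    n = len(data)
    return [data[random.randrange(n)] for _ in range(n)]

def bagging_predict(models, x):
    """Uniform aggregation: each sub-model votes +1/-1, the sign of the sum wins."""
    votes = sum(m(x) for m in models)
    return 1 if votes > 0 else -1
```

With three constant voters, two voting +1 and one voting -1, `bagging_predict` returns +1.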
#②Random Forest
#④Feature Selection
During feature selection there is another class of problems to watch for. One is redundant features: for example, drawing both birthday and age, which encode essentially the same information. The other is irrelevant features, which have no bearing on the label at all.
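A quick numeric illustration of the redundant case (the data here is made up for illustration):

```python
import numpy as np

# Hypothetical data: "birth year" fully determines "age",
# so one of the two features is redundant.
rng = np.random.default_rng(0)
birth_year = rng.integers(1950, 2000, size=200)
age = 2024 - birth_year          # perfectly determined by birth_year

corr = np.corrcoef(birth_year, age)[0, 1]
print(corr)  # ~ -1: perfectly (anti-)correlated, so one feature can be dropped
```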
#⑤Code Implementation
Finally, let's look at the characteristics of RF through concrete examples. The first is a binary classification example. As shown in the figure below, the left panel is the result of a single C&RT tree without bootstrap; since random combinations of different features are used, the boundary contains slanted lines. The middle panel is a random forest consisting of one decision tree grown on a bootstrap sample (N' = N/2); the bold points are the ones selected by the bootstrap. The right panel is the model obtained by bagging that single tree; its result is identical to the middle panel, since there is still only one tree.
Next, a more complicated example: many discrete points scattered on a 2D plane, with a boundary shaped like a sine curve. When there is only one tree (t = 1), the left panel below shows the RF formed by that single tree, and the right panel shows the RF obtained by bagging all the trees together. With only one tree, the two panels are identical.
Now for the actual implementation, again using random feature selection. First, a feature-sampling function:
```python
import random as rd
import numpy as np

def choose_samples(self, data, k):
    '''Randomly pick k feature indices and a bootstrap sample of the rows.
    input: data (list of rows, last column is the label), k (feature count)
    output: data_samples, feature'''
    n, d = np.shape(data)
    # sample k feature indices; the last column is the label, so pick from 0..d-2
    feature = []
    for j in range(k):
        feature.append(rd.randint(0, d - 2))
    # bootstrap: draw n row indices with replacement
    index = []
    for i in range(n):
        index.append(rd.randint(0, n - 1))
    # build the sampled dataset, keeping only the chosen features plus the label
    data_samples = []
    for i in range(n):
        data_tmp = []
        for fea in feature:
            data_tmp.append(data[index[i]][fea])
        data_tmp.append(data[index[i]][-1])
        data_samples.append(data_tmp)
    return data_samples, feature
```
This selects k feature dimensions from the data. Next comes building the random forest itself, using the decision tree implemented in the previous post, so that as much as possible everything is written from scratch:
```python
def random_forest(self, data, trees_num):
    '''Build a forest of trees_num trees.
    input: data (list), trees_num
    output: trees_result, trees_feature'''
    decisionTree = tree.decision_tree()
    trees_result = []
    trees_feature = []
    d = np.shape(data)[1]
    # number of features sampled per tree: roughly log2 of the feature count
    if d > 2:
        k = int(math.log(d - 1, 2)) + 1
    else:
        k = 1
    for i in range(trees_num):
        print('The ', i, ' tree. ')
        # bootstrap the rows and sample k features, then grow one tree
        data_samples, feature = self.choose_samples(data, k)
        t = decisionTree.build_tree(data_samples)
        trees_result.append(t)
        trees_feature.append(feature)
    return trees_result, trees_feature
```
Nothing unusual here; the function returns the trained trees and the feature subset each tree used. Next, utility functions for slicing the data and loading it:
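A side note on the k computed in `random_forest`: the subspace size follows a log2-style heuristic, so for the breast cancer data (30 features plus a label column, d = 31) each tree only sees a handful of features. A quick check of that formula in isolation:

```python
import math

def subspace_size(d):
    """Features sampled per tree: int(log2(d - 1)) + 1, as in random_forest."""
    return int(math.log(d - 1, 2)) + 1 if d > 2 else 1

print(subspace_size(31))  # breast cancer: 30 features + label column -> 5
```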
```python
import math
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def split_data(data_train, feature):
    '''Keep only the given feature columns (plus the label).
    input: data_train, feature
    output: data (list)'''
    m = np.shape(data_train)[0]
    data = []
    for i in range(m):
        data_tmp = []
        for x in feature:
            data_tmp.append(data_train[i][x])
        data_tmp.append(data_train[i][-1])
        data.append(data_tmp)
    return data

def load_data():
    '''use the breast cancer dataset from sklearn'''
    print('loading data......')
    dataSet = load_breast_cancer()
    data = dataSet.data
    target = dataSet.target
    # relabel the classes as +1/-1 for the voting scheme
    for i in range(len(target)):
        if target[i] == 0:
            target[i] = -1
    dataframe = pd.DataFrame(data)
    dataframe.insert(np.shape(data)[1], 'target', target)
    dataMat = np.mat(dataframe)
    X_train, X_test, y_train, y_test = train_test_split(dataMat[:, 0:-1], dataMat[:, -1], test_size=0.3, random_state=0)
    data_train = np.hstack((X_train, y_train))
    data_train = data_train.tolist()
    X_test = X_test.tolist()
    return data_train, X_test, y_test
```
load_data splits the data 70/30 into training and test sets. Then come the prediction function and the accuracy function:
```python
def get_predict(self, trees_result, trees_feature, data_train):
    '''Predict with the whole forest.
    input: trees_result, trees_feature, data
    output: final_predict'''
    decisionTree = tree.decision_tree()
    m_tree = len(trees_result)
    m = np.shape(data_train)[0]
    result = []
    for i in range(m_tree):
        clf = trees_result[i]
        feature = trees_feature[i]
        # restrict the data to the features this tree was trained on
        data = tool.split_data(data_train, feature)
        result_i = []
        for j in range(m):
            result_i.append(list(decisionTree.predict(data[j][0:-1], clf).keys())[0])
        result.append(result_i)
    # sum the +1/-1 votes of all trees; the sign is the majority vote
    final_predict = np.sum(result, axis=0)
    return final_predict

def cal_correct_rate(self, target, final_predict):
    m = len(final_predict)
    corr = 0.0
    for i in range(m):
        # correct when the label and the net vote have the same sign
        if target[i] * final_predict[i] > 0:
            corr += 1
    return corr / m
```
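The voting in `get_predict` can be seen on a toy example: with +1/-1 votes, summing along axis 0 gives the net vote per sample, and its sign is the majority decision.

```python
import numpy as np

# three trees' +1/-1 votes on four samples
result = [[ 1,  1, -1, -1],
          [ 1, -1, -1,  1],
          [ 1,  1,  1, -1]]
final_predict = np.sum(result, axis=0)
print(final_predict)  # [ 3  1 -1 -1] -> majority classes +1, +1, -1, -1
```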
This is much like the earlier decision tree code and reuses it directly. Finally, the entry function:
```python
import matplotlib.pyplot as plt

def running():
    '''entrance'''
    data_train, text, target = load_data()
    forest = randomForest()
    predic = []
    for i in range(1, 20):
        trees, features = forest.random_forest(data_train, i)
        predictions = forest.get_predict(trees, features, text)
        accuracy = forest.cal_correct_rate(target, predictions)
        print('The forest has ', i, 'tree', 'Accuracy : ', accuracy)
        predic.append(accuracy)
    plt.xlabel('Number of tree')
    plt.ylabel('Accuracy')
    plt.title('The relationship between tree number and accuracy')
    plt.plot(range(1, 20), predic, color='orange')
    plt.show()

if __name__ == '__main__':
    running()
```
This computes the accuracy for forests of 1 to 19 trees and plots how accuracy changes with the number of trees.
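As a sanity check (my addition, not part of the original code), sklearn's own RandomForestClassifier on the same split gives a reference accuracy to compare against:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=19, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically well above 0.9 on this dataset
```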
All the code is on GitHub: github.com/GreenArrow2…