Hierarchical Clustering Analysis with Dendrograms

Date: 2021-01-04

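1. Purpose of the experiment

If you have never used a dendrogram before, it is a good way to see how multidimensional data group together. In this notebook I briefly explore hierarchical clustering and use a dendrogram to visualize the result.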

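2. The hierarchical clustering algorithm

Hierarchical clustering is one kind of cluster analysis, and scipy ships an implementation in scipy.cluster.hierarchy.

linkage literally means "to link". Hierarchical clustering is a process of repeated linking: starting from n records, the two closest clusters are merged again and again until everything has been joined into a single cluster, at which point the algorithm stops.

dendrogram is the function that draws the tree diagram.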
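To make the "repeated linking" idea concrete, here is a minimal sketch on four made-up one-dimensional points (the values are an assumption chosen so the merge order is easy to follow, not part of the seeds data). Each row of the result records one merge, so n points produce n - 1 rows.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical 1-D points (illustration only, not from the seeds data).
points = np.array([[0.0], [1.0], [5.0], [7.0]])

# Single linkage with Euclidean distance, the same defaults used in this post.
Z = linkage(points, method='single', metric='euclidean')

print(Z.shape)  # (3, 4): 4 points need 3 merges
for left, right, dist, size in Z:
    # left/right: merged cluster indices, dist: merge distance, size: samples in the new cluster
    print(int(left), int(right), dist, int(size))
# 0 1 1.0 2   -> points 0 and 1 merge first and become cluster 4
# 2 3 2.0 2   -> points 2 and 3 become cluster 5
# 4 5 4.0 4   -> clusters 4 and 5 merge; all 4 points are now in one cluster
```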

3. Experimental data

grain_variety is the label; the remaining columns hold the values of the various attributes (features).

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import pandas as pd
seeds_df = pd.read_csv('seeds-less-rows.csv')

seeds_df.head()

# Remove the grain_variety column and keep its values as a list of labels
varieties = list(seeds_df.pop('grain_variety'))

varieties
['Kama wheat',
 'Kama wheat',
 'Kama wheat',
 'Rosa wheat',
 'Rosa wheat',
 'Rosa wheat',
 'Rosa wheat',
 'Rosa wheat',
 'Canadian wheat',
 'Canadian wheat',
 'Canadian wheat',
 'Canadian wheat',
 'Canadian wheat',
 'Canadian wheat']
# Inspect the seeds_df data as a NumPy array
samples = seeds_df.values

print(samples)
print('samples shape', samples.shape)
[[14.88   14.57    0.8811  5.554   3.333   1.018   4.956 ]
 [14.69   14.49    0.8799  5.563   3.259   3.586   5.219 ]
 [14.03   14.16    0.8796  5.438   3.201   1.717   5.001 ]
 [19.31   16.59    0.8815  6.341   3.81    3.477   6.238 ]
 [17.99   15.86    0.8992  5.89    3.694   2.068   5.837 ]
 [18.85   16.17    0.9056  6.152   3.806   2.843   6.2   ]
 [19.38   16.72    0.8716  6.303   3.791   3.678   5.965 ]
 [17.36   15.76    0.8785  6.145   3.574   3.526   5.971 ]
 [13.32   13.94    0.8613  5.541   3.073   7.035   5.44  ]
 [11.43   13.13    0.8335  5.176   2.719   2.221   5.132 ]
 [11.26   13.01    0.8355  5.186   2.71    5.335   5.092 ]
 [12.46   13.41    0.8706  5.236   3.017   4.987   5.147 ]
 [11.81   13.45    0.8198  5.413   2.716   4.898   5.352 ]
 [11.23   12.88    0.8511  5.14    2.795   4.325   5.003 ]]
samples shape (14, 7)

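4. Hierarchical clustering of samples with linkage

X = linkage(y, method='single', metric='euclidean')

In scipy, y can be either a condensed distance matrix (the 1-D array that pdist returns) or an m×n matrix of observations. Here I simply pass the feature matrix, where the m rows are m records and the n columns are n features.

The returned X is an (m-1)×4 matrix. What each column means is explained with the mergings example that follows.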
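As a quick check of the two input forms, the sketch below reuses the first four rows of the samples array printed above and compares passing the observations directly with passing a condensed distance vector computed by scipy.spatial.distance.pdist. The variable names (obs, condensed, Z_from_obs, Z_from_dist) are mine, chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# First four rows of the samples array shown above.
obs = np.array([
    [14.88, 14.57, 0.8811, 5.554, 3.333, 1.018, 4.956],
    [14.69, 14.49, 0.8799, 5.563, 3.259, 3.586, 5.219],
    [14.03, 14.16, 0.8796, 5.438, 3.201, 1.717, 5.001],
    [19.31, 16.59, 0.8815, 6.341, 3.81, 3.477, 6.238],
])

# Form 1: pass the m x n observation matrix; linkage computes the distances itself.
Z_from_obs = linkage(obs, method='single', metric='euclidean')

# Form 2: pass a condensed distance vector with m * (m - 1) / 2 entries.
condensed = pdist(obs, metric='euclidean')
Z_from_dist = linkage(condensed, method='single')

print(condensed.shape)                       # (6,)
print(Z_from_obs.shape)                      # (3, 4), i.e. (m - 1, 4)
print(np.allclose(Z_from_obs, Z_from_dist))  # True
```

Passing the feature matrix is the shorter route; the condensed form is mainly useful when the distances have already been computed elsewhere.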

mergings = linkage(samples)

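# mergings has one fewer row than samples
print('samples shape',samples.shape)
print('mergings shape',mergings.shape)
samples shape (14, 7)
mergings shape (13, 4)
# Visualize the clustering; leaf labels are not rotated and use font size 10.
# Here we do not show the name label (varieties) of each record; numeric labels are shown by default.
dendrogram(mergings,
          leaf_rotation=0,
          leaf_font_size=10)

plt.show()
# The numbers shown in the plot are the finest-grained leaves; each one corresponds to a single sample.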
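If you want to see which leaf labels the plot would use without drawing anything, dendrogram accepts no_plot=True and returns the layout as a dictionary. The sketch below uses four made-up points and made-up labels ('a' to 'd'), both assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy data, not the seeds samples.
points = np.array([[0.0], [1.0], [5.0], [7.0]])
Z = linkage(points)

# no_plot=True skips the drawing and only returns the computed layout.
layout = dendrogram(Z, no_plot=True)
print(layout['leaves'])  # sample indices in left-to-right leaf order
print(layout['ivl'])     # the tick labels: by default the same indices as strings

# With labels=..., each leaf shows the label of the sample it represents.
names = ['a', 'b', 'c', 'd']
named = dendrogram(Z, no_plot=True, labels=names)
print(named['ivl'])
```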

mergings
array([[ 3.        ,  6.        ,  0.37233454,  2.        ],
       [11.        , 12.        ,  0.77366442,  2.        ],
       [10.        , 15.        ,  0.89804259,  3.        ],
       [ 5.        , 14.        ,  0.90978998,  3.        ],
       [13.        , 16.        ,  1.02732924,  4.        ],
       [ 0.        ,  2.        ,  1.18832161,  2.        ],
       [ 4.        , 17.        ,  1.28425969,  4.        ],
       [ 7.        , 20.        ,  1.62187345,  5.        ],
       [ 1.        , 19.        ,  2.02587613,  3.        ],
       [ 9.        , 18.        ,  2.13385537,  5.        ],
       [ 8.        , 23.        ,  2.323123  ,  6.        ],
       [22.        , 24.        ,  2.87625877,  9.        ],
       [21.        , 25.        ,  3.12231564, 14.        ]])

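Reading the dendrogram from top to bottom, you first meet the branches and then the leaves.

The first and second columns are the indices of the two clusters being merged, which can be leaves or branches. Indices 0 to 13 are the original samples (leaves); index 14 + i is the branch created by row i, so the 15 in the third row is the cluster formed by samples 11 and 12 in the second row.

The third column is the distance between the two merged clusters (leaf to leaf, leaf to branch, or branch to branch).

The fourth column is the number of samples (records) contained in the newly formed cluster.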
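The column meanings can be checked by replaying the merges. This sketch copies the first two columns and the fourth column of the mergings array above and rebuilds the membership of every branch; the names pairs, sizes and members are mine.

```python
# First two columns of the mergings array above: the pair merged in each row.
pairs = [(3, 6), (11, 12), (10, 15), (5, 14), (13, 16), (0, 2), (4, 17),
         (7, 20), (1, 19), (9, 18), (8, 23), (22, 24), (21, 25)]
# Fourth column of the same array: the size of each new cluster.
sizes = [2, 2, 3, 3, 4, 2, 4, 5, 3, 5, 6, 9, 14]

n = 14  # number of original samples (leaves 0..13)
members = {i: [i] for i in range(n)}

for row, (a, b) in enumerate(pairs):
    # The cluster created by row i gets index n + i.
    members[n + row] = sorted(members[a] + members[b])
    print(f"row {row}: cluster {n + row} = {members[n + row]}")

# The rebuilt cluster sizes agree with the fourth column.
print([len(members[n + row]) for row in range(len(pairs))] == sizes)  # True
```

Just before the last two merges, the three remaining clusters are 21 = [3, 4, 5, 6, 7], 22 = [0, 1, 2] and 24 = [8, 9, 10, 11, 12, 13], which line up with the Rosa, Kama and Canadian entries of the varieties list.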

5. Different hierarchical clustering methods

X = linkage(y, method='single', metric='euclidean')

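method specifies how the distance between two clusters is computed. Three choices are commonly used:

(1) single: nearest neighbor; the cluster distance is the smallest distance between a point of one cluster and a point of the other

(2) average: average distance; the mean of the distances over all cross-cluster pairs

(3) complete: farthest neighbor; the cluster distance is the largest distance between a point of one cluster and a point of the other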
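The three definitions can be written out by hand. The sketch below uses two made-up clusters A and B (assumed values, picked so the arithmetic is easy), computes the three cluster distances from the pairwise distance table, and checks that the final merge height reported by linkage matches for each method.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import linkage

# Two hypothetical clusters on a line (illustration only).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All cross-cluster pairwise distances: [[4, 6], [3, 5]]
d = cdist(A, B)

by_hand = {
    'single': d.min(),     # nearest pair   -> 3.0
    'average': d.mean(),   # mean of pairs  -> 4.5
    'complete': d.max(),   # farthest pair  -> 6.0
}

points = np.vstack([A, B])
for method, expected in by_hand.items():
    Z = linkage(points, method=method)
    # The last row is the merge of A and B; column 2 is its height.
    print(method, expected, Z[-1, 2])
```

In this toy case A and B are each merged internally first under all three methods, so the last row of every linkage matrix is exactly the A-to-B distance under that method.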

Let's write a hierarchical clustering function and see how the dendrogram changes with each method.

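def hierarchy_analysis(samples,method='single'):
    # Cluster the samples with the chosen linkage method
    mergings = linkage(samples, method=method)

    # Draw the dendrogram, labelling each leaf with its variety
    # (varieties is the list built in section 3)
    dendrogram(mergings,
              labels=varieties,
              leaf_rotation=45,
              leaf_font_size=10)
    plt.show()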
#single
hierarchy_analysis(samples,method='single')

#average
hierarchy_analysis(samples,method='average')

#complete
hierarchy_analysis(samples,method='complete')

https://mmbiz.qpic.cn/mmbiz_png/ibOFjxwickib47lrRJRzP5hrlsSibKIUXLGSecV1Hk3fp7RsRknibicnHNmBKCAfSXqXePqOmRDVeDVb80xIeXhGr9nA/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1

Because the dataset is so small, the plots produced by the complete and average methods look exactly the same.
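One way to compare methods numerically rather than by eye is to compare the linkage matrices themselves. The sketch below re-enters the 14 samples printed in section 3 and checks whether average and complete linkage merge the same pairs in the same order and at the same heights. This is my own check, not part of the original notebook, and it is stricter than "the trees look alike": two dendrograms can share a shape while their merge heights differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The 14 x 7 samples array printed in section 3.
samples = np.array([
    [14.88, 14.57, 0.8811, 5.554, 3.333, 1.018, 4.956],
    [14.69, 14.49, 0.8799, 5.563, 3.259, 3.586, 5.219],
    [14.03, 14.16, 0.8796, 5.438, 3.201, 1.717, 5.001],
    [19.31, 16.59, 0.8815, 6.341, 3.81, 3.477, 6.238],
    [17.99, 15.86, 0.8992, 5.89, 3.694, 2.068, 5.837],
    [18.85, 16.17, 0.9056, 6.152, 3.806, 2.843, 6.2],
    [19.38, 16.72, 0.8716, 6.303, 3.791, 3.678, 5.965],
    [17.36, 15.76, 0.8785, 6.145, 3.574, 3.526, 5.971],
    [13.32, 13.94, 0.8613, 5.541, 3.073, 7.035, 5.44],
    [11.43, 13.13, 0.8335, 5.176, 2.719, 2.221, 5.132],
    [11.26, 13.01, 0.8355, 5.186, 2.71, 5.335, 5.092],
    [12.46, 13.41, 0.8706, 5.236, 3.017, 4.987, 5.147],
    [11.81, 13.45, 0.8198, 5.413, 2.716, 4.898, 5.352],
    [11.23, 12.88, 0.8511, 5.14, 2.795, 4.325, 5.003],
])

# One linkage matrix per method.
Z = {m: linkage(samples, method=m) for m in ('single', 'average', 'complete')}

# Columns 0-1 hold the merge order, column 2 the merge heights.
same_order = np.array_equal(Z['average'][:, :2], Z['complete'][:, :2])
same_heights = np.allclose(Z['average'][:, 2], Z['complete'][:, 2])
print('same merge order:', same_order)
print('same merge heights:', same_heights)
```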

Getting the data and code

Link: https://pan.baidu.com/s/14jREwHEA3YN3LIrEHJOGhg Password: 69s7