Official documentation: https://spark.apache.org/docs/2.2.0/ml-clustering.html
This section covers the clustering algorithms in MLlib.
Contents: K-means, Latent Dirichlet Allocation (LDA), Bisecting k-means, Gaussian Mixture Model (GMM)
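All of the examples below assume an active SparkSession bound to the name `spark`, as in the official examples. A minimal setup sketch (the app name is an arbitrary choice, not from the source):

```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame API; the examples below rely on
# an existing session named `spark`.
spark = SparkSession.builder \
    .appName("ml-clustering-examples") \
    .getOrCreate()
```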
K-means is one of the most commonly used clustering algorithms; it groups the data points into a predefined number of clusters.
KMeans is implemented as an Estimator and generates a KMeansModel as the base model.
Input Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | features | Feature vector |

Output Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | prediction | Predicted cluster center |
```python
from pyspark.ml.clustering import KMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
```
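The prediction column from the output-columns table is produced by applying the fitted model as a transformer. A minimal sketch, reusing `model` and `dataset` from the example above:

```python
# Assign each point to its nearest cluster center; this adds the
# "prediction" column described in the output-columns table.
predictions = model.transform(dataset)
predictions.select("features", "prediction").show(truncate=False)
```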
LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and it generates an LDAModel as the base model; expert users may, if needed, cast the LDAModel produced by EMLDAOptimizer to a DistributedLDAModel.
```python
from pyspark.ml.clustering import LDA

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Trains a LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result.
transformed = model.transform(dataset)
transformed.show(truncate=False)
```
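The EMLDAOptimizer path mentioned above is selected through the `optimizer` parameter; in PySpark, fitting with `optimizer="em"` already returns a DistributedLDAModel, so no explicit cast is needed. A minimal sketch, reusing `dataset` from the example above:

```python
from pyspark.ml.clustering import LDA

# With the EM optimizer, fit() returns a DistributedLDAModel.
em_lda = LDA(k=10, maxIter=10, optimizer="em")
em_model = em_lda.fit(dataset)
print(em_model.isDistributed())  # True for the EM optimizer

# If needed, the distributed model can be converted to a local one.
local_model = em_model.toLocal()
```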
Bisecting k-means is a hierarchical clustering algorithm that uses a divisive (top-down) approach: all data points start in a single cluster, and the data is split recursively until the specified number of clusters is reached.
Bisecting k-means is generally faster than K-means, but it produces different clusterings.
BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.
Unlike K-means, the final result of bisecting K-means does not depend on the choice of initial cluster centers, which is why the results of the two algorithms usually differ.
```python
from pyspark.ml.clustering import BisectingKMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)
```
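As with K-means, cluster assignments come from `transform()`; and, assuming Spark ≥ 2.1 (where BisectingKMeansModel gained a training summary), the per-cluster sizes can be read back. A minimal sketch reusing `model` and `dataset` from the example above:

```python
# Cluster assignments for the training data ("prediction" column).
model.transform(dataset).show(truncate=False)

# Number of points per cluster, from the training summary.
if model.hasSummary:
    print(model.summary.clusterSizes)
```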
A Gaussian Mixture Model (GMM) represents a composite distribution in which points are drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization (EM) algorithm to induce the maximum-likelihood model from the given data.
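Spelled out (a standard formulation, not written explicitly in the source), the mixture density is a weighted sum of $k$ Gaussian components, and EM alternates between estimating each point's component responsibilities and re-fitting the weights, means, and covariances:

$$p(\mathbf{x}) = \sum_{i=1}^{k} w_i \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad \sum_{i=1}^{k} w_i = 1$$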
Input Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
featuresCol | Vector | features | Feature vector |

Output Columns:

Param name | Type(s) | Default | Description |
---|---|---|---|
predictionCol | Int | prediction | Predicted cluster center |
probabilityCol | Vector | probability | Probability of each cluster |
```python
from pyspark.ml.clustering import GaussianMixture

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a Gaussian Mixture Model.
gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)
```
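Besides `gaussiansDF`, the fitted model exposes the mixing weights, and `transform()` fills the prediction and probability columns from the output-columns table. A minimal sketch reusing `model` and `dataset` from the example above:

```python
# Mixing weights of the k Gaussian components (they sum to 1).
print("Weights: " + str(model.weights))

# Adds the "prediction" and "probability" columns.
model.transform(dataset).select("prediction", "probability").show(truncate=False)
```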