Clustering Algorithms in Spark

Spark - Clustering

Official documentation: https://spark.apache.org/docs/2.2.0/ml-clustering.html

This section covers the clustering algorithms in MLlib.

Contents:

  • K-means
    • Input columns
    • Output columns
  • Latent Dirichlet allocation (LDA)
  • Bisecting k-means
  • Gaussian Mixture Model (GMM)
    • Input columns
    • Output columns

K-means

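k-means is one of the most commonly used clustering algorithms. It groups the data points into a predefined number of clusters.

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.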
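
To make the idea concrete, here is a minimal pure-Python sketch of the assign-then-update loop (Lloyd's iteration) that k-means is built on. It is an illustration only and not Spark's implementation; the helper names `assign`, `update`, and `lloyd`, the starting centers, and the toy points are all made up for this sketch.

```python
import math


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # An empty cluster keeps its previous center.
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def lloyd(points, centers, max_iter=20):
    """Alternate assignment and update until the labels stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


# Toy data: two well-separated groups (hypothetical values).
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, labels = lloyd(points, centers=[(0.0, 0.0), (1.0, 1.0)])
print(labels)
print(centers)
```

The number of clusters is fixed by the number of starting centers, which mirrors the role of `setK` in the Spark API.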

Input columns

Param name       Type(s)   Default       Description
featuresCol      Vector    features      Feature vector

Output columns

Param name       Type(s)   Default       Description
predictionCol    Int       prediction    Predicted cluster center

Example

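from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the later examples assume this same `spark` object.
spark = SparkSession.builder.appName("ClusteringExamples").getOrCreate()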

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
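
For intuition about the value printed by the example above, the Within Set Sum of Squared Errors is the sum, over all points, of the squared distance from each point to its nearest cluster center. A minimal pure-Python sketch follows; the function name `wssse` and the toy data are illustrative and not part of the Spark API.

```python
def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Hypothetical points and centers, chosen so every point is 0.1 away per coordinate.
points = [(0.0, 0.0), (0.2, 0.2), (9.0, 9.0), (9.2, 9.2)]
centers = [(0.1, 0.1), (9.1, 9.1)]
print(wssse(points, centers))
```

Each of the four points contributes about 0.02, so the total is about 0.08. A lower value means tighter clusters, which is why this quantity is often compared across different values of k.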

LDA

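LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and it generates an LDAModel as the base model. Expert users may cast an LDAModel generated by EMLDAOptimizer to a DistributedLDAModel if needed.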

from pyspark.ml.clustering import LDA

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)
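
The two bounds printed by the example above are closely related. The sketch below assumes the usual relation, log-perplexity = -(log-likelihood bound) / (total number of tokens in the corpus); this is an assumption of the sketch rather than something stated in the text. The function name `log_perplexity` and the numbers are hypothetical.

```python
def log_perplexity(log_likelihood, docs):
    """Per-token log-perplexity from a corpus-level log-likelihood bound.

    docs is a list of term-count vectors, one per document.
    """
    token_count = sum(sum(doc) for doc in docs)
    return -log_likelihood / token_count


# Toy corpus: two documents over a three-term vocabulary, 10 tokens in total.
docs = [[1.0, 2.0, 0.0], [0.0, 3.0, 4.0]]
ll = -25.0  # hypothetical log-likelihood bound for the whole corpus
lp = log_perplexity(ll, docs)
print(lp)
```

Under this relation a higher log-likelihood gives a lower perplexity, so the two numbers move in opposite directions when models are compared on the same corpus.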

Bisecting k-means

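Bisecting k-means is a hierarchical clustering algorithm that uses a divisive (top-down) approach: all data points start in a single cluster, and the data is split recursively until the number of clusters reaches the specified value.

Bisecting k-means is often faster than regular K-means, but it generally produces a different clustering.

BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.

Because it builds clusters by repeated splitting rather than by optimizing all centers at once, bisecting K-means is less sensitive to the choice of initial cluster centers than K-means, which is also why the two usually produce different results.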
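
The divisive procedure described above can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: the rule for choosing which cluster to split (largest sum of squared errors) and the deterministic seeding inside `two_means` are assumptions of this sketch, and all function names are hypothetical.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sse(points):
    """Sum of squared distances from the points to their mean."""
    m = mean(points)
    return sum(math.dist(p, m) ** 2 for p in points)


def two_means(points, max_iter=20):
    """Split one cluster in two with a plain 2-means loop."""
    centers = [min(points), max(points)]  # deterministic seed (sketch assumption)
    halves = None
    for _ in range(max_iter):
        new_halves = ([], [])
        for p in points:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            new_halves[nearer].append(p)
        if new_halves == halves:
            break
        halves = new_halves
        centers = [mean(h) if h else c for h, c in zip(halves, centers)]
    return [h for h in halves if h]


def bisecting_kmeans(points, k):
    """Start with one cluster and keep splitting until there are k clusters."""
    clusters = [list(points)]
    while len(clusters) < k:
        # Only clusters with at least two distinct points can be split.
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        target = max(splittable, key=sse)  # split the loosest cluster first
        parts = two_means(target)
        if len(parts) < 2:
            break
        clusters.remove(target)
        clusters.extend(parts)
    return clusters


# Toy data with three visible groups (hypothetical values).
points = [(0.0, 0.0), (0.2, 0.2), (5.0, 5.0), (5.2, 5.2), (9.0, 9.0), (9.2, 9.2)]
clusters = bisecting_kmeans(points, k=3)
print(sorted(clusters))
```

On this data the first split separates the group near the origin from the rest, and the second split divides the remaining four points, which shows how the hierarchy is built one bisection at a time.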

from pyspark.ml.clustering import BisectingKMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)

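Gaussian Mixture Model (GMM)

A Gaussian Mixture Model represents a composite distribution in which each point is drawn from one of k Gaussian sub-distributions, each with its own probability. The spark.ml implementation uses the expectation-maximization (EM) algorithm to induce the maximum-likelihood model from the given data.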

Input columns

Param name       Type(s)   Default       Description
featuresCol      Vector    features      Feature vector

Output columns

Param name       Type(s)   Default       Description
predictionCol    Int       prediction    Predicted cluster center
probabilityCol   Vector    probability   Probability of each cluster
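
To illustrate what the probability vector in the table above contains, the sketch below computes, for a one-dimensional mixture, the posterior probability of each Gaussian component given a point (the E-step of EM) and takes the most probable component as the prediction. It is a simplified illustration under assumed parameters, not Spark's code; the function names, weights, means, and variances are hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def cluster_probabilities(x, weights, means, variances):
    """Posterior probability of each mixture component given x."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-component mixture with equal weights.
weights, means, variances = [0.5, 0.5], [0.0, 9.0], [1.0, 1.0]

probs = cluster_probabilities(0.2, weights, means, variances)
prediction = max(range(len(probs)), key=probs.__getitem__)
print(probs, prediction)

# A point halfway between the two means is equally likely under both components.
midpoint_probs = cluster_probabilities(4.5, weights, means, variances)
print(midpoint_probs)
```

Unlike the hard assignment in k-means, every point receives a full probability vector here, and the predicted cluster is simply the entry with the highest probability.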

Example

from pyspark.ml.clustering import GaussianMixture

# loads data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)