The synthetic control dataset, synthetic_control.data, can be downloaded from here. It consists of 600 rows by 60 columns of double values: 600 tuples, each of which is one time series.
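To make the 600 x 60 layout concrete, here is a minimal sketch of turning one text record into a numeric series. It assumes the values on a line are separated by whitespace; the class and method names (SyntheticControlLine, parse) are hypothetical and not part of Mahout.

```java
import java.util.Arrays;

final class SyntheticControlLine {
    // Splits one whitespace-separated text record into its numeric samples.
    // Assumption: values are separated by spaces or tabs.
    static double[] parse(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                     .mapToDouble(Double::parseDouble)
                     .toArray();
    }

    public static void main(String[] args) {
        // A shortened, made-up record; a real row would hold 60 values.
        double[] series = parse("28.7812 34.4632 31.3381");
        System.out.println(series.length + " samples read");
    }
}
```

Each parsed row corresponds to one vector handed to the clustering job.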
1. Copy the data to the cluster and place it in the kmeans/ directory.
hadoop fs -put synthetic_control.data kmeans/synthetic_control.data
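2. Run the following mahout command to perform the KMeans clustering analysis.
When the command includes the --numClusters parameter (the number of clusters in the result), KMeans clustering is used directly. If this parameter is not set, Canopy clustering runs first; -t1 and -t2 are the configuration parameters for Canopy clustering.
Reading the Mahout source shows that a KMeans clustering run goes through four steps.
The first two steps are preparation for the KMeans algorithm itself.
The overall flow can be traced in the org.apache.mahout.clustering.syntheticcontrol.kmeans.Job#run() method.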
public static void run(Configuration conf, Path input, Path output, DistanceMeasure measure, int k,
    double convergenceDelta, int maxIterations) throws Exception {
  // 1. Convert synthetic_control.data from its text format into key/value format and store it
  //    under output/data. The key is a Text holding an integer; the value is a VectorWritable.
  Path directoryContainingConvertedInput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);
  log.info("Preparing Input");
  InputDriver.runJob(input, directoryContainingConvertedInput, "org.apache.mahout.math.RandomAccessSparseVector");

  // 2. Randomly generate k clusters and store them in output/clusters-0/part-randomSeed.
  //    The key is a Text; the value is a ClusterWritable.
  log.info("Running random seed to get initial clusters");
  Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
  clusters = RandomSeedGenerator.buildRandom(conf, directoryContainingConvertedInput, clusters, k, measure);

  // 3. Run the clustering iterations, recomputing the centroid of every cluster.
  log.info("Running KMeans");
  KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output, measure, convergenceDelta,
      maxIterations, true, 0.0, false);

  // 4. Using the centers chosen above, assign the records in output/data to the clusters.
  //    Then print the result, converting the sequence file format into readable text.
  // run ClusterDumper
  ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-*-final"),
      new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(null);
}
public static Path buildClusters(Configuration conf, Path input, Path clustersIn, Path output,
    DistanceMeasure measure, int maxIterations, String delta, boolean runSequential)
    throws IOException, InterruptedException, ClassNotFoundException {
  double convergenceDelta = Double.parseDouble(delta);

  // Read the Cluster data from output/clusters-0/part-randomSeed into the clusters variable.
  List<Cluster> clusters = Lists.newArrayList();
  KMeansUtil.configureWithClusterInfo(conf, clustersIn, clusters);
  if (clusters.isEmpty()) {
    throw new IllegalStateException("No input clusters found in " + clustersIn + ". Check your -c argument.");
  }

  // Write the clustering policy (which controls convergence) to output/clusters-0/_policy.
  // At the same time, each cluster gets its own part-000xx file under output/clusters-0/.
  Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
  ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
  ClusterClassifier prior = new ClusterClassifier(clusters, policy);
  prior.writeToSeqFiles(priorClustersPath);

  // Run up to maxIterations iterations, sequentially or as Map/Reduce jobs.
  if (runSequential) {
    ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
  } else {
    ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
  }
  return output;
}
The code of the Job that adjusts the cluster centers is as follows:
public static void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)
    throws IOException, InterruptedException, ClassNotFoundException {
  ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);
  Path clustersOut = null;
  int iteration = 1;
  while (iteration <= numIterations) {
    conf.set(PRIOR_PATH_KEY, priorPath.toString());

    String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;
    Job job = new Job(conf, jobName);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(ClusterWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(ClusterWritable.class);

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    // The core of the algorithm lives in CIMapper and CIReducer.
    job.setMapperClass(CIMapper.class);
    job.setReducerClass(CIReducer.class);

    FileInputFormat.addInputPath(job, inPath);
    clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration);
    priorPath = clustersOut;
    FileOutputFormat.setOutputPath(job, clustersOut);

    job.setJarByClass(ClusterIterator.class);
    if (!job.waitForCompletion(true)) {
      throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);
    }
    ClusterClassifier.writePolicy(policy, clustersOut);
    FileSystem fs = FileSystem.get(outPath.toUri(), conf);
    iteration++;
    if (isConverged(clustersOut, conf, fs)) {
      break;
    }
  }
  // Rename the output directory of the last iteration by appending the "final" suffix.
  Path finalClustersIn = new Path(outPath, Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);
  FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);
}
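The shape of this driver loop (run one pass, then stop early once the centers no longer move) can be mirrored in a single-process sketch. This is not Mahout code: the class KMeansLoopSketch and its methods are hypothetical, Euclidean distance is assumed, and the convergence test (every center moved by at most delta) is a simplified stand-in for whatever isConverged() checks on the cluster files.

```java
final class KMeansLoopSketch {
    // Euclidean distance between two points of equal dimension (assumed measure).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    // One pass: assign every point to its nearest center, then recompute the centers.
    static double[][] iterate(double[][] points, double[][] centers) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (distance(p, centers[c]) < distance(p, centers[best])) {
                    best = c;
                }
            }
            counts[best]++;
            for (int d = 0; d < dim; d++) {
                sums[best][d] += p[d];
            }
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {
            next[c] = centers[c].clone(); // an empty cluster keeps its old center
            for (int d = 0; d < dim && counts[c] > 0; d++) {
                next[c][d] = sums[c][d] / counts[c];
            }
        }
        return next;
    }

    // Repeats the pass up to maxIterations times, stopping once no center moves more than delta.
    static double[][] run(double[][] points, double[][] initial, double delta, int maxIterations) {
        double[][] centers = initial;
        for (int iteration = 1; iteration <= maxIterations; iteration++) {
            double[][] next = iterate(points, centers);
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (distance(centers[c], next[c]) > delta) {
                    converged = false;
                }
            }
            centers = next;
            if (converged) {
                break;
            }
        }
        return centers;
    }
}
```

In the Map/Reduce version each pass is one Hadoop job, and the centers are carried between passes through the clusters-N directories rather than through a local variable.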
The CIMapper code is as follows:
@Override
protected void map(WritableComparable<?> key, VectorWritable value, Context context)
    throws IOException, InterruptedException {
  Vector probabilities = classifier.classify(value.get());
  Vector selections = policy.select(probabilities);
  for (Iterator<Element> it = selections.iterateNonZero(); it.hasNext();) {
    Element el = it.next();
    classifier.train(el.index(), value.get(), el.get());
  }
}
Two classes need to be told apart here:
org.apache.mahout.clustering.iterator.KMeansClusteringPolicy
and
org.apache.mahout.clustering.classify.ClusterClassifier
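The former is the clustering policy; it is fair to say that it provides the core of the clustering algorithm.
The latter is the cluster classifier; its job is to classify the data according to the clustering policy.
ClusterClassifier.classify() scores a point against the centers of all clusters and returns the results as a vector with one entry per cluster. The version below, which takes the ClusterClassifier as its prior argument, does the actual work: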
@Override
public Vector classify(Vector data, ClusterClassifier prior) {
  List<Cluster> models = prior.getModels();
  int i = 0;
  Vector pdfs = new DenseVector(models.size());
  for (Cluster model : models) {
    pdfs.set(i++, model.pdf(new VectorWritable(data)));
  }
  return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
}
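The org.apache.mahout.clustering.iterator.DistanceMeasureCluster.pdf(VectorWritable) call in the code above turns the distance from the point to the cluster centroid into a similarity score, so a closer centroid yields a larger value. Its code is as follows: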
@Override
public double pdf(VectorWritable vw) {
  return 1 / (1 + measure.distance(vw.get(), getCenter()));
}
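pdfs.zSum() is the sum of the elements of the pdfs vector; multiplying every element by its reciprocal normalizes pdfs so that the entries sum to 1.
Finally, select() picks the index of the cluster with the highest similarity and assigns it a weight of 1.0, as shown below: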
@Override
public Vector select(Vector probabilities) {
  int maxValueIndex = probabilities.maxValueIndex();
  Vector weights = new SequentialAccessSparseVector(probabilities.size());
  weights.set(maxValueIndex, 1.0);
  return weights;
}
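The scoring and selection steps can be traced with plain arrays. The sketch below is an illustration only, not Mahout code: the class ClassifyAndSelectSketch is hypothetical, Euclidean distance is assumed as the measure, and ties in select() go to the lowest index, which is an assumption about tie handling rather than a statement about maxValueIndex().

```java
final class ClassifyAndSelectSketch {
    // Euclidean distance (assumed measure for this sketch).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    // Scores the point against every center with 1 / (1 + distance),
    // then scales the scores so that they sum to 1.
    static double[] classify(double[] point, double[][] centers) {
        double[] pdfs = new double[centers.length];
        double sum = 0.0;
        for (int i = 0; i < centers.length; i++) {
            pdfs[i] = 1.0 / (1.0 + distance(point, centers[i]));
            sum += pdfs[i];
        }
        for (int i = 0; i < pdfs.length; i++) {
            pdfs[i] *= 1.0 / sum;
        }
        return pdfs;
    }

    // Returns a weight vector with 1.0 at the index of the largest score and 0.0 elsewhere.
    static double[] select(double[] probabilities) {
        int maxIndex = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[maxIndex]) {
                maxIndex = i;
            }
        }
        double[] weights = new double[probabilities.length];
        weights[maxIndex] = 1.0;
        return weights;
    }
}
```

For a point at (0, 0) and centers at (0, 0), (3, 4) and (0, 1), the distances are 0, 5 and 1, the raw scores are 1, 1/6 and 1/2, and the first cluster receives the full weight. Because select() keeps only the largest entry, the normalization does not change which cluster wins in KMeans; it only rescales the scores.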
Next, to obtain the new centers, org.apache.mahout.clustering.classify.ClusterClassifier.train(int, Vector, double) feeds each point to its selected cluster as training data. The data is ultimately accumulated in AbstractCluster:
public void observe(Vector x, double weight) {
  if (weight == 1.0) {
    observe(x);
  } else {
    setS0(getS0() + weight);
    Vector weightedX = x.times(weight);
    if (getS1() == null) {
      setS1(weightedX);
    } else {
      getS1().assign(weightedX, Functions.PLUS);
    }
    Vector x2 = x.times(x).times(weight);
    if (getS2() == null) {
      setS2(x2);
    } else {
      getS2().assign(x2, Functions.PLUS);
    }
  }
}
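The three accumulators updated by observe() are the weighted count (s0), the weighted sum of the points (s1) and the weighted sum of their element-wise squares (s2). A minimal array-based sketch of the same bookkeeping, with a hypothetical class name RunningSumsSketch and no connection to the Mahout classes, looks like this:

```java
final class RunningSumsSketch {
    double s0;          // sum of weights
    final double[] s1;  // weighted sum of the observed points
    final double[] s2;  // weighted sum of the element-wise squares

    RunningSumsSketch(int dimension) {
        s1 = new double[dimension];
        s2 = new double[dimension];
    }

    // Adds one weighted observation to the running sums.
    void observe(double[] x, double weight) {
        s0 += weight;
        for (int d = 0; d < x.length; d++) {
            s1[d] += weight * x[d];
            s2[d] += weight * x[d] * x[d];
        }
    }
}
```

Keeping only these sums means a cluster never has to store its points; the center and spread can be derived from s0, s1 and s2 once all observations are in.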
In CIReducer, the data belonging to the same cluster is merged and the centroid is computed.
@Override
protected void reduce(IntWritable key, Iterable<ClusterWritable> values, Context context)
    throws IOException, InterruptedException {
  Iterator<ClusterWritable> iter = values.iterator();
  Cluster first = iter.next().getValue(); // there must always be at least one
  while (iter.hasNext()) {
    Cluster cluster = iter.next().getValue();
    first.observe(cluster);
  }
  List<Cluster> models = Lists.newArrayList();
  models.add(first);
  classifier = new ClusterClassifier(models, policy);
  classifier.close();
  context.write(key, new ClusterWritable(first));
}
The code that computes the centroid is as follows:
@Override
public void computeParameters() {
  if (getS0() == 0) {
    return;
  }
  setNumObservations((long) getS0());
  setTotalObservations(getTotalObservations() + getNumObservations());
  setCenter(getS1().divide(getS0()));
  // compute the component stds
  if (getS0() > 1) {
    setRadius(getS2().times(getS0()).minus(getS1().times(getS1()))
        .assign(new SquareRootFunction()).divide(getS0()));
  }
  setS0(0);
  setS1(center.like());
  setS2(center.like());
}
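The arithmetic in computeParameters() is easier to follow per component: the center is s1 / s0, and the radius is sqrt(s0 * s2 - s1 * s1) / s0, which is the population standard deviation of that component. The sketch below reproduces the two formulas on plain arrays. The class CentroidFromSumsSketch is hypothetical, and the clamp to zero before the square root is an addition of this sketch (to guard against tiny negative values from rounding), not something taken from the Mahout code.

```java
final class CentroidFromSumsSketch {
    // center[d] = s1[d] / s0
    static double[] center(double s0, double[] s1) {
        double[] c = new double[s1.length];
        for (int d = 0; d < s1.length; d++) {
            c[d] = s1[d] / s0;
        }
        return c;
    }

    // radius[d] = sqrt(s0 * s2[d] - s1[d]^2) / s0
    static double[] radius(double s0, double[] s1, double[] s2) {
        double[] r = new double[s1.length];
        for (int d = 0; d < s1.length; d++) {
            // Clamp is this sketch's own safeguard against rounding error.
            double inner = Math.max(0.0, s0 * s2[d] - s1[d] * s1[d]);
            r[d] = Math.sqrt(inner) / s0;
        }
        return r;
    }
}
```

For the one-dimensional points 1 and 3, the sums are s0 = 2, s1 = 4 and s2 = 10, which gives a center of 2 and a radius of sqrt(2 * 10 - 16) / 2 = 1.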
The code that actually assigns the records in output/data to the clusters is:
private static void classifyClusterMR(Configuration conf, Path input, Path clustersIn, Path output,
    Double clusterClassificationThreshold, boolean emitMostLikely)
    throws IOException, InterruptedException, ClassNotFoundException {
  conf.setFloat(ClusterClassificationConfigKeys.OUTLIER_REMOVAL_THRESHOLD,
      clusterClassificationThreshold.floatValue());
  conf.setBoolean(ClusterClassificationConfigKeys.EMIT_MOST_LIKELY, emitMostLikely);
  conf.set(ClusterClassificationConfigKeys.CLUSTERS_IN, clustersIn.toUri().toString());

  Job job = new Job(conf, "Cluster Classification Driver running over input: " + input);
  job.setJarByClass(ClusterClassificationDriver.class);

  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);

  // Assign the records to clusters; this is a map-only job.
  job.setMapperClass(ClusterClassificationMapper.class);
  job.setNumReduceTasks(0);

  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(WeightedVectorWritable.class);

  FileInputFormat.addInputPath(job, input);
  FileOutputFormat.setOutputPath(job, output);
  if (!job.waitForCompletion(true)) {
    throw new InterruptedException("Cluster Classification Driver Job failed processing " + input);
  }
}