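Hello everyone. The hands-on case I am sharing here is a parallelized algorithm for K-frequent itemset mining. Anyone who works in data mining is probably familiar with frequent-itemset algorithms; here we implement it with the Apriori algorithm plus some optimizations.
First, the experimental result: on a 2 GB dataset of about 18 million records, mining the frequent itemsets of sizes 1 through 8 took only 18 seconds, which is not bad.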
The problem statement:
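This problem is harder than Problem 4. When writing the program, make sure it is actually parallelized rather than a single-machine program that runs only on the client (driver) side; otherwise its performance will shock you. The algorithm also needs some optimization. Here I mainly want to share the algorithmic approach and the optimizations.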
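I will not go into the details of implementing Apriori itself; a quick Baidu search turns up plenty of material. The Spark implementation has two main phases. The first is the outer loop that computes the frequent itemsets of each size in turn. The second generates, from the frequent i-itemsets, the candidate set for the (i+1)-itemsets.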
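To make the two phases concrete, here is a minimal sketch on plain local Scala collections (no Spark). It is not the code from this post: the names `AprioriSketch`, `candidates`, and `mine` are hypothetical, and itemsets are assumed to be stored as ascending `Vector[Int]` values.

```scala
object AprioriSketch {
  type Itemset = Vector[Int] // items kept in ascending order (assumption)

  // Phase 2: join frequent k-itemsets that share their first k-1 items,
  // then prune any candidate that has an infrequent k-subset.
  def candidates(frequent: Set[Itemset]): Set[Itemset] = {
    val joined = for {
      a <- frequent
      b <- frequent
      if a.init == b.init && a.last < b.last
    } yield a :+ b.last
    joined.filter { c =>
      c.indices.forall(i => frequent.contains(c.take(i) ++ c.drop(i + 1)))
    }
  }

  // Phase 1: the level-wise loop over itemset sizes 1..maxK.
  def mine(transactions: Seq[Set[Int]], minSupport: Int, maxK: Int): Map[Itemset, Int] = {
    // Count each candidate and keep only those meeting the support threshold.
    def count(cands: Set[Itemset]): Map[Itemset, Int] =
      cands.iterator
        .map(c => c -> transactions.count(t => c.forall(t.contains)))
        .filter(_._2 >= minSupport)
        .toMap

    var level = count(transactions.flatten.distinct.map(i => Vector(i)).toSet)
    var all = level
    var k = 2
    while (k <= maxK && level.nonEmpty) {
      level = count(candidates(level.keySet))
      all ++= level
      k += 1
    }
    all
  }
}
```

In a Spark version, the counting step inside `count` is the part that has to run on the cluster, while candidate generation can stay on the driver because the frequent itemsets of one level are comparatively small.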
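The algorithm can be optimized as follows:
- Observe the data first. This matters a lot: inspection shows that the data contains a large number of duplicate transactions. As the saying goes, effort in the wrong direction is wasted, so compress the duplicates first; otherwise a lot of work is done for nothing.
- Make sure the algorithm is designed to run in parallel. You may wonder: isn't Spark parallel already? But with a little carelessness you can end up with an algorithm that runs only on the client side.
- Because the data volume is large, remember to persist intermediate data and to use Broadcast variables where appropriate.
- Optimize the data structure. BitSet is an excellent choice: it needs only one bit per integer to record whether that integer is present, which suits this dataset well because all the items are integers.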
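The first and last points can be illustrated with a small sketch on local collections. It assumes each input string is a space-separated list of integer items with the transaction ID already removed; `CompressSketch`, `dedupe`, `toBitSet`, and `containsAll` are hypothetical names chosen for illustration.

```scala
import java.util.BitSet

object CompressSketch {
  // Collapse identical transactions into (items, multiplicity) pairs so that
  // each distinct transaction is scanned only once during counting.
  def dedupe(lines: Seq[String]): Map[String, Int] =
    lines.groupBy(identity).map { case (items, copies) => items -> copies.size }

  // Encode a transaction as a BitSet: bit i is set when item i is present.
  def toBitSet(items: String): BitSet = {
    val bits = new BitSet()
    items.trim.split("\\s+").foreach(s => bits.set(s.toInt))
    bits
  }

  // Subset test: the candidate is contained in the transaction when no
  // candidate bit remains after clearing every bit the transaction has.
  def containsAll(transaction: BitSet, candidate: BitSet): Boolean = {
    val rest = candidate.clone().asInstanceOf[BitSet]
    rest.andNot(transaction)
    rest.isEmpty
  }
}
```

When a deduplicated transaction matches a candidate, its multiplicity is added to the candidate's count instead of 1, so the totals stay the same as on the raw data.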
The source code of the implementation follows:

    import scala.collection.mutable.ArrayBuffer
    import java.util.BitSet
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark._

    object FrequentItemset {
      def main(args: Array[String]): Unit = {
        if (args.length != 2) {
          println("Usage: <Datapath> <Output>")
          sys.exit(1) // stop here instead of failing later on args(0)
        }
        // Initialize the SparkContext; master and app name come from spark-submit
        val sc = new SparkContext()
        val SUPPORT_NUM = 15278611 // total transactions num = 17974836, SUPPORT_NUM = num * 0.85
        val TRANSACTION_NUM = 17974836.0
        val K = 8
        // Strip the transaction ID, merge identical transactions into (items, count),
        // then encode each distinct transaction as a BitSet.
        val transactions = sc.textFile(args(0))
          .map(line => line.substring(line.indexOf(" ") + 1).trim)
          .map((_, 1))
          .reduceByKey(_ + _)
          .map { case (items, count) =>
            val bitSet = new BitSet()
            items.split(" ").filter(_.nonEmpty).foreach(item => bitSet.set(item.toInt))
            (bitSet, count)
          }.cache()
        // Frequent 1-itemsets; fi stands for "frequent itemsets"
        var fi = transactions.flatMap { case (bitSet, count) =>
          val tmp = new ArrayBuffer[(String, Int)]
          var i = bitSet.nextSetBit(0) // visit only the bits that are set
          while (i >= 0) {
            tmp += ((i.toString, count))
            i = bitSet.nextSetBit(i + 1)
          }
          tmp
        }.reduceByKey(_ + _).filter(_._2 >= SUPPORT_NUM).cache()
        fi.map { case (itemset, count) => itemset + ":" + count / TRANSACTION_NUM }
          .saveAsTextFile(args(1) + "/result-1")
        for (i <- 2 to K) {
          // Candidates are built on the driver and shipped to the executors once
          val candidateFI = getCandidateFI(fi.map(_._1).collect(), i)
          val bccFI = sc.broadcast(candidateFI)
          // Count, in parallel, the transactions that contain each candidate
          fi = transactions.flatMap { case (bitSet, count) =>
            val tmp = new ArrayBuffer[(String, Int)]()
            bccFI.value.foreach { itemset =>
              if (itemset.split(",").forall(item => bitSet.get(item.toInt))) tmp += ((itemset, count))
            }
            tmp
          }.reduceByKey(_ + _).filter(_._2 >= SUPPORT_NUM).cache()
          fi.map { case (itemset, count) => itemset + ":" + count / TRANSACTION_NUM }
            .saveAsTextFile(args(1) + "/result-" + i)
          bccFI.unpersist()
        }
      }

      // Build the candidate k-itemsets (k = tag) from the frequent (k-1)-itemsets
      def getCandidateFI(f: Array[String], tag: Int): Array[String] = {
        val separator = ","
        val frequent = f.toSet // constant-time membership test for pruning
        val arrayBuffer = ArrayBuffer[String]()
        for (i <- 0 until f.length; j <- i + 1 until f.length) {
          var tmp = ""
          if (2 == tag) {
            tmp = Array(f(i), f(j)).sortBy(_.toInt).mkString(separator)
          } else if (f(i).substring(0, f(i).lastIndexOf(',')) == f(j).substring(0, f(j).lastIndexOf(','))) {
            // Same prefix: join by appending the last item of f(j) to f(i)
            tmp = (f(i) + f(j).substring(f(j).lastIndexOf(',')))
              .split(separator).sortBy(_.toInt).mkString(separator)
          }
          if (tmp.nonEmpty) {
            val items = tmp.split(separator)
            // Keep the candidate only if every (k-1)-subset is itself frequent
            val allSubsetsFrequent = items.indices.forall { skip =>
              frequent.contains(items.indices.filter(_ != skip).map(items(_)).mkString(separator))
            }
            if (allSubsetsFrequent) arrayBuffer += tmp
          }
        }
        arrayBuffer.toArray
      }
    }

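The hard-coded support threshold can be reproduced from the numbers in the code comment. The sketch below assumes the threshold is the relative support (0.85) times the transaction total, rounded up; `SupportSketch`, `minCount`, and `resultLine` are hypothetical helpers, not part of the program above.

```scala
object SupportSketch {
  // Smallest absolute count that still meets the relative support threshold.
  def minCount(total: Long, minSupport: Double): Long =
    math.ceil(total * minSupport).toLong

  // One output line in the "itemset:support" shape written by the job.
  def resultLine(itemset: String, count: Long, total: Long): String =
    itemset + ":" + count.toDouble / total
}
```

With a total of 17974836 and a support of 0.85, the product is 15278610.6, which rounds up to the 15278611 used as `SUPPORT_NUM`.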
That is all for now. Suggestions and comments are welcome. (By Lao Yang; please credit the source when reposting.)