Spark combineByKey

Date: 2019-11-10


Looking at the source code, combineByKey is defined as follows:

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
    : RDD[(K, C)] = {
    combineByKey(createCombiner, mergeValue, mergeCombiners, defaultPartitioner(self))
  }

Example:

Computing a per-group average with Spark

import org.apache.spark.{SparkConf, SparkContext}

object ColumnValueAvg extends App {
  /**
    * ID,Name,ADDRESS,AGE
    * 001,zhangsan,chaoyang,20
    * 002,zhangsa,chaoyang,27
    * 003,zhangjie,chaoyang,35
    * 004,lisi,haidian,24
    * 005,lier,haidian,40
    * 006,wangwu,chaoyang,90
    * 007,wangchao,haidian,80
    */
  val conf = new SparkConf().setAppName("test column value sum and avg").setMaster("local[1]")
  val sc = new SparkContext(conf)

  val textRdd = sc.textFile(args(0))

  // toInt is required here: without it the ages stay strings and `_ + _` would concatenate them
  val addressAgeMap = textRdd.map { line =>
    val fields = line.split(",")
    (fields(2), fields(3).toInt)
  }

  // Sum of ages per address
  addressAgeMap.reduceByKey(_ + _).collect().foreach(println)

  // Average age per address: the combiner is a (sum, count) pair
  val avgAgeResult = addressAgeMap.combineByKey(
    (v) => (v, 1),
    (accu: (Int, Int), v) => (accu._1 + v, accu._2 + 1),
    (accu1: (Int, Int), accu2: (Int, Int)) => (accu1._1 + accu2._1, accu1._2 + accu2._2)
  ).mapValues(x => x._1.toDouble / x._2) // convert before dividing to avoid integer division
  avgAgeResult.collect().foreach(println)

  println("Sum and Avg calculated successfully")

  sc.stop()

}

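combineByKey takes three functions as arguments: createCombiner, mergeValue, and mergeCombiners. Understanding what each one does is the key to using it.

In terms of the data, combineByKey combines elements by key, and all three functions operate on the values.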

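1> The first parameter is createCombiner, defined in the code as: (v) => (v, 1)

It creates a combiner. While a partition of the RDD is being traversed, the first time a key appears in that partition a combiner (v, 1) is generated for it. Here the key is the address, so when the first record chaoyang,20 is encountered, v in (v, 1) is the age 20 and 1 is the number of times the address has been seen so far.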

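2> The second parameter is mergeValue, which, as the name suggests, merges a value into a combiner. In the code it is defined as: (accu: (Int, Int), v) => (accu._1 + v, accu._2 + 1)
It is called when, within the current partition, a key that has already appeared is encountered again, and it merges the new value into that key's combiner. Note that accu: (Int, Int) is the combiner produced by the first parameter, i.e. (v, 1), so the types must match.
(accu._1 + v, accu._2 + 1) is then easy to read: accu._1 is the running sum of ages, to which the new age is added, and accu._2 is the number of times the key has appeared, which increases by 1 for each occurrence.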

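3> The third parameter is mergeCombiners, which merges the accumulators from the different partitions. Each partition runs the first two functions independently, so the per-partition results still have to be merged at the end.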

OK, run the code. The result is as follows: the average age computed per address.

(haidian,48.0)
(chaoyang,43.0)
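The result above can be traced by hand. The following is a minimal sketch in plain Scala (no Spark required) that applies the same three functions to the seven sample rows. The split into two partitions is an assumption made for illustration; Spark decides the real partitioning, and this is a model of the steps, not Spark's implementation.

```scala
object CombineByKeyTrace {
  // The three functions from the example, written as plain Scala values.
  val createCombiner: Int => (Int, Int) = v => (v, 1)
  val mergeValue: ((Int, Int), Int) => (Int, Int) =
    (acc, v) => (acc._1 + v, acc._2 + 1)
  val mergeCombiners: ((Int, Int), (Int, Int)) => (Int, Int) =
    (a, b) => (a._1 + b._1, a._2 + b._2)

  // Hypothetical split of the seven sample rows into two partitions.
  val partition0: List[(String, Int)] =
    List(("chaoyang", 20), ("chaoyang", 27), ("chaoyang", 35), ("haidian", 24))
  val partition1: List[(String, Int)] =
    List(("haidian", 40), ("chaoyang", 90), ("haidian", 80))

  // Within one partition: createCombiner for the first value of a key, mergeValue afterwards.
  def combinePartition(rows: List[(String, Int)]): Map[String, (Int, Int)] =
    rows.foldLeft(Map.empty[String, (Int, Int)]) { case (acc, (key, v)) =>
      acc.get(key) match {
        case Some(c) => acc.updated(key, mergeValue(c, v))
        case None    => acc.updated(key, createCombiner(v))
      }
    }

  val local0 = combinePartition(partition0) // chaoyang -> (82, 3), haidian -> (24, 1)
  val local1 = combinePartition(partition1) // haidian -> (120, 2), chaoyang -> (90, 1)

  // Across partitions: mergeCombiners joins the per-partition results for each key.
  val merged: Map[String, (Int, Int)] =
    local1.foldLeft(local0) { case (acc, (key, c)) =>
      acc.updated(key, acc.get(key).map(mergeCombiners(_, c)).getOrElse(c))
    }

  // Convert to Double before dividing so the average is not truncated.
  val averages: Map[String, Double] =
    merged.map { case (key, (sum, count)) => (key, sum.toDouble / count) }

  def main(args: Array[String]): Unit = averages.foreach(println)
}
```

With this split, chaoyang ends up as (172, 4) and haidian as (144, 3), which gives the averages 43.0 and 48.0 shown above.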

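Because combineByKey is highly abstract, you can supply your own functions as the building blocks of the computation, which makes it flexible enough for many other aggregations. reduceByKey and groupByKey are both implemented on top of combineByKey.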
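To illustrate how the other two operators fall out of combineByKey, here is a sketch using a simplified local stand-in written with Scala collections. The helper names and the nested-Seq representation of partitions are assumptions for this example, not Spark's API. In Spark itself, groupByKey additionally disables map-side combining.

```scala
object ByKeyViaCombine {
  // Simplified local stand-in for combineByKey: each inner Seq plays the role of one partition.
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: combine inside each partition.
    val perPartition: Seq[Map[K, C]] = partitions.map { rows =>
      rows.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
      }
    }
    // Step 2: merge the per-partition combiners key by key.
    perPartition.foldLeft(Map.empty[K, C]) { (merged, part) =>
      part.foldLeft(merged) { case (acc, (k, c)) =>
        acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  // reduceByKey: the combiner type is V itself, and one function serves both merge steps.
  def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]])(func: (V, V) => V): Map[K, V] =
    combineByKey[K, V, V](partitions)(v => v, func, func)

  // groupByKey: the combiner is a collection that accumulates every value.
  def groupByKey[K, V](partitions: Seq[Seq[(K, V)]]): Map[K, Vector[V]] =
    combineByKey[K, V, Vector[V]](partitions)(v => Vector(v), _ :+ _, _ ++ _)

  // Hypothetical sample data in two partitions.
  val data: Seq[Seq[(String, Int)]] = Seq(
    Seq(("chaoyang", 20), ("haidian", 24), ("chaoyang", 27)),
    Seq(("haidian", 40), ("chaoyang", 35)))

  val sums: Map[String, Int] = reduceByKey(data)(_ + _)
  val groups: Map[String, Vector[Int]] = groupByKey(data)

  def main(args: Array[String]): Unit = {
    println(sums)
    println(groups)
  }
}
```

The only difference between the two derived operators is the choice of combiner type: a single value for reduceByKey, a growing collection for groupByKey.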

combineByKey

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null): RDD[(K, C)]

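1. The first parameter, createCombiner: V => C. When combineByKey meets a key k for the first time in a partition, it calls createCombiner to turn the value V into a C. (This step is similar to initialization.)
2. The second parameter, mergeValue: (C, V) => C. When combineByKey meets a key k that it has already seen, it calls mergeValue to fold v into c. (This happens within each partition.)
3. The third parameter, mergeCombiners: (C, C) => C. It merges two values of type C into one. (This happens across partitions.)
4. The operator returns an RDD[(K, C)]: for each key k, the values are converted from the original type V to type C.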

val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)  // zip pairs the two RDDs element by element into a single RDD of (Int, String) pairs
 
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
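// In this combineByKey, the first function wraps the first value seen for a key in a new List.
// The second function is used when the key has been seen before: each new value is prepended to that key's list.
// The last function is the merge: it concatenates two lists that belong to the same key.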
 
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

val initialScores = Array(("Fred", 88.0), ("Fred", 95.0), ("Fred", 91.0), ("Wilma", 93.0), ("Wilma", 95.0), ("Wilma", 98.0))
val d1 = sc.parallelize(initialScores)
// Tuple type (subject counter, score total). The alias lets every (Int, Double) in this code be written as MVType.
type MVType = (Int, Double)
d1.combineByKey(
  // createCombiner: wrap the score in a tuple. For "Fred", whose first score is 88.0, this gives (1, 88.0);
  // the 1 is the subject counter, since only one subject has been seen so far.
  score => (1, score),
  // mergeValue: c1 is the combiner built by createCombiner, e.g. (1, 88.0). When another score for "Fred"
  // turns up in the same partition, add it to the running total (c1._2 + newScore) and
  // increment the subject counter (c1._1 + 1).
  (c1: MVType, newScore) => (c1._1 + 1, c1._2 + newScore),
  // mergeCombiners: "Fred" may be a top student whose many subjects are spread over several partitions.
  // After mergeValue has run in every partition, the partial results are merged across partitions:
  // counts are added to counts and totals to totals, giving the overall subject count and total score.
  (c1: MVType, c2: MVType) => (c1._1 + c2._1, c1._2 + c2._2)
).map { case (name, (num, score)) => (name, score / num) }.collect
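The score example does not show its output, so the expected averages can be worked out with local Scala collections. This sketch, which assumes nothing beyond the six sample scores, folds them into the same (count, total) shape as MVType and then divides. It runs without Spark and is only a cross-check, not a replacement for the RDD code.

```scala
object ScoreAverageCheck {
  val initialScores = Array(("Fred", 88.0), ("Fred", 95.0), ("Fred", 91.0),
    ("Wilma", 93.0), ("Wilma", 95.0), ("Wilma", 98.0))

  // Fold the scores into (count, total) per name, mirroring the MVType combiner.
  val totals: Map[String, (Int, Double)] =
    initialScores.foldLeft(Map.empty[String, (Int, Double)]) { case (acc, (name, score)) =>
      val (num, sum) = acc.getOrElse(name, (0, 0.0))
      acc.updated(name, (num + 1, sum + score))
    }

  // Average = total / count, as in the final map step of the RDD pipeline.
  val averages: Map[String, Double] =
    totals.map { case (name, (num, sum)) => (name, sum / num) }

  def main(args: Array[String]): Unit = averages.foreach(println)
}
```

Fred's total is 274.0 over 3 subjects (about 91.33) and Wilma's is 286.0 over 3 subjects (about 95.33), so the combineByKey pipeline should return the same two averages, in whatever order the keys come back.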

reduceByKey

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

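It operates on key-value data and aggregates the records that share a key. The function you pass in is applied to the values of two records with the same key and returns a single combined value.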

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
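val b = a.map(x => (x.length, x)) // build key-value pairs: the key is the string length, the value is the string
b.reduceByKey(_ + _).collect  // concatenate the values that share a key; every string has length 3, so the result is as follows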
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))
 
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x)) // likewise, turn the data into tuples
b.reduceByKey(_ + _).collect // length 3: dog and cat; length 4: lion; length 5: tiger and eagle; length 7: panther
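One detail worth illustrating: Spark's documentation asks for an associative and commutative function in reduceByKey, because partial results from different partitions may be merged in any order. String concatenation is associative but not commutative, so the exact string produced by the examples above can depend on that order. The sketch below models this in plain Scala; the two-partition layout is an assumption for illustration.

```scala
object ReduceOrderSketch {
  // Hypothetical two-partition layout of the five words; every word has key 3.
  val partitionA = List("dog", "cat")
  val partitionB = List("owl", "gnu", "ant")

  val concat: (String, String) => String = _ + _
  val add: (Int, Int) => Int = _ + _

  // Each partition is reduced locally first.
  val localA = partitionA.reduce(concat)
  val localB = partitionB.reduce(concat)

  // The partial results can then be merged in either order.
  val mergedAB = concat(localA, localB)
  val mergedBA = concat(localB, localA)

  // The same experiment with addition over the word lengths.
  val sumA = partitionA.map(_.length).reduce(add)
  val sumB = partitionB.map(_.length).reduce(add)
  val sumsAgree = add(sumA, sumB) == add(sumB, sumA)

  def main(args: Array[String]): Unit = {
    println(mergedAB)
    println(mergedBA)
    println(sumsAgree)
  }
}
```

Concatenation yields dogcatowlgnuant in one merge order and owlgnuantdogcat in the other, while the sum is 15 either way. For results that must be deterministic, a commutative function such as addition is the safer choice.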

groupByKey

def groupByKey(): RDD[(K, Iterable[V])]  // groups a key-value RDD by key; each returned value is an Iterable holding the values of all tuples whose key is K
 
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length) // keyBy uses the result of _.length as the key; after grouping, the values come back as an ArrayBuffer
b.groupByKey().collect()
 
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))

groupByKey does not merge values within each partition before the shuffle, so it performs worst of the three.
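The cost difference can be made concrete by counting how many records each approach would hand to the shuffle. This is a rough model in plain Scala, with a made-up two-partition dataset; real shuffle cost also depends on record size and serialization, which the sketch ignores.

```scala
object ShuffleVolumeSketch {
  // Hypothetical data: words keyed by length, spread over two partitions.
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq((3, "dog"), (3, "cat"), (3, "owl"), (4, "lion")),
    Seq((3, "gnu"), (3, "ant"), (5, "tiger"), (5, "eagle")))

  // groupByKey-style: no merging before the shuffle, so every record is sent.
  val recordsWithoutCombine: Int = partitions.map(_.size).sum

  // reduceByKey-style: values are merged per key inside each partition first,
  // so each partition sends at most one record per distinct key.
  val recordsWithCombine: Int = partitions.map(p => p.map(_._1).distinct.size).sum

  def main(args: Array[String]): Unit = {
    println(s"without map-side combine: $recordsWithoutCombine records")
    println(s"with map-side combine:    $recordsWithCombine records")
  }
}
```

In this sample, eight records cross the shuffle without map-side combining and four with it. The gap widens as keys repeat more often within a partition, which is why reduceByKey or combineByKey is usually preferred when the goal is an aggregate rather than the full list of values.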