Spark算子篇 --Spark算子之aggregateByKey详解

时间 2019-11-06

标签 spark 算子 aggregatebykey 详解栏目 Spark 繁體版

原文原文链接

一。基本介绍函数

rdd.aggregateByKey(3, seqFunc, combFunc) 其中第一个函数是初始值 ui

3表明每次分完组以后的每一个组的初始值。spa

seqFunc表明combine的聚合逻辑rest

每个mapTask的结果的聚合成为combinecode

combFunc reduce端大聚合的逻辑blog

ps:aggregateByKey默认分组it

二。代码spark

from pyspark import SparkConf,SparkContext
from __builtin__ import str
conf = SparkConf().setMaster("local").setAppName("AggregateByKey")
sc = SparkContext(conf = conf)

rdd = sc.parallelize([(1,1),(1,2),(2,1),(2,3),(2,4),(1,7)],2)

def f(index,items):
    print "partitionId:%d" %index
    for val in items:
        print val
    return items
    
rdd.mapPartitionsWithIndex(f, False).count()


def seqFunc(a,b):
    print "seqFunc:%s,%s" %(a,b)
    return max(a,b) #取最大值
def combFunc(a,b):
    print "combFunc:%s,%s" %(a ,b)
    return a + b #累加起来
'''
    aggregateByKey这个算子内部确定有分组
'''
aggregateRDD = rdd.aggregateByKey(3, seqFunc, combFunc)
rest = aggregateRDD.collectAsMap()
for k,v in rest.items():
    print k,v

sc.stop()

三。详细逻辑io

PS：ast

seqFunc函数 combine篇。

3是每一个分组的最大值，因此把3传进来，在combine函数中也就是seqFunc中第一次调用 3表明a,b即1,max(a,b)即3 第二次再调用则max(3.1)中的最大值3即输入值，2即b值因此结果则为(1,3)

底下相似。combine函数调用的次数与分组内的数据个数一致。

combFunc函数 reduce聚合

在reduce端大聚合，拉完数据后也是先分组，而后再调用combFunc函数

四。结果

持续更新中。。。。，欢迎你们关注个人公众号LHWorld.