[Source Code Analysis] What Flink's groupBy and reduce Actually Do

Date: 2020-06-10

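0x00 Abstract

groupBy and reduce are common operators in big data processing, yet many people have only a vague idea of the mechanics behind them. Starting from the source code, this article walks through how groupBy and reduce work in Flink and what they do behind the scenes.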

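0x01 Questions and Summary

1.1 Questions

This investigation started from a few questions:

Does the groupBy operator sort the data?
How many sorts take place across groupBy and reduce?
If there are multiple groupBy tasks, what mechanism guarantees that records with the same key from all of these tasks reach the same reducer?
Is there a rebalance (redistribution) during groupBy and reduce?
Does the reduce operator repartition the work into new tasks?
Can the reduce operator be chained with the operators before or after it into an Operator Chain?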

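1.2 Summary

To make the rest easier to follow, here is the summary first. For a groupBy + reduce operation, Flink does the following:

Group has no real operator of its own; it is only an intermediate or auxiliary step before reduce.
After Flink generates the batch execution plan (Plan), the meaningful result is the Reduce operator.
To make reduce cheaper, Flink relies heavily on a Combine step before it. Combine can be understood as a reduce on the map side that merges the output of a single map task.
After Flink generates the batch optimized plan (Optimized Plan), reduce is split into two stages: SORTED_PARTIAL_REDUCE and SORTED_REDUCE.
SORTED_PARTIAL_REDUCE is the Combine.
After Flink generates the JobGraph, an Operator Chain is formed: Reduce (SORTED_PARTIAL_REDUCE) is merged with its upstream operators.
Flink uses a Partitioner to guarantee that records with the same key from multiple groupBy tasks are sent to the same reducer.
There are at least three sorting steps across groupBy and reduce:
- combine
- sort + merge
- reduce

That largely answers the questions above.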

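0x02 Background Concepts

2.1 MapReduce

MapReduce is a programming model for parallel computation over large data sets. Its two central ideas, "Map" and "Reduce", are borrowed from functional and vector programming languages.

Flink and Spark, which we use today, both build on ideas from MapReduce, so it is worth going back to the source and looking at how MapReduce separates its stages.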

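2.2 MapReduce Stages

Broken down further, MapReduce consists of the following stages:

Input-Split: the files read from HDFS are split and handed to the map side. There are as many mappers as there are splits, and the split size usually matches the HDFS block size.
Shuffle-Spill: each map task has a circular buffer. Once the buffer reaches its threshold (80%), a background thread starts to spill the contents to disk. While the spill is in progress, the map keeps writing into the remaining 20% without interference; if the buffer fills up completely, the map blocks until the write to disk finishes.
Shuffle-Partition: because map tasks may process different amounts of data, the data arriving at the reducers can be skewed. Partitioning helps here: during the shuffle, the hash code of the key is by default taken modulo the number of partitions, and each reducer pulls the data for its partition number, which balances the load. The number of partitions equals the number of reduce tasks.
Shuffle-Sort: after partitioning, the data inside each partition is sorted in memory. Sorting is interleaved throughout MapReduce and happens in many places.
Shuffle-Group: grouping puts the values that share a key into one group; the word count program relies on this step.
Shuffle-Combiner: this can be seen as a small reduce stage. When the data volume is large, a combine can run during the map phase, which amounts to running a reduce on the map side. Because reduce and map run on different nodes, the reducer has to pull data remotely; combine cuts the amount of data pulled and reduces network load. (This step is off by default. Do not apply it as-is in jobs such as computing an average, because combining partial averages changes the result.)
Compress: when the buffer is spilled to disk, the data can be compressed, which saves disk space and also reduces the amount of data sent to the reducers.
Reduce-Merge: the reduce side pulls the partition files that belong to it from every map output, so it ends up with many files. In this stage the reducer merges and sorts them again before feeding them to reduce.
Output: in the reduce stage, the reduce function is called for each key of the sorted input. The output of this stage is written directly to the output file system, usually HDFS.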

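2.3 Combine

Combine deserves special attention. In MapReduce there are many map tasks and few reduce tasks. Since the reduce side receives a lot of data, each map task first aggregates its own output during the map phase, which takes pressure off the reduce phase.

Combine can be understood as a reduce on the map side that merges the output of a single map task. Combine works on one map task, whereas reduce merges the output of many map tasks.

The key/value pairs produced by the map function are the input of the combine function; after combine they are passed on to the reduce function. This reduces both the amount of data written to disk and the number of key/value pairs sent over the network. On the map side, the user-defined Combiner class runs on the same node as the map task, which is equivalent to running a reduce over the output of the map function.

Available bandwidth in a cluster is usually limited, and large volumes of intermediate data become a performance bottleneck, so heavy data transfer between map tasks and reduce tasks should be avoided as far as possible. The point of Combine is to make the map output more compact, so that less data is written to local disk and sent to the reduce side.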
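As a minimal sketch of the combine idea in plain Java (this is not Hadoop or Flink API; `CombineSketch`, `SumCount` and the method names are made up for illustration), each map task's output is collapsed locally before anything is shipped. It also illustrates the caveat about averages: a combiner for an average has to ship (sum, count) pairs, because averaging per-task averages gives a different number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineSketch {

    /** Partial aggregate shipped from one map task: a sum and how many values it covers. */
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /** Combine step: collapses one map task's values into a single partial aggregate. */
    static SumCount combine(List<Integer> mapOutput) {
        long sum = 0;
        for (int v : mapOutput) {
            sum += v;
        }
        return new SumCount(sum, mapOutput.size());
    }

    /** Reduce step: merges the partial aggregates of all map tasks into the global average. */
    static double reduceToAverage(List<SumCount> partials) {
        long sum = 0;
        long count = 0;
        for (SumCount p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return (double) sum / count;
    }

    /** Incorrect variant: averaging per-task averages ignores how many values each task saw. */
    static double averageOfAverages(List<List<Integer>> mapOutputs) {
        double total = 0;
        for (List<Integer> out : mapOutputs) {
            SumCount p = combine(out);
            total += (double) p.sum / p.count;
        }
        return total / mapOutputs.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(10));

        List<SumCount> partials = new ArrayList<>();
        for (List<Integer> out : mapOutputs) {
            partials.add(combine(out)); // 4 values shrink to 2 partial records before the shuffle
        }

        System.out.println(reduceToAverage(partials));     // 4.0, the true average of 1, 2, 3, 10
        System.out.println(averageOfAverages(mapOutputs)); // 6.0, which is not the true average
    }
}
```

The design point is that a combiner is only safe when merging partial results gives the same answer as reducing everything at once; sums and counts have that property, averages do not.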

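2.4 Partition

Partition splits the output of each map node and maps it by key to different reduce tasks. MapReduce does this with the hash-based HashPartitioner by default, and a custom one can be supplied.

It can be thought of as sorting things into categories. Take a zoo with cattle, sheep, chickens, ducks and geese all mixed together: in the evening the cattle go back to the cowshed, the sheep to the sheepfold, and the chickens to the coop. Partition does the same kind of sorting for data, only inside a program.

After the mapper runs, its output is a key/value pair such as key "aaa" with value 1. The map side only emits the 1; the results are merged later in the reduce task. If the job has 3 reduce tasks, which of them should handle "aaa"? That has to be decided right away.

MapReduce provides the Partitioner interface. Based on the key or value and the number of reduce tasks, it decides which reduce task should process the current output pair. The default hashes the key and takes the result modulo the number of reduce tasks. This default only aims at spreading load evenly across the reducers; if needed, a custom Partitioner can be written and set on the job.

In our example, assume that "aaa" returns 0 from the Partitioner, so this pair goes to the first reducer.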
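The default rule described above can be sketched in a few lines of plain Java. The formula follows the familiar "hash, clear the sign bit, take the remainder" pattern of hash partitioners; the class name `HashPartitionSketch` is made up, and nothing here calls the Hadoop API.

```java
public class HashPartitionSketch {

    /**
     * Picks the reduce task for a key: hash the key, clear the sign bit so the
     * value is non-negative, then take the remainder by the number of reduce tasks.
     */
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String key : new String[] {"aaa", "bbb", "aaa", "ccc"}) {
            // The same key always prints the same reduce task index.
            System.out.println(key + " -> reduce task " + partition(key, numReduceTasks));
        }
    }
}
```

With `String.hashCode()`, "aaa" hashes to 96321, which is divisible by 3, so with 3 reduce tasks this sketch also sends "aaa" to partition 0. Because the result depends only on the key and the task count, every map task makes the same choice for the same key.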

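2.5 Shuffle

Shuffle is the process between map and reduce, and it includes the combine and partition steps at its two ends. It is hard to grasp because it cannot be seen or touched: it belongs to the MapReduce framework, and we never call it directly when programming.

Roughly, the scope of Shuffle is how to deliver the output of the map tasks to the reduce side efficiently. Put differently, Shuffle describes the path of the data from the output of a map task to the input of a reduce task.

2.6 Reducer

In short, before it executes, a reduce task keeps pulling the final output of every map task in the job and keeps merging the data pulled from different places, until it ends up with a single file that serves as the input of the reduce task.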
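The merge on the reduce side can be pictured as a k-way merge of sorted runs. The sketch below is plain Java with hypothetical names (`MergeSketch`, `Cursor`); it keeps everything in memory, whereas a real reducer merges files on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {

    /** Cursor over one sorted run, remembering the next unread position. */
    static final class Cursor {
        final List<String> run;
        int pos;

        Cursor(List<String> run) {
            this.run = run;
        }

        String head() {
            return run.get(pos);
        }
    }

    /** Merges already sorted runs into one sorted list using a min-heap of run heads. */
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // the run whose head is smallest
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c);             // re-insert the run with its new head
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three sorted map outputs destined for the same reduce task.
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("aaa", "ccc"),
                Arrays.asList("aaa", "bbb"),
                Arrays.asList("bbb", "ddd"));
        System.out.println(merge(runs)); // [aaa, aaa, bbb, bbb, ccc, ddd]
    }
}
```

After the merge, equal keys sit next to each other, which is what lets the reduce function be called once per key.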

0x03 Code

We use a Flink word count program as the example:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountExampleReduce {

    public static void main(String[] args) throws Exception {
        // set up the execution environment
        final ExecutionEnvironment env =
                ExecutionEnvironment.getExecutionEnvironment();
        // build the data set from strings
        DataSet<String> text = env.fromElements(
                "Who‘s there?",
                "I think I hear them. Stand, ho! Who‘s there?");
        // split the lines into words, group by key, and count the occurrences of each key
        DataSet<Tuple2<String, Integer>> wordCounts = text
                .flatMap(new LineSplitter())
                .groupBy(0)
                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1,
                                          Tuple2<String, Integer> value2) throws Exception {
                        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
                    }
                });
        // print the result
        wordCounts.print();
    }

    // splits a line into (word, 1) pairs
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

The output is:

(hear,1)
(ho!,1)
(them.,1)
(I,2)
(Stand,,1)
(Who‘s,2)
(there?,2)
(think,1)

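0x04 Digging In from the Flink Java API

We start with the basic Flink Java API.

4.1 GroupBy Is an Auxiliary Concept

4.1.1 Grouping

The point to note is that GroupBy has no corresponding Operator. GroupBy is only an intermediate or auxiliary step in building a DataSet transformation.

The base class behind GroupBy is Grouping, which is just an intermediate step of a DataSet transformation. Its main members are:

the input DataSet
the keys to group by
a user-defined Partitioner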

// Grouping is an intermediate step for a transformation on a grouped DataSet.
public abstract class Grouping<T> {
   protected final DataSet<T> inputDataSet;
   protected final Keys<T> keys;
   protected Partitioner<?> customPartitioner;
}

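Grouping has no business-facing API of its own; the concrete API lives in its subclasses, such as UnsortedGrouping.

4.1.2 UnsortedGrouping

The class used in our code is UnsortedGrouping. It offers many business APIs, such as sum, max, min, reduce, aggregate, reduceGroup, combineGroup, and so on.

Back to our example, groupBy does the following:

First, groupBy returns an UnsortedGrouping, which is used to transform the DataSet.
Second, .groupBy(0).reduce(...) returns a ReduceOperator. This matches what was said earlier: groupBy is only an intermediate step, and it is reduce that returns an Operator.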

// In DataSet<T>: groupBy returns an UnsortedGrouping
public abstract class DataSet<T> {
    public UnsortedGrouping<T> groupBy(int... fields) {
        return new UnsortedGrouping<>(this, new Keys.ExpressionKeys<>(fields, getType()));
    }
}

// In UnsortedGrouping<T>: reduce returns a ReduceOperator
public class UnsortedGrouping<T> extends Grouping<T> {
    public ReduceOperator<T> reduce(ReduceFunction<T> reducer) {
        return new ReduceOperator<T>(this, inputDataSet.clean(reducer), Utils.getCallLocationName());
    }
}
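To make the "groupBy is only an intermediate step" point concrete, here is a toy model in plain Java. It does not use Flink; `MiniDataSet` and `MiniGrouping` are hypothetical stand-ins, and unlike Flink (which builds a lazy operator graph) the `reduce` here computes its result eagerly.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class GroupingSketch {

    /** Stand-in for a DataSet: just wraps a list of elements. */
    static final class MiniDataSet<T> {
        final List<T> elements;

        MiniDataSet(List<T> elements) {
            this.elements = elements;
        }

        /** Does no work on the data: it only records which key to group by. */
        <K> MiniGrouping<T, K> groupBy(Function<T, K> keySelector) {
            return new MiniGrouping<T, K>(this, keySelector);
        }
    }

    /** Stand-in for Grouping: an input plus a key selector, with no computation of its own. */
    static final class MiniGrouping<T, K> {
        final MiniDataSet<T> input;
        final Function<T, K> keySelector;

        MiniGrouping(MiniDataSet<T> input, Function<T, K> keySelector) {
            this.input = input;
            this.keySelector = keySelector;
        }

        /** The actual operator: only here are elements grouped by key and reduced. */
        MiniDataSet<T> reduce(BinaryOperator<T> reducer) {
            Map<K, T> perKey = new LinkedHashMap<>();
            for (T element : input.elements) {
                perKey.merge(keySelector.apply(element), element, reducer);
            }
            return new MiniDataSet<T>(new ArrayList<T>(perKey.values()));
        }
    }

    /** Word count expressed as groupBy(key).reduce(sum) on the toy classes. */
    static List<SimpleEntry<String, Integer>> wordCount(List<String> words) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new SimpleEntry<String, Integer>(w, 1));
        }
        return new MiniDataSet<SimpleEntry<String, Integer>>(pairs)
                .groupBy(p -> p.getKey())
                .reduce((a, b) ->
                        new SimpleEntry<String, Integer>(a.getKey(), a.getValue() + b.getValue()))
                .elements;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("I", "think", "I", "hear")));
        // [I=2, think=1, hear=1]
    }
}
```

The grouping object carries only the input and the key; all of the work is attached to `reduce`, which mirrors how the UnsortedGrouping is passed into the ReduceOperator.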

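4.2 reduce Is the Real Operator

From the application's point of view, reduce is the logical operator that actually matters.

As the calls above and the definition of ReduceOperator show, .groupBy(0).reduce() produces a ReduceOperator, and the UnsortedGrouping is stored in the ReduceOperator's grouper field as auxiliary information.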

public class ReduceOperator<IN> extends SingleInputUdfOperator<IN, IN, ReduceOperator<IN>> {

	private final ReduceFunction<IN> function;
	private final Grouping<IN> grouper; // the UnsortedGrouping is stored here and used later by reduce

	public ReduceOperator(Grouping<IN> input, ReduceFunction<IN> function,
			String defaultName) {
		this.function = function;
		this.grouper = input; // the UnsortedGrouping is stored here and used later by reduce
		this.hint = CombineHint.OPTIMIZER_CHOOSES; // used later during optimization
	}
}

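Let's follow the stages of Flink program execution and see what the system does.

0x05 Batch Execution Plan (Plan)

The first step of execution: when the program runs, an execution Plan is generated from the result of the Java API calls.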

public JobClient executeAsync(String jobName) throws Exception {
   final Plan plan = createProgramPlan(jobName);
}

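The important function here is translateToDataFlow, which converts operators from the operators package of the batch Java API module into the operators package of the core module.

For our example program, translateToDataFlow produces a SingleInputOperator when the graph is built, for later use by the runtime. Below is an abridged version of the code.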

protected org.apache.flink.api.common.operators.SingleInputOperator<?, IN, ?> translateToDataFlow(Operator<IN> input) {

	......

	// the keys are taken out of the UnsortedGrouping
	else if (grouper.getKeys() instanceof Keys.ExpressionKeys) {

		// reduce with field positions
		ReduceOperatorBase<IN, ReduceFunction<IN>> po =
				new ReduceOperatorBase<>(function, operatorInfo, logicalKeyPositions, name);

		po.setCustomPartitioner(grouper.getCustomPartitioner());
		po.setInput(input);
		po.setParallelism(getParallelism()); // the parallelism is unchanged

		return po; // the SingleInputOperator produced for the runtime
	}
}

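The execution plan generated for our code is shown below. It matches expectations: a straight line from input to output, in which groupBy + reduce shows up as a single Reduce operator.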

GenericDataSourceBase ——> FlatMapOperatorBase ——> ReduceOperatorBase ——> GenericDataSinkBase

In the debugger it looks like this:

plan = {Plan@1296} 
 sinks = {ArrayList@1309}  size = 1
  0 = {GenericDataSinkBase@1313} "collect()"
   formatWrapper = {UserCodeObjectWrapper@1315} 
   input = {ReduceOperatorBase@1316} "ReduceOperatorBase - Reduce at main(WordCountExampleReduceCsv.java:25)"
    hint = {ReduceOperatorBase$CombineHint@1325} "OPTIMIZER_CHOOSES"
    customPartitioner = null
    input = {FlatMapOperatorBase@1326} "FlatMapOperatorBase - FlatMap at main(WordCountExampleReduceCsv.java:23)"
     input = {GenericDataSourceBase@1339} "at main(WordCountExampleReduceCsv.java:20) (org.apache.flink.api.java.io.TextInputFormat)"

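0x06 Batch Optimized Plan (Optimized Plan)

The second step of execution: Flink optimizes the Plan further and produces an Optimized Plan. The core code is in PlanTranslator.compilePlan, where the Optimized Plan is obtained.

The compilation step that follows makes no decisions or assumptions of its own: how the job will finally run has already been settled by the optimizer, and compilation is only a deterministic mapping on top of that. So we concentrate on how the plan is optimized.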

private JobGraph compilePlan(Plan plan, Configuration optimizerConfiguration) {
   Optimizer optimizer = new Optimizer(new DataStatistics(), optimizerConfiguration);
   OptimizedPlan optimizedPlan = optimizer.compile(plan);

   JobGraphGenerator jobGraphGenerator = new JobGraphGenerator(optimizerConfiguration);
   return jobGraphGenerator.compileJobGraph(optimizedPlan, plan.getJobId());
}

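Internally, the optimizer traverses the plan by calling its accept method. accept is invoked on each sink in turn, and for each sink preVisit runs first, then postVisit.

A few points to note about the optimization:

In GraphCreatingVisitor.preVisit, when the Operator is found to be a ReduceOperatorBase, a ReduceNode is created.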
```
else if (c instanceof ReduceOperatorBase) {
   n = new ReduceNode((ReduceOperatorBase<?, ?>) c);
}
```

ReduceNode is the Optimizer's representation of the Reduce operator.

public class ReduceNode extends SingleInputNode {
	private final List<OperatorDescriptorSingle> possibleProperties;	
	private ReduceNode preReduceUtilityNode;
}

When the ReduceNode is created, the hint mentioned earlier determines the combiner strategy; for OPTIMIZER_CHOOSES it is combinerStrategy = DriverStrategy.SORTED_PARTIAL_REDUCE.

public ReduceNode(ReduceOperatorBase<?, ?> operator) {
	DriverStrategy combinerStrategy;
	switch (operator.getCombineHint()) {
		case OPTIMIZER_CHOOSES:
			combinerStrategy = DriverStrategy.SORTED_PARTIAL_REDUCE;
			break;
	}
}

The resulting optimized plan is shown below. At this point the parallelism has been set, and reduce has been split into two stages: SORTED_PARTIAL_REDUCE and SORTED_REDUCE.

Data Source  ——> FlatMap ——> Reduce(SORTED_PARTIAL_REDUCE)   ——> Reduce(SORTED_REDUCE)  ——> Data Sink

In the debugger it looks like this:

optimizedPlan = {OptimizedPlan@1506} 
 
 allNodes = {HashSet@1510}  size = 5
   
  0 = {SourcePlanNode@1512} "Data Source "at main(WordCountExampleReduceCsv.java:20) (org.apache.flink.api.java.io.TextInputFormat)" : NONE [[ GlobalProperties [partitioning=RANDOM_PARTITIONED] ]] [[ LocalProperties [ordering=null, grouped=null, unique=null] ]]"
   parallelism = 4

  1 = {SingleInputPlanNode@1513} "FlatMap "FlatMap at main(WordCountExampleReduceCsv.java:23)" : FLAT_MAP [[ GlobalProperties [partitioning=RANDOM_PARTITIONED] ]] [[ LocalProperties [ordering=null, grouped=null, unique=null] ]]"
   parallelism = 4

  2 = {SingleInputPlanNode@1514} "Reduce "Reduce at main(WordCountExampleReduceCsv.java:25)" : SORTED_REDUCE [[ GlobalProperties [partitioning=RANDOM_PARTITIONED] ]] [[ LocalProperties [ordering=null, grouped=null, unique=null] ]]"
   parallelism = 4

  3 = {SinkPlanNode@1515} "Data Sink "collect()" : NONE [[ GlobalProperties [partitioning=RANDOM_PARTITIONED] ]] [[ LocalProperties [ordering=null, grouped=null, unique=null] ]]"
   parallelism = 4

  4 = {SingleInputPlanNode@1516} "Reduce "Reduce at main(WordCountExampleReduceCsv.java:25)" : SORTED_PARTIAL_REDUCE [[ GlobalProperties [partitioning=RANDOM_PARTITIONED] ]] [[ LocalProperties [ordering=null, grouped=null, unique=null] ]]"
   parallelism = 4
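The split into a partial stage and a final stage can be imitated in plain Java. This is only a sketch under loud assumptions: it does not use Flink, the class and method names are invented, a TreeMap stands in for Flink's sort-based strategy, and the channel choice is a simple hash modulo rather than Flink's own hashing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoPhaseReduceSketch {

    /** Stage 1 (partial reduce): counts words inside one upstream task; TreeMap keeps keys sorted. */
    static TreeMap<String, Integer> partialReduce(List<String> words) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    /** Shuffle: the same key always maps to the same reducer index. */
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Runs both stages and returns the final counts held by each reducer. */
    static List<TreeMap<String, Integer>> run(List<List<String>> upstreamInputs, int numReducers) {
        List<TreeMap<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<String, Integer>());
        }
        for (List<String> input : upstreamInputs) {
            // Stage 1 runs inside each upstream task, before any data is shipped.
            for (Map.Entry<String, Integer> partial : partialReduce(input).entrySet()) {
                TreeMap<String, Integer> target =
                        reducers.get(reducerFor(partial.getKey(), numReducers));
                // Stage 2 (final reduce): merge the partial counts that arrive for a key.
                target.merge(partial.getKey(), partial.getValue(), Integer::sum);
            }
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
                Arrays.asList("who", "there"),
                Arrays.asList("i", "think", "i", "hear", "who", "there"));
        List<TreeMap<String, Integer>> reducers = run(inputs, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + ": " + reducers.get(i));
        }
    }
}
```

In the second input, the two "i" records are merged into one partial record before the shuffle, so fewer records cross the network, and every key ends up in exactly one reducer. This only works because the reduce function (a sum here) gives the same result whether it is applied in one pass or in two.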

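0x07 JobGraph

The third step of execution is building the JobGraph, which is generated in LocalExecutor.execute. The JobGraph is derived from the Optimized Plan and is the data structure submitted to the JobManager.

The main optimization here is chaining: eligible nodes are chained together into one node, which cuts the serialization, deserialization, and transfer cost of moving data between nodes.

The JobGraph is the only description of a job that Flink's dataflow engine understands, and this shared abstraction is what unifies stream processing and batch processing at runtime.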

public CompletableFuture<JobClient> execute(Pipeline pipeline, Configuration configuration) throws Exception {
   final JobGraph jobGraph = getJobGraph(pipeline, configuration);
}

We can see that this step forms an Operator Chain:

CHAIN DataSource -> FlatMap -> Combine (Reduce)

So Reduce(SORTED_PARTIAL_REDUCE) has been merged with its upstream operators.

Printed from the program:

jobGraph = {JobGraph@1739} "JobGraph(jobId: 30421d78d7eedee6be2c5de39d416eb7)"
 taskVertices = {LinkedHashMap@1742}  size = 3
  
  {JobVertexID@1762} "e2c43ec0df647ea6735b2421fb7330fb" -> {InputOutputFormatVertex@1763} "CHAIN DataSource (at main(WordCountExampleReduceCsv.java:20) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCountExampleReduceCsv.java:23)) -> Combine (Reduce at main(WordCountExampleReduceCsv.java:25)) (org.apache.flink.runtime.operators.DataSourceTask)"
  
  {JobVertexID@1764} "2de11f497e827e48dda1d63b458dead7" -> {JobVertex@1765} "Reduce (Reduce at main(WordCountExampleReduceCsv.java:25)) (org.apache.flink.runtime.operators.BatchTask)"
  
  {JobVertexID@1766} "2bee17f2c86aa1e9439e3dedea58007b" -> {InputOutputFormatVertex@1767} "DataSink (collect()) (org.apache.flink.runtime.operators.DataSinkTask)"

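0x08 Runtime

Once the job is submitted, the program actually runs. Three sorting steps are involved:

The first happens when FlatMap hands its data downstream and ChainedReduceCombineDriver.sortAndCombine is called. This corresponds to the Combine and Partition of MapReduce described earlier.
The second happens in the BatchTask that hosts ReduceDriver, where UnilateralSortMerger performs the sort & merge.
The third happens in ReduceDriver, which performs the final reduce over the sorted input.

8.1 FlatMap

This is the first sort.

When a batch of data has been processed, ChainedFlatMapDriver calls close to send the data downstream.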

public void close() {
   this.outputCollector.close();
}

Through the Operator Chain, this ends up in ChainedReduceCombineDriver.close:

public void close() {
   // send the final batch
   try {
      switch (strategy) {
         case SORTED_PARTIAL_REDUCE:
            sortAndCombine(); // this is the branch taken in our example
            break;
         case HASHED_PARTIAL_REDUCE:
            reduceFacade.emit();
            break;
      }
   } catch (Exception ex2) {
      throw new ExceptionInChainedStubException(taskName, ex2);
   }

   outputCollector.close();
   dispose(false);
}

8.1.1 Combine

sortAndCombine sorts first, then combines, and emits the combined records as it goes.

private void sortAndCombine() throws Exception {
   final InMemorySorter<T> sorter = this.sorter;

   if (!sorter.isEmpty()) {
      sortAlgo.sort(sorter); // sort first

      final TypeSerializer<T> serializer = this.serializer;
      final TypeComparator<T> comparator = this.comparator;
      final ReduceFunction<T> function = this.reducer;
      final Collector<T> output = this.outputCollector;
      final MutableObjectIterator<T> input = sorter.getIterator();

      if (objectReuseEnabled) {
        ......
      } else {
         T value = input.next();

         // this is the combine
         // iterate over key groups
         while (running && value != null) {
            comparator.setReference(value);
            T res = value;

            // iterate within a key group
            while ((value = input.next()) != null) {
               if (comparator.equalToReference(value)) {
                  // same group, reduce
                  res = function.reduce(res, value);
               } else {
                  // new key group
                  break;
               }
            }

            output.collect(res); // emit the combined record
         }
      }
   }
}

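8.1.2 Partition

Which downstream channel a record goes to is decided by OutputEmitter.selectChannel. The available strategies include:

hash-partitioning, broadcasting, round-robin, and custom partition functions. PARTITION_HASH is used here.

Every task sends the partial count for a given string to the same downstream ReduceDriver, which guarantees that the downstream reducer computes correct totals.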

public final int selectChannel(SerializationDelegate<T> record) {
   switch (strategy) {
   ...
   case PARTITION_HASH:
      return hashPartitionDefault(record.getInstance(), numberOfChannels);
   ...
   }
}

private int hashPartitionDefault(T record, int numberOfChannels) {
	int hash = this.comparator.hash(record);
	return MathUtils.murmurHash(hash) % numberOfChannels;
}
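The excerpt above mixes the hash before taking the remainder. A rough illustration of why that helps is sketched below in plain Java. Flink's MathUtils.murmurHash is not reproduced here: `mix` is a stand-in that uses the well-known 32-bit finalizer step of MurmurHash3, and the class and method names are invented for this example.

```java
public class ChannelSelectSketch {

    /** Stand-in bit mixer: the 32-bit finalizer (fmix32) of MurmurHash3. */
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Plain modulo on the raw hash, with the sign bit cleared. */
    static int plainChannel(int hash, int numberOfChannels) {
        return (hash & Integer.MAX_VALUE) % numberOfChannels;
    }

    /** Mixes the hash first, then maps it to a channel index in [0, numberOfChannels). */
    static int mixedChannel(int hash, int numberOfChannels) {
        return (mix(hash) & Integer.MAX_VALUE) % numberOfChannels;
    }

    public static void main(String[] args) {
        int channels = 4;
        // Hashes that are all multiples of 4: plain modulo sends every one to channel 0.
        for (int hash = 0; hash <= 32; hash += 4) {
            System.out.println(hash + ": plain=" + plainChannel(hash, channels)
                    + " mixed=" + mixedChannel(hash, channels));
        }
    }
}
```

Either way the channel depends only on the record's key hash and the number of channels, so every upstream task picks the same channel for the same key. The mixing step is there so that hashes with a regular pattern are less likely to pile up on a single channel.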

The call stack:

hash:50, TupleComparator (org.apache.flink.api.java.typeutils.runtime)
hash:30, TupleComparator (org.apache.flink.api.java.typeutils.runtime)
hashPartitionDefault:187, OutputEmitter (org.apache.flink.runtime.operators.shipping)
selectChannel:147, OutputEmitter (org.apache.flink.runtime.operators.shipping)
selectChannel:36, OutputEmitter (org.apache.flink.runtime.operators.shipping)
emit:60, ChannelSelectorRecordWriter (org.apache.flink.runtime.io.network.api.writer)
collect:65, OutputCollector (org.apache.flink.runtime.operators.shipping)
collect:35, CountingCollector (org.apache.flink.runtime.operators.util.metrics)
sortAndCombine:254, ChainedReduceCombineDriver (org.apache.flink.runtime.operators.chaining)
close:266, ChainedReduceCombineDriver (org.apache.flink.runtime.operators.chaining)
close:40, CountingCollector (org.apache.flink.runtime.operators.util.metrics)
close:88, ChainedFlatMapDriver (org.apache.flink.runtime.operators.chaining)
invoke:215, DataSourceTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)

8.2 UnilateralSortMerger

This is the second sort.

In BatchTask, the input is first sorted and merged, and only then handed to Reduce for the actual work. The sort & merge is done in the UnilateralSortMerger class.

getIterator:646, UnilateralSortMerger (org.apache.flink.runtime.operators.sort)
getInput:1110, BatchTask (org.apache.flink.runtime.operators)
prepare:95, ReduceDriver (org.apache.flink.runtime.operators)
run:474, BatchTask (org.apache.flink.runtime.operators)
invoke:369, BatchTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)

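UnilateralSortMerger is a full-fledged sorter that implements a multi-way merge sort. Its logic is split across three threads (read, sort, spill), which are connected into a closed loop by a series of blocking queues.

Its memory is allocated through the MemoryManager, so the component never uses more memory than it was given.

The main fields of the class are excerpted below: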

public class UnilateralSortMerger<E> implements Sorter<E> {
	// ------------------------------------------------------------------------
	//                                  Threads
	// ------------------------------------------------------------------------

	/** The thread that reads the input channels into buffers and passes them on to the merger. */
	private final ThreadBase<E> readThread;

	/** The thread that merges the buffer handed from the reading thread. */
	private final ThreadBase<E> sortThread;

	/** The thread that handles spilling to secondary storage. */
	private final ThreadBase<E> spillThread;
	
	// ------------------------------------------------------------------------
	//                                   Memory
	// ------------------------------------------------------------------------
	
	/** The memory segments used first for sorting and later for reading/pre-fetching
	 * during the external merge. */
	protected final List<MemorySegment> sortReadMemory;
	
	/** The memory segments used to stage data to be written. */
	protected final List<MemorySegment> writeMemory;
	
	/** The memory manager through which memory is allocated and released. */
	protected final MemoryManager memoryManager;
	
	// ------------------------------------------------------------------------
	//                            Miscellaneous Fields
	// ------------------------------------------------------------------------
	/**
	 * Collection of all currently open channels, to be closed and deleted during cleanup.
	 */
	private final HashSet<FileIOChannel> openChannels;
	
	/**
	 * The monitor which guards the iterator field.
	 */
	protected final Object iteratorLock = new Object();
	
	/**
	 * The iterator to be returned by the sort-merger. This variable is null, while receiving and merging is still in
	 * progress and it will be set once we have &lt; merge factor sorted sub-streams that will then be streamed sorted.
	 */
	protected volatile MutableObjectIterator<E> iterator; 	// when debugging, you will find that the driver's input is this iterator

	private final Collection<InMemorySorter<?>> inMemorySorters;
}

8.2.1 The Three Threads

ReadingThread: consumes the input data and puts it into a buffer that will be sorted.

SortingThread: sorts the buffers filled by the reading thread.

SpillingThread: handles the spilling of intermediate results and sets up the merging. It also merges the channels until sufficiently few channels remain to perform the final streamed merge.
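The division of labour between the three threads can be imitated with a small pipeline in plain Java. This is a simplified sketch, not Flink code: the names are invented, the queues are unbounded and one-directional (the real component recycles buffers in a closed loop), and the last stage merges in memory instead of spilling to disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SortPipelineSketch {

    /** End-of-input marker, recognised by identity rather than by content. */
    private static final List<Integer> END = new ArrayList<>();

    /** Merges two sorted lists into one sorted list. */
    static List<Integer> mergeTwo(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a.size() + b.size());
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) {
            out.add(a.get(i++));
        }
        while (j < b.size()) {
            out.add(b.get(j++));
        }
        return out;
    }

    /** Sorts the input with three cooperating threads: read, sort, merge. */
    static List<Integer> sort(List<Integer> input, int bufferSize) {
        BlockingQueue<List<Integer>> filled = new LinkedBlockingQueue<>();
        BlockingQueue<List<Integer>> sorted = new LinkedBlockingQueue<>();
        List<Integer> result = new ArrayList<>();

        // Reading thread: cuts the input into fixed-size buffers.
        Thread reader = new Thread(() -> {
            for (int from = 0; from < input.size(); from += bufferSize) {
                int to = Math.min(from + bufferSize, input.size());
                filled.add(new ArrayList<>(input.subList(from, to)));
            }
            filled.add(END);
        });

        // Sorting thread: sorts each filled buffer into a sorted run.
        Thread sorter = new Thread(() -> {
            try {
                for (List<Integer> buf = filled.take(); buf != END; buf = filled.take()) {
                    Collections.sort(buf);
                    sorted.add(buf);
                }
                sorted.add(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Merging thread: folds every sorted run into the accumulated result.
        Thread merger = new Thread(() -> {
            try {
                List<Integer> acc = new ArrayList<>();
                for (List<Integer> run = sorted.take(); run != END; run = sorted.take()) {
                    acc = mergeTwo(acc, run);
                }
                result.addAll(acc);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        sorter.start();
        merger.start();
        try {
            reader.join();
            sorter.join();
            merger.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList(5, 3, 8, 1, 9, 2, 7), 3)); // [1, 2, 3, 5, 7, 8, 9]
    }
}
```

Because each stage only talks to its neighbours through a blocking queue, reading, sorting and merging can overlap in time, which is the main reason for splitting the work across threads.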

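8.2.2 MutableObjectIterator

UnilateralSortMerger has one field worth singling out:

protected volatile MutableObjectIterator<E> iterator;

This field is the final output of the sort-merger. If you have debugged an operator, you will have seen that the operator's input is of this type; the operator's input ultimately comes from here.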

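8.3 ReduceDriver

This is the third step, and it shows how reduce and groupBy work together. ReduceDriver does not sort again in the code below: it relies on the sorted input, in which records with the same key are adjacent.

For .groupBy(0), ReduceDriver simply reads the first record of the input: T value = input.next();
The code that follows has two nested while loops: the outer one iterates over the keys, the inner one reduces within a single key.
When iterating over the key groups, the record is set as the reference of the comparator (a comparison helper, not a Flink operator) with comparator.setReference(value);. Because groupBy only says to compare on field 0 and names no specific key value, the key of this record becomes the current key. This is marked while (1) in the code comments.
The following records are then read from the input: if the next record has the same key, it is reduced; if not, the loop breaks, the comparison stops, and the reduce result is emitted. This is marked while (2).
After leaving while (2), execution is still inside while (1), and value now holds the new record, so while (1) continues: comparator.setReference(value); is called again and the comparison starts for the new key.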

public class ReduceDriver<T> implements Driver<ReduceFunction<T>, T> {
	@Override
	public void run() throws Exception {

		final Counter numRecordsIn = this.taskContext.getMetricGroup().getIOMetricGroup().getNumRecordsInCounter();
		final Counter numRecordsOut = this.taskContext.getMetricGroup().getIOMetricGroup().getNumRecordsOutCounter();

		// cache references on the stack
		final MutableObjectIterator<T> input = this.input;
		final TypeSerializer<T> serializer = this.serializer;
		final TypeComparator<T> comparator = this.comparator;		
		final ReduceFunction<T> function = this.taskContext.getStub();		
		final Collector<T> output = new CountingCollector<>(this.taskContext.getOutputCollector(), numRecordsOut);

		if (objectReuseEnabled) {
			......
		} else {
			// for .groupBy(0), ReduceDriver simply reads the first record of the input
			T value = input.next();

			// while (1)
			// iterate over key groups
			while (this.running && value != null) {
				numRecordsIn.inc();
				// set the record as the comparator's reference; its key is the current key
				comparator.setReference(value);
				T res = value;

				// while (2)
				// iterate within a key group, comparing against the current key
				while ((value = input.next()) != null) {
					numRecordsIn.inc();
					if (comparator.equalToReference(value)) {
						// same group, reduce
						res = function.reduce(res, value);
					} else {
						// new key group: leave the inner loop and stop comparing
						break;
					}
				}
				// emit the reduce result
				output.collect(res);
			}
		}
	}  
}
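The nested loops only compare neighbouring records, so they depend on the input already being sorted by key. The small plain-Java sketch below (hypothetical names, strings instead of Flink records, a count instead of a user-defined reduce function) shows the same loop shape on unsorted and on sorted input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AdjacentGroupingSketch {

    /** Same loop shape as the driver: start a group, absorb the following records with an equal key. */
    static List<String> reduceAdjacent(List<String> keys) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < keys.size()) {                 // while (1): one iteration per key group
            String reference = keys.get(i++);
            int count = 1;
            while (i < keys.size() && keys.get(i).equals(reference)) {   // while (2)
                count++;
                i++;
            }
            out.add(reference + "=" + count);     // emit one result per group
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> unsorted = Arrays.asList("a", "b", "a");
        System.out.println(reduceAdjacent(unsorted));   // [a=1, b=1, a=1]

        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);
        System.out.println(reduceAdjacent(sorted));     // [a=2, b=1]
    }
}
```

On unsorted input the key "a" is emitted twice with partial counts; once the input is sorted, each key forms a single run and is emitted exactly once.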
