Apache beam其余学习记录

时间 2019-11-11

原文原文链接

Combine与GroupByKey

GroupByKey是把相关key的元素聚合到一块儿，一般是造成一个Iterable的value，如：安全

cat, [1,5,9]
dog, [5,2]
and, [1,2,6]

Combine是对聚合后的Iterable进行处理（如求和，求均值），返回一个结果。内置的Combine.perKey（）方法实际上是GroupByKey和Combine的结合，先聚合和处理。
Beam中还有许多内置的处理类，好比Sum.integersPerKey（），Count.perElement()等
在全局窗口下，对于空输入，Combine操做后通常会返回默认值（好比Sum的默认返回值为0），若是设置了.withoutDefault（），则返回空的PCollection。
在非全局窗口下，用户必须指明空输入时的返回类型，若是Combine的输出结果要做为下一级处理的输入，通常设置为.asSingletonView（），表示返回默认值，这样即便空窗口也有默认值返回，保证了窗口的数量不变；若是设置了.withoutDefault(),则空的窗口返回空PCollection，通常做为最后的输出结果。app

Platten与Patition

用于PCollection与PCollectionList的转换。
官方文档给的Platten代码很容易理解：ide

// Flatten takes a PCollectionList of PCollection objects of a given type.
// Returns a single PCollection that contains all of the elements in the PCollection objects in that list.
PCollection<String> pc1 = ...;
PCollection<String> pc2 = ...;
PCollection<String> pc3 = ...;
PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3);

PCollection<String> merged = collections.apply(Flatten.<String>pCollections());

将一个PCollectionList={ PCollection{String1}， PCollection{String2}， PCollection{String3} }转换为一个PCollection={String1， String2， String3}.
而Patition恰好反过来，要将PCollection转换为PCollectionList须要指明分红的list长度以及如何划分，所以须要传递划分长度size和划分方法Fn。编码

// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
        public int partitionFor(Student student, int numPartitions) {
            return student.getPercentile()  // 0..99
                 * numPartitions / 100;
        }}));

其中partitionFor（）方法返回的是在PCollectionList中的位置下标。线程

Side Input

不能使用硬编码数据，一般是转换中间产生的数据。通常用于跟主输入数据进行比较，所以要求Side Input数据的窗口要与主输入数据的窗口尽可能一致，若是不一致，Beam会尽量地从Side Input中找到合适的位置的数据进行比较。对于设置了多个触发器的Side Input，自动选择最后一个触发的结算结果。code

附属输出数据 Additional Outputs

这一部分官方的代码已经写得很清楚，看代码便可。对象

数据编码

在Pipeline的数据处理过程当中常常须要对数据元素进行字节转换，所以须要制定字节转换的编码格式。对于绝大部分类型的数据，Beam都提供了默认的编码类型，用户也能够经过SetCoder指定编码类型。
1）从内存读取的输入数据通常要求用户指定其编码类型；
2）用户自定义的类对象通常要求用户指定其编码类型，或者能够在类定义上使用@DefaultCoder（AvroCoder.class)指定默认编码类型。ip

其余：

Beam不是线程安全的，通常建议处理方法是幂等的。内存