——————————————
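Copyright notice: This is an original article by the blogger "henyu", released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://i.cnblogs.com/EditPosts.aspx?postid=11430012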
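1. Overview
In the big data wave, technology iterates very quickly. Thanks to open source, big data developers have a rich set of tools available, but that abundance also makes it harder to pick the right one. Solving a big data problem usually involves several technologies, chosen entirely by business requirements: MapReduce for batch processing, Flink for real-time stream processing, Spark SQL for interactive SQL, and so on. Integrating all of these open-source frameworks, tools, libraries, and platforms takes a great deal of work and adds a great deal of complexity, which is a real headache for big data developers. This post introduces one solution for unifying these resources: Apache Beam.
Beam is a unified programming framework that supports both batch and stream processing. A program built with the Beam programming model can run on multiple execution engines (Apache Apex, Apache Flink, Apache Spark, Google Cloud Dataflow, and others).
This post does not discuss Apache Beam's strengths, weaknesses, or prospects. It is meant as a starting point for readers who are new to Beam and do not know how to begin writing code.
There are already many introductions to Apache Beam online, so I will not repeat them here. Interested readers can consult the links below.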
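https://blog.csdn.net/qq_34777600/article/details/87165765 (original author: 一只IT小小鸟)
http://www.javashuo.com/article/p-mpwqqzle-gu.html (by 张海涛, who works on the cloud infrastructure platform at Hikvision (海康威视), is responsible for the infrastructure architecture design and middleware development for cloud computing and big data, and is one of the founders of the Apache Beam Chinese community. To follow the latest Apache Beam news and research, add WeChat cyrjkj to join the group.)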
3. Getting started with the code
Example 1: Reading and writing files with TextIO
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>${beam.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-direct-java -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-direct-java</artifactId>
    <version>${beam.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<!-- This is a Beam artifact, so it uses beam.version (the original used kafka.version here) -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-core-java</artifactId>
    <version>${beam.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
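// Imports used by this method (place them at the top of the enclosing class file)
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

/**
 * Read and write files with TextIO.
 */
public static void TextIo() {
    // Create the pipeline options
    PipelineOptions pipelineOptions = PipelineOptionsFactory.create();
    // Select the runner; DirectRunner executes the pipeline locally
    pipelineOptions.setRunner(DirectRunner.class);
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(pipelineOptions);
    // Read the file contents from the given path
    pipeline.apply(TextIO.read().from("C:\\bigdata\\apache_beam\\src\\main\\resources\\abc"))
        .apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // Split each line on spaces and emit the non-empty words
                for (String word : c.element().split(" ")) {
                    if (!word.isEmpty()) {
                        c.output(word);
                        System.out.println("Word read from file: " + word);
                    }
                }
            }
        }))
        // Count the occurrences of each distinct word
        .apply(Count.<String>perElement())
        .apply("formatResult", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
            @Override
            public String apply(KV<String, Long> input) {
                return input.getKey() + " : " + input.getValue();
            }
        }))
        // Write the results; note that to() takes an output filename prefix
        .apply(TextIO.write().to("C:\\bigdata\\apache_beam\\src\\main\\resources"));
    pipeline.run().waitUntilFinish();
}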
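To make it easier to check what Example 1 should produce, here is a minimal plain-Java sketch of the same three steps (split on spaces, count per element, format as "word : count"). It does not use Beam at all; the class name `WordCountSketch` and its method names are made up for illustration, and the `TreeMap` is only there to give a deterministic order, which Beam itself does not guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the logic in Example 1 (hypothetical helper, not part of Beam).
final class WordCountSketch {

    // Mirrors the "ExtractWords" step: split each line on single spaces, drop empty tokens.
    static List<String> extractWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    // Mirrors Count.perElement(): one (word, occurrences) entry per distinct word.
    // A TreeMap is used only so that this sketch iterates in a stable order.
    static Map<String, Long> countPerElement(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Mirrors the "formatResult" step: render each entry as "word : count".
    static List<String> format(Map<String, Long> counts) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            out.add(e.getKey() + " : " + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello beam", "hello  flink");
        for (String line : format(countPerElement(extractWords(lines)))) {
            System.out.println(line);
        }
    }
}
```

Running the same input through the Beam pipeline should give the same word/count pairs, although they may be spread over several output shards and appear in a different order.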
Example 2: Using Flink as the execution engine with Kafka, processing Kafka data in streaming windows
Add the required dependencies
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>${beam.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-kafka</artifactId>
    <version>${beam.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>${kafka.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.beam/beam-runners-core-java -->
<!-- This is a Beam artifact, so it uses beam.version (the original used kafka.version here) -->
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-core-java</artifactId>
    <version>${beam.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<!-- FlinkPipelineOptions and FlinkRunner come from Beam's Flink runner artifact, which must also be on the classpath; its artifactId depends on the Beam and Flink versions in use. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-runtime_2.11</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-metrics-core</artifactId>
    <version>${flink.version}</version>
    <!--<scope>provided</scope>-->
</dependency>
Core code:
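// Imports used by this method (ImmutableMap is assumed to be Guava's)
import com.google.common.collect.ImmutableMap;
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.joda.time.Duration;

/**
 * Read from and write to Kafka, using FlinkRunner.
 * kafkaBootstrapServers, inputTopic and outputTopic are assumed to be String
 * fields of the enclosing class.
 */
public static void flinkKafka() {
    FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
    // The runner must be set explicitly; otherwise the pipeline runs on the default local runner
    options.setRunner(FlinkRunner.class);
    options.setStreaming(true);
    options.setAppName("app_test");
    options.setJobName("flinkjob");
    // "[local]" is the special value Beam documents for an embedded local Flink (the original passed "local")
    options.setFlinkMaster("[local]");
    options.setParallelism(10);
    // Create the Flink pipeline
    Pipeline pipeline = Pipeline.create(options);
    // KafkaIO<K, V> is typed here as <String, String>; other types can be used as well
    PCollection<KafkaRecord<String, String>> lines = pipeline.apply(KafkaIO.<String, String>read()
        // Address of the Kafka cluster
        .withBootstrapServers(kafkaBootstrapServers)
        // A single topic; for several topics use withTopics(List<String>).
        // The settings are largely the same as for the native Kafka client.
        .withTopic(inputTopic)
        // Key and value deserializers
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // Kafka consumer properties; other properties can be added the same way
        .withConsumerConfigUpdates(ImmutableMap.<String, Object>of("auto.offset.reset", "latest"))
        // Optional settings, enable as needed:
        // .withLogAppendTime()       // use Kafka's log append time as the timestamp (default or custom policies are also possible)
        // .withReadCommitted()       // same as Kafka's "isolation.level" = "read_committed": read only non-transactional
        //                            // messages or committed transactional messages. Stream applications usually process
        //                            // data in several read-process-write stages, each consuming the previous stage's
        //                            // output; read_committed allows exactly-once processing across all stages.
        //                            // Requires Kafka 0.11 or later.
        // .commitOffsetsInFinalize() // commit offsets from Beam instead of relying on Kafka's auto commit
        // .withoutMetadata()         // drop Kafka metadata such as offset and partition info
        // To keep only the values without the keys, apply Values.<String>create() afterwards.
    );
    // Extract the value from each Kafka record
    PCollection<String> kafkadata = lines.apply("Remove Kafka Metadata",
        ParDo.of(new DoFn<KafkaRecord<String, String>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                System.out.println("Kafka record KV: " + c.element().getKV());
                c.output(c.element().getKV().getValue());
            }
        }));
    // Count identical values within fixed 5-second windows
    PCollection<String> wordCount = kafkadata
        .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
        .apply(Count.<String>perElement())
        .apply("ConcatResultKV",
            MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
                // Format the final output (key is the word, value is the count)
                @Override
                public String apply(KV<String, Long> input) {
                    System.out.println("Count: " + input.getKey() + ": " + input.getValue());
                    return input.getKey() + ": " + input.getValue();
                }
            }));
    // Send the processed data back to Kafka
    wordCount.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers(kafkaBootstrapServers)
        .withTopic(outputTopic)
        // No key serializer is needed: the key type is Void and values() writes values only
        .withValueSerializer(StringSerializer.class)
        .values()
    );
    pipeline.run().waitUntilFinish();
}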
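The `FixedWindows.of(Duration.standardSeconds(5))` step in Example 2 places every element into the 5-second window that contains its timestamp. The arithmetic can be sketched in plain Java as below. This is only my understanding of epoch-aligned fixed windows with no offset and non-negative timestamps, not Beam's actual code; `FixedWindowSketch` and its methods are hypothetical names.

```java
// Plain-Java sketch of fixed-window assignment (hypothetical helper, not Beam's implementation).
// Assumes windows aligned to the epoch (offset 0) and non-negative timestamps in milliseconds.
final class FixedWindowSketch {

    // Inclusive start of the window that contains timestampMillis.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Exclusive end of the window that contains timestampMillis.
    static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis;
    }

    public static void main(String[] args) {
        long size = 5_000L; // 5 seconds, as in Example 2
        long[] timestamps = {4_999L, 5_000L, 12_345L};
        for (long ts : timestamps) {
            System.out.println(ts + " -> [" + windowStart(ts, size) + ", " + windowEnd(ts, size) + ")");
        }
    }
}
```

Two records 1 ms apart (4999 and 5000) can therefore land in different windows and be counted separately, which is worth keeping in mind when reading the per-window counts printed by the pipeline.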
Example 3: Using Spark as the runner to read streaming data from Kafka, apply a time window, and write the results back to Kafka
Dependencies: much the same as in Example 2, plus the Spark runner
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-spark</artifactId>
    <version>${beam.version}</version>
</dependency>
Core code:
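// Imports used by this method (ImmutableMap is assumed to be Guava's)
import com.google.common.collect.ImmutableMap;
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.joda.time.Duration;

/**
 * Consume Kafka data with Spark as the runner.
 * kafkaBootstrapServers, inputTopic and outputTopic are assumed to be String
 * fields of the enclosing class.
 */
public static void sparkKafka() {
    // Create the pipeline options
    SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
    // Select the Spark runner explicitly (missing in the original, which would fall back to the default runner)
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("local[*]");
    options.setAppName("spark-beam");
    options.setCheckpointDir("/user/chickpoint16");
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);
    // Read from Kafka
    PCollection<KafkaRecord<String, String>> lines = pipeline.apply(KafkaIO.<String, String>read()
        // Kafka address
        .withBootstrapServers(kafkaBootstrapServers)
        // Topic to read from
        .withTopic(inputTopic)
        // Key and value deserializers
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // Kafka consumer properties; other properties can be added the same way.
        // The original value " latest" had a leading space, which Kafka does not accept.
        .withConsumerConfigUpdates(ImmutableMap.<String, Object>of("auto.offset.reset", "latest"))
    );
    // Split each message into words, then count the words per 5-second window
    PCollection<String> wordcount = lines
        .apply("split data", ParDo.of(new DoFn<KafkaRecord<String, String>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String[] arr = c.element().getKV().getValue().split(" ");
                for (String value : arr) {
                    if (!value.isEmpty()) {
                        c.output(value);
                    }
                }
            }
        }))
        .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
        .apply(Count.<String>perElement())
        .apply("wordcount", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
            @Override
            public String apply(KV<String, Long> input) {
                System.out.println(input.getKey() + " : " + input.getValue());
                System.err.println("===============================================");
                return input.getKey() + " : " + input.getValue();
            }
        }));
    // Prints the PCollection handle only, not its contents
    System.out.println(wordcount);
    // Send the processed data back to Kafka
    wordcount.apply(KafkaIO.<Void, String>write()
        .withBootstrapServers(kafkaBootstrapServers)
        .withTopic(outputTopic)
        .withValueSerializer(StringSerializer.class)
        .values()
    );
    pipeline.run().waitUntilFinish();
}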
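Example 3 splits each message into words and then counts them separately for every 5-second window. As a rough illustration of what "count per window" means, the plain-Java sketch below keeps one counter per (window start, word) pair. It does not use Beam, Kafka, or Spark; the `Event` type, the `"windowStart|word"` key format, and the class name `WindowedCountSketch` are all invented for this sketch, and timestamps are assumed to be non-negative epoch milliseconds.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of per-window word counting (hypothetical helper, not Beam code).
final class WindowedCountSketch {

    // Stand-in for one Kafka record: a timestamp plus the message text.
    static final class Event {
        final long timestampMillis;
        final String value;

        Event(long timestampMillis, String value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Counts words independently in each fixed window; keys have the form "windowStart|word".
    static Map<String, Long> countPerWindow(List<Event> events, long sizeMillis) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event e : events) {
            // Window containing this event (epoch-aligned, offset 0)
            long start = e.timestampMillis - Math.floorMod(e.timestampMillis, sizeMillis);
            for (String word : e.value.split(" ")) {
                if (!word.isEmpty()) {
                    counts.merge(start + "|" + word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(1_000L, "a b"),
            new Event(4_000L, "a"),
            new Event(6_000L, "a"));
        countPerWindow(events, 5_000L)
            .forEach((key, count) -> System.out.println(key + " : " + count));
    }
}
```

The word "a" appears three times overall, but it is reported as 2 in the first window and 1 in the second, which matches the way the pipeline emits one result per word per window rather than a running total.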
Example 4: HBaseIO
Dependencies
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-hbase</artifactId>
    <version>${beam.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>${hbase.version}</version>
</dependency>
Code:
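// Imports used by this method
import java.util.List;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hbase.HBaseIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Read HBase data with Apache Beam's HBaseIO.
 * hbase_clientPort, hbase_zookeeper_quorum, zookeeper_znode_parent and hbase_table
 * are assumed to be String fields of the enclosing class.
 */
public static void getHbaseData() {
    // Alternative: run on Spark instead of the direct runner
    // SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
    // options.setJobName("read mongo");
    // options.setSparkMaster("local[*]");
    // options.setCheckpointDir("/user/chickpoint17");
    PipelineOptions options = PipelineOptionsFactory.create();
    options.setRunner(DirectRunner.class);
    // HBase connection settings
    Configuration config = HBaseConfiguration.create();
    config.set("hbase.zookeeper.property.clientPort", hbase_clientPort);
    config.set("hbase.zookeeper.quorum", hbase_zookeeper_quorum);
    config.set("zookeeper.znode.parent", zookeeper_znode_parent);
    config.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
    config.setInt("hbase.rpc.timeout", 20000);
    config.setInt("hbase.client.operation.timeout", 30000);
    config.setInt("hbase.client.scanner.timeout.period", 2000000);
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);
    // Scan the rows in the key range from "001" to "004"
    PCollection<Result> result = pipeline.apply(HBaseIO.read()
        .withConfiguration(config)
        .withTableId(hbase_table)
        .withKeyRange("001".getBytes(), "004".getBytes())
    );
    PCollection<String> process = result.apply("process", ParDo.of(new DoFn<Result, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String row = Bytes.toString(c.element().getRow());
            List<Cell> cells = c.element().listCells();
            for (Cell cell : cells) {
                // Bytes.toString takes (array, offset, length); the original passed length before offset for the family
                String family = Bytes.toString(cell.getFamilyArray(), cell.getFamilyOffset(), cell.getFamilyLength());
                String column = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
                String value = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
                System.out.println(family);
                c.output(row + "------------------ " + family + " : " + column + " = " + value);
                System.out.println(row + "------------------ " + family + " : " + column + " = " + value);
            }
        }
    }));
    pipeline.run().waitUntilFinish();
}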
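The cell accessors in Example 4 (`getFamilyArray()`, `getFamilyOffset()`, `getFamilyLength()` and so on) return a shared backing array plus an offset and a length, so the order of the last two arguments matters when decoding. The plain-Java sketch below illustrates this with an ordinary byte array. The layout `"001cfnamealice"` is a made-up simplification, not HBase's real cell format, and `CellSliceSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

// Plain-Java sketch of decoding (array, offset, length) slices (hypothetical helper, not HBase code).
final class CellSliceSketch {

    // Decodes `length` bytes starting at `offset` as UTF-8, using the same
    // (array, offset, length) argument order as HBase's Bytes.toString(byte[], int, int).
    static String slice(byte[] backing, int offset, int length) {
        return new String(backing, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simplified, invented layout: row "001", family "cf", qualifier "name", value "alice"
        byte[] backing = "001cfnamealice".getBytes(StandardCharsets.UTF_8);

        System.out.println("row       = " + slice(backing, 0, 3));
        System.out.println("family    = " + slice(backing, 3, 2));
        System.out.println("qualifier = " + slice(backing, 5, 4));
        System.out.println("value     = " + slice(backing, 9, 5));

        // Swapping offset and length for the family reads the wrong bytes
        System.out.println("swapped   = " + slice(backing, 2, 3));
    }
}
```

With offset and length swapped, the family no longer decodes as "cf"; in a real table the result could be garbage or an out-of-bounds exception, depending on the values involved.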
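4. Notes
Apache Beam graduated from the Apache Incubator and became a top-level Apache project in January 2017. Its Java SDK is currently the best supported, while support for Python and other languages is still being filled in, so interested readers are best off learning it with Java.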