
Apache Beam’s Ambitious Goal: Unify Big Data Development


Alex Woodie

Apache Beam Logo

If you’re tired of using multiple technologies to accomplish various big data tasks, you may want to consider Apache Beam, a new distributed processing tool from Google that’s now incubating at the ASF.

One of the challenges of big data development is the need to use lots of different technologies, frameworks, APIs, languages, and software development kits. Depending on what you’re trying to do, and where you’re trying to do it, you may choose MapReduce for batch processing, Apache Spark SQL for interactive queries, Apache Flink for real-time streaming, or a machine learning framework running on the cloud.

While the open source movement has provided an abundance of riches for big data developers, it has increased pressure on the developer to pick “the right” tool for what she is trying to accomplish. This can be a bit overwhelming for those new to big data application development, and it could slow or even hinder adoption of open source tools. (Indeed, the complexity of having to manually stitch everything together is perhaps the most common rallying cry heard from backers of proprietary big data platforms.)

Enter Google (NASDAQ: GOOG). The Web giant is hoping to eliminate some of this second-guessing and painful tool-jumping with Apache Beam, which it’s positioning as a single programming and runtime model that not only unifies development for batch, interactive, and streaming workflows, but also provides a single model for both cloud and on-premise development.

The software is based on the technologies Google uses with its Cloud Dataflow service, which the company launched in 2014 as the second coming of MapReduce for the current generation of distributed data processing challenges. (It’s worth noting that FlumeJava and MillWheel also influenced the Dataflow model.)

Source: beam.incubator.apache.org/presentation-materials/

The open source Apache Beam project essentially is the combination of the Dataflow Software Development Kit (SDK) and the Dataflow model, along with a series of “runners” that extend out to runtime frameworks, namely Apache Spark, Apache Flink, and Cloud Dataflow itself, which Google lets you try out for free and will charge you money to use in production.


Apache Beam provides a unified model for not only designing, but also executing (via runners), a variety of data-oriented workflows, including data processing, data ingestion, and data integration, according to the Apache Beam project page. “Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow,” the project says.


The project, which was originally named Apache Dataflow before taking the Apache Beam moniker, is being championed by Jean-Baptiste Onofré, who is currently an SOA software architect at French data integration toolmaker Talend and works on many Apache Software Foundation projects. Joining Google in the project are data Artisans, which developed and maintains the Beam runner for Flink, and Cloudera, which developed and maintains the runner for Spark. Developers from Cask and PayPal are also involved.


Onofré describes the impetus behind the technology in a recent post to his blog:

“Imagine, you have a Hadoop cluster where you used MapReduce jobs,” he writes. “Now, you want to ‘migrate’ these jobs to Spark: you have to refactore [sic] all your jobs which requires lot of works and cost a lot. And after that, see the effort and cost if you want to change for a new platform like Flink: you have to refactore [sic] your jobs again.

“Dataflow aims to provide an abstraction layer between your code and the execution runtime,” he continues. “The SDK allows you to use an unified programming model: you implement your data processing logic using the Dataflow SDK, the same code will run on different backends. You don’t need to refactore [sic] and change the code anymore!”
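In practice, that abstraction looks something like the following minimal sketch. It uses the package names of the Java SDK as they later settled under the Apache Beam project (org.apache.beam.sdk); the incubating code at the time still carried the Dataflow SDK’s names, and the runner flags shown are illustrative:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerAgnosticPipeline {
      public static void main(String[] args) {
        // The execution backend is picked at launch time, e.g.
        //   --runner=SparkRunner, --runner=FlinkRunner, or --runner=DataflowRunner;
        // the pipeline code itself stays the same.
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        // ... apply the same transforms here, regardless of backend ...

        p.run().waitUntilFinish();
      }
    }

The design point is that the runner is just configuration: swapping Spark for Flink is a command-line change, not a rewrite.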

There are four main constructs in the Apache Beam SDK, according to the Apache Beam proposal posted to the ASF’s website. These constructs, which the sketch after this list ties together, include:

  • Pipelines: the data processing jobs, each made up of a series of computations including input, processing, and output;
  • PCollections: bounded (or unbounded) datasets, which represent the input, intermediate, and output data in pipelines;
  • PTransforms: data processing steps in a pipeline, which take one or more PCollections as input and produce one or more PCollections as output;
  • I/O Sources and Sinks: APIs for reading and writing data, which are the roots and endpoints of the pipeline.
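A hypothetical word-count pipeline shows how the four constructs fit together. The file paths are placeholders, and again the package names are those of the later Apache Beam Java SDK releases rather than the original incubating code:

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MinimalWordCount {
      public static void main(String[] args) {
        // Pipeline: the overall data processing job.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // I/O source: reading the input is the root of the pipeline.
        PCollection<String> lines = p.apply(TextIO.read().from("/tmp/input.txt"));

        // PTransforms: each step takes a PCollection in and hands a PCollection out.
        PCollection<KV<String, Long>> counts =
            lines
                .apply(FlatMapElements.into(TypeDescriptors.strings())
                    .via((String line) -> Arrays.asList(line.split("\\s+"))))
                .apply(Count.perElement());

        // I/O sink: writing the output is the endpoint of the pipeline.
        counts
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
            .apply(TextIO.write().to("/tmp/counts"));

        p.run().waitUntilFinish();
      }
    }

Each apply() call is a PTransform, each intermediate value is a PCollection, TextIO supplies the source and sink, and the Pipeline object carries the whole job to whichever runner is configured.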

“Beam can be used for a variety of streaming or batch data processing goals including ETL, stream analysis, and aggregate computation,” the Beam proposal says. The underlying programming model for Beam provides MapReduce-like parallelism, combined with support for powerful data windowing and fine-grained correctness control.
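To give a flavor of that windowing support, here is a sketch that counts elements per one-minute event-time window; events is assumed to be an unbounded PCollection<String> built elsewhere in the pipeline:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Count each element per one-minute event-time window rather than
    // over the whole (possibly unbounded) stream.
    PCollection<KV<String, Long>> perMinuteCounts =
        events
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply(Count.perElement());

Because windowing is declared on the data rather than baked into the engine, the same windowed aggregation runs over a bounded batch file or a live stream.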

The Evolution of Apache Beam

Source: beam.incubator.apache.org/presentation-materials/

Many of the concepts behind Beam are similar to those found in Spark. However, there are important differences, as Google engineers discussed in a recent article.

“Spark has had a huge and positive impact on the industry thanks to doing a number of things much better than other systems had done before,” the engineers write. “But Dataflow holds distinct advantages in programming model flexibility, power, and expressiveness, particularly in the out-of-order processing and real-time session management arenas…. The fact is: no other massive-scale data parallel programming model provides the depth-of-capability and ease-of-use that Dataflow/Beam does.”

Portability of code is a key feature of Beam. “Beam was designed from the start to provide a portable programming layer,” Onofré and others write in the Beam proposal. “When you define a data processing pipeline with the Beam model, you are creating a job which is capable of being processed by any number of Beam processing engines.”

Beam’s Java-based SDK is currently available at GitHub (as well as on Stack Overflow), and a second SDK for Python is currently in the works. The developers have an ambitious set of goals, including creating additional Beam runners (Apache Storm and MapReduce are possible contenders), as well as support for additional programming languages.

Beam developers note that the project is also closely related to Apache Crunch, a Java-based framework for Hadoop and Spark that simplifies the programming of data pipelines for common tasks such as joining and aggregations, which are tedious to implement in MapReduce.

Google announced in January that it wanted to donate Dataflow to the ASF, and the ASF accepted the proposal in early February, when it was renamed Apache Beam. The project, which is in the process of moving from GitHub to Apache, is currently incubating.

“In the long term, we believe Beam can be a powerful abstraction layer for data processing,” the Beam proposal says. “By providing an abstraction layer for data pipelines and processing, data workflows can be increasingly portable, resilient to breaking changes in tooling, and compatible across many execution engines, runtimes, and open source projects.”

Related Items:

Apache Flink Creators Get $6M to Simplify Stream Processing

Google Releases Cloud Processor For Hadoop, Spark

Google Reimagines MapReduce, Launches Dataflow
