Apache Flink is an open-source platform for distributed batch and stream data processing, and it has already been promoted to an Apache top-level project. Like Spark, its main strength is running machine-learning algorithms in memory, which makes it very fast, and Flink additionally supports iterative computation natively. A big-data mining engineer should be familiar with both tools.
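To make the iteration support just mentioned concrete, Flink's DataSet API offers bulk iterations through iterate() and closeWith(). The following is only a minimal sketch (the class name IterationSketch and the trivial "add one per round" step are my own illustration, not taken from Flink's bundled examples); it repeats a map step ten times without re-reading the input:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class IterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Start from an initial value and declare at most 10 iterations.
        IterativeDataSet<Long> loop = env.fromElements(0L).iterate(10);

        // The step function is applied once per iteration to the feedback data set.
        DataSet<Long> step = loop.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value + 1;
            }
        });

        // closeWith feeds the step result back into the next iteration and returns the final result.
        DataSet<Long> result = loop.closeWith(step);

        // print() triggers execution and prints the final value (10 after ten rounds).
        result.print();
    }
}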
Flink was open-sourced only recently, so it does not yet attract much attention in China and its codebase is still small, whereas reading Spark's source code is already quite difficult. Studying Flink is therefore also a good way to learn how an excellent distributed framework is built up step by step.
Flink runs on the mainstream platforms Linux, macOS and Windows, although Linux is still the best choice. The preparation before installation is as follows:
Unpack the prebuilt release and change into the extracted directory. For convenience, the rest of this article refers to this directory as FLINK_HOME.
Trying Flink on a single machine is very simple; just run the start script for your platform (the .sh script on Linux/macOS, the .bat script on Windows):
sh bin/start-local.sh
bin\start-local.bat
Wait until a prompt like the following appears:
D:\Java\flink\flink-0.10.1>bin\start-local.bat
Starting Flink job manager. Webinterface by default on http://localhost:8081/.
Don't close this batch window. Stop job manager by pressing Ctrl+C.
Open http://localhost:8081/ in a browser. Flink's web interface listens on port 8081 by default, so make sure no other process is already occupying that port. The following management UI appears:
You can see that this UI follows much the same logic as Spark's web UI: it mainly shows the running jobs and the completed jobs, plus the Task Managers and the Job Manager. The Job Manager coordinates jobs while the Task Managers execute their tasks; the internal logic will be analyzed in detail later.
Now, without further ado, let's run the classic example of every distributed system: WordCount. The README.txt file in FLINK_HOME serves as the sample input. The command and the run on Windows look like this:
D:\Java\flink\flink-0.10.1>bin\flink.bat run .\examples\WordCount.jar file:/D:/Java/flink/flink-0.10.1/README.txt file:/D:/Java/flink/flink-0.10.1/wordcount-result.txt
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
01/15/2016 16:30:51 Job execution switched to status RUNNING.
01/15/2016 16:30:51 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to SCHEDULED
01/15/2016 16:30:51 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to DEPLOYING
01/15/2016 16:30:52 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to RUNNING
01/15/2016 16:30:52 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to SCHEDULED
01/15/2016 16:30:52 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to DEPLOYING
01/15/2016 16:30:52 CHAIN DataSource (at getTextDataSet(WordCount.java:142) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to FINISHED
01/15/2016 16:30:52 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to RUNNING
01/15/2016 16:30:53 DataSink (CsvOutputFormat (path: file:/D:/Java/flink/flink-0.10.1/wordcount-result.txt, delimiter:  ))(1/1) switched to SCHEDULED
01/15/2016 16:30:53 DataSink (CsvOutputFormat (path: file:/D:/Java/flink/flink-0.10.1/wordcount-result.txt, delimiter:  ))(1/1) switched to DEPLOYING
01/15/2016 16:30:53 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to FINISHED
01/15/2016 16:30:53 DataSink (CsvOutputFormat (path: file:/D:/Java/flink/flink-0.10.1/wordcount-result.txt, delimiter:  ))(1/1) switched to RUNNING
01/15/2016 16:30:53 DataSink (CsvOutputFormat (path: file:/D:/Java/flink/flink-0.10.1/wordcount-result.txt, delimiter:  ))(1/1) switched to FINISHED
01/15/2016 16:30:53 Job execution switched to status FINISHED.
As you can see, the log output is very detailed, which makes it easy to follow the whole execution flow. The first entries of the resulting output file wordcount-result.txt are:
1 1
13 1
5d002 1
740 1
about 1
account 1
administration 1
algorithms 1
and 7
another 1
any 2
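The execution log above already reveals the shape of the program: a DataSource reads the text file, a FlatMap emits (word, 1) pairs, a combined Reduce sums them per word, and a DataSink writes the result as delimited text. For reference, a minimal DataSet-API WordCount along these lines could look like the sketch below. This is an illustrative reconstruction, not the exact WordCount.java shipped with the distribution; the class name and the tokenization rule are assumptions.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountSketch {
    public static void main(String[] args) throws Exception {
        // Input and output paths are passed as program arguments, as in the run above.
        final String inputPath = args[0];
        final String outputPath = args[1];

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // DataSource: read the input line by line (the TextInputFormat in the log).
        DataSet<String> text = env.readTextFile(inputPath);

        DataSet<Tuple2<String, Integer>> counts = text
            // FlatMap: split each line into (word, 1) pairs.
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<String, Integer>(word, 1));
                        }
                    }
                }
            })
            // Combine/Reduce: group by the word field and sum the counts (SUM(1) in the log).
            .groupBy(0)
            .sum(1);

        // DataSink: write "word count" pairs, one per line, space-delimited.
        counts.writeAsCsv(outputPath, "\n", " ");

        env.execute("WordCount sketch");
    }
}

Packaged into a jar and submitted with bin/flink run as shown above, a program of this shape produces the same kind of word/count pairs seen in wordcount-result.txt.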
Feel free to follow my WeChat official account; it regularly publishes series of articles on big data, machine learning, Java, Linux and other topics, with no advertising whatsoever, purely out of personal interest.