Spark 1.5 Preview Available in Databricks

Date: 2019-11-10

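We are excited to announce that, starting today, a preview of Apache Spark 1.5.0 is available in Databricks. Our users can now choose to provision clusters with Spark 1.5 or with previous Spark versions, ready to go in a few clicks.

Officially, Spark 1.5 is expected to be released within a few weeks, and the community is currently QA testing the release. Given Spark's fast pace of development, we feel it is important to let our users develop against and take advantage of new features as early as possible. With traditional on-premises software deployment, it can take months, or even years, to receive software updates from a vendor. With Databricks' cloud model, we can roll out updates within hours and let users try the Spark version of their choice.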

What’s New?

The last few releases of Spark focused on making data science more accessible, through high-level programming APIs such as DataFrames, machine learning pipelines, and R language support. A large part of Spark 1.5, on the other hand, focuses on under-the-hood changes to improve Spark’s performance, usability, and operational stability.

Spark 1.5 delivers the first phase of Project Tungsten, a new execution backend for DataFrames/SQL. Through code generation and cache-aware algorithms, Project Tungsten improves the runtime performance with out-of-the-box configurations. Through explicit memory management and external operations, the new backend also mitigates the inefficiency in JVM garbage collection and improves robustness in large-scale workloads.
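
To make the idea of explicit memory management a little more concrete, here is a minimal Python sketch. It is not Spark's implementation: it assumes a hypothetical fixed-width row layout (one 64-bit integer key and one 64-bit float value) and keeps all rows in a single contiguous buffer instead of one object per record, so the runtime tracks one allocation rather than many small ones. The names `ROW`, `pack_rows`, and `sum_values` are invented for illustration.

```python
import struct

# Hypothetical fixed-width row layout: int64 key followed by float64 value.
# This is an illustrative format, not Tungsten's real binary row encoding.
ROW = struct.Struct("<qd")


def pack_rows(rows):
    """Copy (key, value) pairs into one contiguous buffer."""
    buf = bytearray(ROW.size * len(rows))
    for i, (key, value) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, key, value)
    return buf


def sum_values(buf):
    """Scan the buffer by offset and sum the value column."""
    total = 0.0
    for offset in range(0, len(buf), ROW.size):
        _key, value = ROW.unpack_from(buf, offset)
        total += value
    return total


rows = [(1, 0.5), (2, 1.5), (3, 2.0)]
buf = pack_rows(rows)
print(len(buf), sum_values(buf))
```

Reading fields by offset from a flat buffer is the general pattern; the actual on-heap and off-heap layouts used by the new backend are more involved.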

Over the next few weeks, we will be writing about Project Tungsten. To give you a sneak peek, the above chart compares the out-of-the-box (i.e. no configuration changes) performance of an aggregation query (16 million records and 1 million composite keys) using Spark 1.4 and Spark 1.5 on my laptop.
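
For a feel of the shape of that benchmark, the following sketch runs the same kind of aggregation at a much smaller scale, in plain Python rather than Spark. The key structure (a pair of integers), the record counts, and the helper names are assumptions chosen for illustration; the post does not give the query text.

```python
import random
from collections import defaultdict


def make_records(n_records, n_keys, seed=0):
    """Generate (composite_key, value) records; keys are (a, b) pairs."""
    rng = random.Random(seed)
    side = int(n_keys ** 0.5)  # at most side * side distinct composite keys
    return [((rng.randrange(side), rng.randrange(side)), 1)
            for _ in range(n_records)]


def aggregate(records):
    """Group by composite key and sum the values, like GROUP BY a, b."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return totals


# Scaled-down stand-in for 16 million records over 1 million keys.
records = make_records(n_records=16_000, n_keys=1_024)
totals = aggregate(records)
print(len(totals), sum(totals.values()))
```

The hash map keyed by a composite value is the part that stresses memory and cache behavior at full scale, which is where the new backend is aimed.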

Streaming workloads typically run 24/7 and have stringent stability requirements. In this release, Typesafe has introduced Backpressure in Spark Streaming. With this feature, Spark Streaming can dynamically control the data ingest rates to adapt to unpredictable variations in processing load. This allows streaming applications to be more robust against bursty workloads and downstream delays.
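
The feedback loop behind backpressure can be sketched in a few lines of Python. This is a simplified proportional controller under stated assumptions, not the algorithm Spark ships: the function `next_rate` and its parameters are hypothetical. After each batch it compares how long processing took with the batch interval and scales the allowed ingest rate accordingly.

```python
def next_rate(current_rate, batch_interval_s, processing_time_s,
              min_rate=100.0):
    """Return the ingest rate (records/s) to allow for the next batch.

    Hypothetical helper: scale the rate by how far processing time
    deviates from the batch interval, never going below min_rate.
    """
    if processing_time_s <= 0:
        return current_rate  # no measurement yet, keep the current rate
    # A ratio below 1 means the batch took longer than the interval.
    ratio = batch_interval_s / processing_time_s
    return max(min_rate, current_rate * ratio)


# Simulate a burst: processing slows down, then recovers.
rate = 10_000.0
history = []
for processing_time in [1.0, 2.0, 4.0, 1.0, 0.5]:
    rate = next_rate(rate, batch_interval_s=1.0,
                     processing_time_s=processing_time)
    history.append(rate)
print(history)
```

In Spark itself the feature is off by default and is switched on through the `spark.streaming.backpressure.enabled` configuration property; the sketch only shows why a rate that follows processing time keeps a slow batch from piling up unbounded input.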

Of course, Spark 1.5 is the work of more than 220 open source contributors from over 80 organizations, and includes a lot more than the above two. Some examples include:

New machine learning algorithms: multilayer perceptron classifier, PrefixSpan for sequential pattern mining, association rule generation, etc.
Improved R language support and GLMs with R formula.
Better instrumentation and reporting of memory usage in web UI.

Stay tuned for future blog posts covering the release as well as deep dives into specific improvements.

How do I use it?

Launching a Spark 1.5 cluster is as easy as selecting the Spark 1.5 experimental version in the cluster creation interface in Databricks.

Once you hit confirm, you will get a Spark cluster ready to go with Spark 1.5.0 and start testing the new release. Multiple Spark version support in Databricks also enables users to run Spark 1.5 canary clusters side-by-side with existing production Spark clusters.

You can find the work-in-progress documentation for Spark 1.5.0 here. Please be aware that just like any other preview software, Spark 1.5.0 support is experimental. There will be bugs and quirks that we find and fix in the next couple of weeks. The good news is that you don’t have to worry about following the development or upgrading yourself. As we discover and fix bugs in the open source project, the Spark 1.5 option in Databricks will also be updated automatically. If you encounter a bug, please report it by filing a JIRA ticket.

To try Databricks, sign up for a free 30-day trial.

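At the last Beijing Spark meetup, a Spark committer said they were busy with Spark 1.5, where the core work is Tungsten, a new execution backend for DataFrames/SQL. Through code generation and cache-aware algorithms, Project Tungsten improves runtime performance with out-of-the-box configurations. Through explicit memory management and external operations, the new backend also mitigates the inefficiency of JVM garbage collection and improves robustness in large-scale workloads.

As things stand, the first phase of Spark 1.5 is complete. There will probably be plenty of optimizations and code fixes later on, but you can already get a taste of it. If you want to read the 1.5 code, look at the Spark 1.5 branch on GitHub. Personally, I feel the main gain is in Spark SQL: most companies run Spark on YARN, and Spark SQL is where most jobs are hoping to see improvement.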