Distributed Computing Study Notes


 

1. Overview

1.1 Kafka

Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit log service.

About Kafka:

1. A distributed message publish-subscribe system;

2. Designed to handle real-time streaming data;

3. Originally developed at LinkedIn, now part of Apache;

4. Does not follow the JMS standard and does not use the JMS API;

5. Maintains partitioned feeds of messages in topics (Kafka maintains feeds of messages in topics).

Kafka is a message broker with the following characteristics:

A. It focuses on high throughput rather than on other features;

B. It targets real-time scenarios;

C. The state of message consumption is maintained on the Consumer side, not by the Kafka server;

D. It is distributed: Producers, Brokers, and Consumers can all be spread across multiple machines.

 

 

1.1.1 Introduction

Apache Kafka™ is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:

  • It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  • It lets you store streams of records in a fault-tolerant way.
  • It lets you process streams of records as they occur.

What is Kafka good for?

It gets used for two broad classes of application:

  • Building real-time streaming data pipelines that reliably get data between systems or applications
  • Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First a few concepts:

  • Kafka is run as a cluster on one or more servers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.

Kafka has four core APIs:

  1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  3. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  4. The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions.
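
As a concrete illustration of the Producer API, here is a minimal Java sketch that publishes a few keyed records to a topic. The broker address, topic name, and the kafka-clients dependency are assumptions made for the example, not anything prescribed by the text above.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10; i++) {
                    // Records with the same key land in the same partition,
                    // so per-key ordering is preserved.
                    producer.send(new ProducerRecord<>("my-topic", "key-" + (i % 3), "value-" + i));
                }
            }
        }
    }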

1.1.2 Topics and Logs

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

 

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

 

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
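
A minimal consumer sketch (assuming the same local broker and topic, and a recent kafka-clients version) shows that the position really is controlled by the client: after joining its group, the consumer rewinds to the beginning of its assigned partitions and re-reads everything, which is exactly the kind of cheap replay or "tail" behaviour described above.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "replay-group");               // the consumer group name
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                // Poll until partitions have been assigned, then reset the offset.
                while (consumer.assignment().isEmpty()) {
                    consumer.poll(Duration.ofMillis(100));
                }
                consumer.seekToBeginning(consumer.assignment());
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                r.partition(), r.offset(), r.key(), r.value());
                    }
                }
            }
        }
    }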

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

1.1.3 Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

1.1.4 Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!

1.1.5 Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

 

A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

1.2 Storm

Storm is a real-time processing system; its typical usage pattern is pull-based consumption: consumers pull the data and run the computation. Storm provides storm-kafka, so it can read data directly from Kafka through Kafka's low-level API.

Storm provides the following:

1. At-least-once message processing;

2. Fault tolerance;

3. Horizontal scalability;

4. No intermediate queues;

5. Lower operational overhead;

6. Doing just the work that is needed ("just works");

7. Coverage of a wide range of use cases, such as stream processing, continuous computation, and distributed RPC.

Storm architecture diagram:

 

A Storm topology of work tasks:

 

1.2.1 Task Scheduling and Load Balancing

 

  1. Nimbus calls a worker that is available for work a worker-slot.
  2. Nimbus is the control center of the whole cluster. It is responsible for topology submission, runtime monitoring, load balancing, task reassignment, and so on. The assignment Nimbus hands out contains the path to the topology code (local to Nimbus) together with task, executor, and worker information; a worker is uniquely identified by node + port.
  3. The supervisor carries out the actual "synchronize workers" operations. One supervisor is one node. Synchronizing workers means responding to Nimbus's scheduling and assignments by creating, scheduling, and destroying workers; the supervisor downloads the topology code from Nimbus to the local machine in order to run the assigned tasks.
  4. The assignment information contains the task-to-worker mapping (task -> node + port), so a worker node can use it to work out which remote machines to communicate with.

 

1.2.2 Worker

A worker executes a subset of a topology; it may run one or more executors for one or more components.

1.2.3 Executor

An executor is a thread spawned by a worker.

1.2.4 Task

A task is the actual unit of data processing.
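
To make workers, executors, and tasks concrete, here is a sketch of a classic word-count topology (package names follow Storm 1.x/2.x, i.e. org.apache.storm; the spout and bolts are toy implementations). setNumWorkers controls the number of worker processes, the parallelism hint on each component controls the number of executors, and setNumTasks controls the number of tasks spread over those executors.

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class WordCountTopology {

        // Spout: emits a fixed sentence once a second (stand-in for a real data source).
        public static class SentenceSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;
            }
            public void nextTuple() {
                Utils.sleep(1000);
                collector.emit(new Values("the quick brown fox"));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("sentence"));
            }
        }

        // Bolt: splits each sentence into words.
        public static class SplitBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                for (String w : input.getStringByField("sentence").split(" ")) {
                    collector.emit(new Values(w));
                }
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("word"));
            }
        }

        // Bolt: keeps a per-task running count.
        public static class CountBolt extends BaseBasicBolt {
            private final java.util.HashMap<String, Integer> counts = new java.util.HashMap<>();
            public void execute(Tuple input, BasicOutputCollector collector) {
                String w = input.getStringByField("word");
                counts.merge(w, 1, Integer::sum);
                System.out.println(w + " -> " + counts.get(w));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 2);            // 2 executors
            builder.setBolt("split", new SplitBolt(), 4).setNumTasks(8)       // 4 executors running 8 tasks
                   .shuffleGrouping("sentences");
            builder.setBolt("count", new CountBolt(), 4)
                   .fieldsGrouping("split", new Fields("word"));              // same word -> same task

            Config conf = new Config();
            conf.setNumWorkers(2);                                            // 2 worker processes (worker-slots)
            LocalCluster cluster = new LocalCluster();                        // in-process cluster for testing
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Utils.sleep(30000);
            cluster.shutdown();
        }
    }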

1.3 Flume

Flume is a distributed, reliable, and highly available system from Cloudera for collecting, aggregating, and transporting large volumes of log data. It supports custom data senders in the logging pipeline for data collection, and it can also perform simple processing on the data and write it to a variety of data receivers.

1.4 Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation.

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it provides high-throughput access to application data and is suited to applications with very large data sets. HDFS relaxes some POSIX requirements and allows streaming access to the data in the file system.

The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over it.
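
The classic word-count job is a compact way to see this division of labour: input and output live on HDFS, while the Mapper and Reducer supply the computation. The paths below are placeholders; the code follows the standard org.apache.hadoop.mapreduce API.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/input"));      // read from HDFS (placeholder path)
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));   // write results back to HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }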

1.5 Spark

Apache Spark is a fast and general engine for large-scale data processing.

 

Spark Streaming is part of Spark's core API. It supports high-throughput, fault-tolerant processing of real-time data streams.

It can ingest data from sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, process the data with simple API operations such as map, reduce, join, and window, and can also feed the data directly into the built-in machine learning and graph algorithm packages.

Its processing flow works as shown in the figure: after receiving real-time data, it splits the data into batches, hands them to the Spark engine for processing, and produces the result for each batch. The stream abstraction it supports is called a DStream, with direct support for Kafka and Flume sources; a DStream is a continuous sequence of RDDs.
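
A minimal Spark Streaming sketch in Java (Spark 2.x API assumed), reading from a TCP socket source on localhost:9999: records are grouped into one-second batches, each batch becomes an RDD in the DStream, and ordinary map-style operations are applied to it.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
            // One-second batch interval: the stream is cut into per-second RDDs.
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaPairDStream<String, Integer> counts = words
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum);
            counts.print();   // print the result of each one-second batch

            jssc.start();
            jssc.awaitTermination();
        }
    }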

 

1.5.1 Shark

Shark (Hive on Spark): Shark essentially provides the same HiveQL command interface as Hive on top of the Spark framework. To stay as compatible with Hive as possible, Shark uses Hive's API for query parsing and logical plan generation, while the final physical plan execution stage uses Spark instead of Hadoop MapReduce. By configuring Shark parameters, Shark can automatically cache specific RDDs in memory so that data is reused and retrieval of particular data sets is faster. Shark also supports user-defined functions (UDFs) for custom data-analysis and learning algorithms, so SQL querying and analytic computation can be combined and RDD reuse maximized.

1.5.2 Spark Streaming

Spark Streaming is a framework built on Spark for processing stream data. The basic idea is to split the stream into small time slices (a few seconds) and process these small chunks in a batch-like way. Spark Streaming is built on Spark because, on the one hand, Spark's low-latency execution engine (on the order of 100 ms) can be used for near-real-time computation, and on the other hand, compared with record-at-a-time frameworks such as Storm, RDD data sets make efficient fault tolerance easier. In addition, micro-batching makes it possible to share logic and algorithms between batch and real-time processing, which is convenient for applications that need to analyze historical and real-time data together.

1.5.3 Bagel

Bagel: Pregel on Spark, which makes it possible to do graph computation with Spark. It is a very useful small project; Bagel ships with an example that implements Google's PageRank algorithm.

1.6 Solr

Solr is a standalone enterprise search application server that exposes a Web-service-like API. Users can submit XML files in a prescribed format to the search engine server over HTTP to build an index, and can issue lookup requests through HTTP GET operations and receive results in XML.

Solr is a high-performance full-text search server developed in Java 5 and based on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and provides a complete administration UI; it is an excellent full-text search engine.
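
A sketch of the HTTP interface described above, using the HTTP client built into Java 11: an XML document is POSTed to the update handler to build the index, and a GET query returns results as XML. The core name "mycore" and the local URL are assumptions for the example.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SolrHttpExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            String solr = "http://localhost:8983/solr/mycore";   // assumed Solr core

            // Index a document by POSTing XML to the update handler; commit=true makes it searchable.
            String doc = "<add><doc>"
                       + "<field name=\"id\">1</field>"
                       + "<field name=\"title\">Distributed computing notes</field>"
                       + "</doc></add>";
            HttpRequest update = HttpRequest.newBuilder()
                    .uri(URI.create(solr + "/update?commit=true"))
                    .header("Content-Type", "text/xml")
                    .POST(HttpRequest.BodyPublishers.ofString(doc))
                    .build();
            client.send(update, HttpResponse.BodyHandlers.ofString());

            // Query over HTTP GET and ask for the response in XML.
            HttpRequest query = HttpRequest.newBuilder()
                    .uri(URI.create(solr + "/select?q=title:distributed&wt=xml"))
                    .GET()
                    .build();
            HttpResponse<String> result = client.send(query, HttpResponse.BodyHandlers.ofString());
            System.out.println(result.body());
        }
    }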

1.7 MongoDB

MongoDB is a database built on distributed file storage, written in C++. It aims to provide scalable, high-performance data storage for web applications.

MongoDB sits between relational and non-relational databases; among non-relational databases it is the most feature-rich and the most like a relational database. The data structures it supports are very loose, stored in a JSON-like format (BSON), so it can hold fairly complex data types. Mongo's biggest strength is its very powerful query language, whose syntax somewhat resembles an object-oriented query language: it can express most of what a single-table query in a relational database can, and it also supports indexes on the data.
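
A small sketch with the MongoDB Java sync driver (3.7+ API assumed) showing an insert, a query, and an index against a local mongod on the default port; the database, collection, and field names are placeholders.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import org.bson.Document;

    public class MongoExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("notes");
                MongoCollection<Document> users = db.getCollection("users");

                // Documents are loosely structured BSON; nested fields are fine.
                users.insertOne(new Document("name", "alice")
                        .append("age", 30)
                        .append("address", new Document("city", "Beijing")));

                // Query, roughly the equivalent of a single-table SELECT with a WHERE clause.
                Document found = users.find(Filters.eq("name", "alice")).first();
                System.out.println(found);

                // Secondary index on a field.
                users.createIndex(Indexes.ascending("age"));
            }
        }
    }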

1.8 Mesos

Mesos is an open-source distributed resource management framework under Apache and has been called the kernel of a distributed system. Mesos was originally developed at UC Berkeley's AMPLab and later saw wide use at Twitter.

Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for resource management and scheduling across entire datacenter and cloud environments.

1.9 HBase

HBase is a distributed, column-oriented, open-source database based on Fay Chang's Google paper "Bigtable: A Distributed Storage System for Structured Data". Just as Bigtable builds on the distributed storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Apache Hadoop. Unlike a typical relational database, it is suited to storing unstructured data, and it uses a column-based rather than row-based model.
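
A brief sketch with the HBase Java client (1.x/2.x API), assuming a table "user" with a column family "info" already exists: a cell is addressed by row key, column family, and column qualifier rather than by relational rows and columns.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml (ZooKeeper quorum etc.)
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user"))) {

                // Write one cell: row key "row1", column family "info", qualifier "name".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
                table.put(put);

                // Read it back by row key.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }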

 

1.10 Cassandra

Cassandra's data model is a four- or five-dimensional model based on column families. It borrows its data structures and features from Amazon's Dynamo and Google's Bigtable, and stores data using Memtables and SSTables. Before writing data, Cassandra first records it in a commit log; the data is then written to the Memtable of the corresponding column family. A Memtable is an in-memory structure that keeps data sorted by key; when certain conditions are met, the Memtable's contents are flushed to disk in bulk and stored as an SSTable.

1.11 ZooKeeper

ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.

ZooKeeper's goal is to package these complex and error-prone key services and expose a simple, easy-to-use interface on top of an efficient, stable system.

ZooKeeper is based on the Fast Paxos algorithm. Plain Paxos can livelock: when multiple proposers submit proposals in an interleaved way, they may keep preempting one another so that no proposer ever succeeds. Fast Paxos optimizes this by electing a leader, and only the leader may submit proposals.

ZooKeeper's basic operating flow:

1. Elect a Leader;

2. Synchronize data;

3. The Leader must have the highest execution ID, similar to root privileges;

4. A majority of the machines in the cluster must respond and follow the elected Leader.
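
A small sketch with the ZooKeeper Java client illustrating the kind of coordination primitive these services are built on: an ephemeral, sequential znode that disappears when its session ends, which is the usual building block for group membership and leader election. The ensemble address and the /workers parent path are assumptions.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkMembership {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Ephemeral + sequential: removed automatically when this session dies,
            // and the sequence suffix gives a natural ordering for leader election.
            // Assumes the parent znode /workers already exists.
            String path = zk.create("/workers/worker-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            System.out.println("Registered as " + path);

            // Group membership is simply the list of children of the parent node.
            List<String> members = zk.getChildren("/workers", false);
            System.out.println("Current members: " + members);

            Thread.sleep(10000);   // keep the session (and the ephemeral node) alive briefly
            zk.close();
        }
    }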

 

1.12 YARN

YARN is a completely rewritten Hadoop cluster architecture. It is the next-generation Hadoop resource manager: with YARN, users can run and manage multiple kinds of jobs on the same physical cluster, for example MapReduce batch jobs and graph-processing jobs. This not only consolidates the number of systems an organization has to manage, but also allows different kinds of analysis to be performed on the same data.

Compared with the classic MapReduce engine in the first version of Hadoop, YARN offers clear advantages in scalability, efficiency, and flexibility. Both small and large Hadoop clusters benefit greatly from it. For end users (developers, not administrators) the changes are almost invisible, because unmodified MapReduce jobs can still be run with the same MapReduce APIs and CLI.

 

2. Architecture

2.1 Spark

Spark is a general-purpose parallel computing framework in the style of Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab. Spark implements distributed computation based on the map-reduce paradigm and has the advantages of Hadoop MapReduce; but unlike MapReduce, intermediate job output and results can be kept in memory, so reading and writing HDFS between steps is no longer necessary. This makes Spark better suited to iterative map-reduce style algorithms such as those used in data mining and machine learning.
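
A short RDD sketch in Java that shows the point about memory: the parsed data set is cached once and reused across iterations instead of being re-read from HDFS each time. The input path and the trivial "iterative update" are placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterativeExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("IterativeExample");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Load and parse once, then keep the RDD in memory for reuse.
                JavaRDD<Double> points = sc.textFile("hdfs:///data/points.txt")   // placeholder path
                        .map(Double::parseDouble)
                        .cache();

                double estimate = 0.0;
                for (int i = 0; i < 10; i++) {
                    // Each pass reuses the cached RDD instead of re-reading HDFS.
                    double mean = points.reduce(Double::sum) / points.count();
                    estimate = (estimate + mean) / 2.0;   // stand-in for a real iterative update
                }
                System.out.println("estimate = " + estimate);
            }
        }
    }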

 

2.2 Spark Streaming

Spark Streaming decomposes stream computation into a series of short batch jobs. The batch engine is Spark itself: Spark Streaming divides the input data into segments according to the batch size (for example, one second) to form a Discretized Stream; each segment is converted into an RDD (Resilient Distributed Dataset) in Spark, and the transformations applied to the DStream in Spark Streaming become transformations on RDDs in Spark, with the intermediate results of those operations kept in memory. Depending on the needs of the application, the streaming computation can accumulate intermediate results or write them out to external storage.

 

2.3 Flume + Kafka + Storm

Messages enter the Kafka message broker in various ways; for example, Flume can be used to collect log data, which is then routed and buffered in Kafka and analyzed in real time by a Storm program. In that case we need a Storm spout to read the messages from Kafka and hand them to the specific processing components (bolts) for analysis.

In the Flume + Kafka + Storm pipeline, Flume acts as the message producer: the data it produces (log data, business request data, and so on) is published to Kafka, and a Storm topology, acting as the consumer, subscribes to it and processes it in the Storm cluster.

There are two common processing approaches:

1. Use a Storm topology to analyze and process the data in real time directly;

2. Integrate Storm with HDFS and write the processed messages to HDFS for offline analysis.

The Kafka cluster must ensure that every broker ID is unique across the whole cluster.

The Storm cluster also depends on a ZooKeeper cluster, so the ZooKeeper cluster must be kept running properly.
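
A sketch of wiring Kafka into a Storm topology with the older storm-kafka spout (package names as in Storm 1.x; newer releases replace this with storm-kafka-client). The ZooKeeper address, topic name, and the trivial printing bolt are assumptions for the example.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.BrokerHosts;
    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.StringScheme;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.spout.SchemeAsMultiScheme;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class KafkaStormTopology {

        // Stand-in for real processing: just print each Kafka message.
        public static class PrinterBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                System.out.println(input.getString(0));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) throws Exception {
            // The spout finds the Kafka brokers through the ZooKeeper ensemble Kafka registers in.
            BrokerHosts hosts = new ZkHosts("localhost:2181");
            SpoutConfig spoutConfig = new SpoutConfig(hosts, "log-topic", "/kafka-spout", "log-reader");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());   // deserialize values as strings

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 3);
            builder.setBolt("process", new PrinterBolt(), 4)
                   .shuffleGrouping("kafka-spout");

            Config conf = new Config();
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("flume-kafka-storm", conf, builder.createTopology());
        }
    }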

 

 

 

 

2.4 Kafka + Storm Use Cases

1. The need to scale, now and in the future

Kafka + Storm makes it easy to scale out the topology; scaling is limited only by hardware.

2. Fast delivery

Open-source software evolves quickly and has community support. With microservices, you only need to build what you actually must.

3. Risk

The topology is not yet mature, but there is no real alternative.

 

 

3. Comparison

3.1 Storm vs. Spark Streaming

Storm and Spark Streaming are both open-source frameworks for distributed stream processing.

3.1.1 Processing Model and Latency

Storm processes incoming events one at a time, while Spark Streaming processes the stream of events that falls within a time window. Storm can therefore process an event with sub-second latency, whereas Spark Streaming has a latency of several seconds.

3.1.2 Fault Tolerance and Data Guarantees

The trade-off in fault tolerance and data guarantees is that Spark Streaming provides better support for fault-tolerant stateful computation, while in Storm each individual record must be tracked as it moves through the system, so Storm only guarantees that each record is processed at least once and allows duplicate records when recovering from a failure. That means mutable state may be incorrectly updated twice.

In short, if you need sub-second latency with no data loss, Storm is a good choice. If you need stateful computation with a strict guarantee that every event is processed exactly once, Spark Streaming is better. Spark Streaming's programming logic may also be easier, because it resembles batch programming on Hadoop.

3.1.3 Implementation and Programming API

Storm was initially implemented in Clojure, while Spark Streaming is written in Scala. This is worth knowing if you want to read the code or write your own customizations to see how each system works. Storm was developed at BackType and Twitter; Spark Streaming was developed at UC Berkeley.

Spark is built on the idea that when the data is huge, it is more efficient to bring the computation to the data than to bring the data to the computation: each node stores (or caches) its data set, and tasks are submitted to the nodes.

Storm's architecture is the opposite of Spark's. Storm is a distributed stream computation engine: each node implements a basic processing step, and data items flow in and out of a network of interconnected nodes. In contrast to Spark, this brings the data to the computation. Both frameworks are used for parallel computation over large amounts of data, but Storm is better at dynamically processing newly generated "small chunks" of data. It is not clear which approach has the advantage in throughput, but Storm has lower computation latency.

3.2 Storm vs. Hadoop

Hadoop performs disk-level computation: at computation time the data sits on disk and must be read from and written to disk. Storm performs memory-level computation: data is brought directly into memory over the network.

Hadoop                         Storm
Batch processing               Real-time processing
Jobs run to completion         Topologies run forever
Stateful nodes                 Stateless nodes
Scalable                       Scalable
Guarantees no data loss        Guarantees no data loss
Open source                    Open source
Big batch processing           Fast, reactive, real-time processing

3.3 Spark vs. Hadoop

Spark keeps intermediate data in memory, which makes iterative computation more efficient.

Spark is better suited to ML and data-mining workloads that involve many iterations, because Spark provides the RDD abstraction.

3.3.1 Spark Is More General than Hadoop

Spark offers many kinds of data set operations, unlike Hadoop, which provides only Map and Reduce. Transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy, and there are also actions such as count, collect, reduce, lookup, and save.

These varied data set operations are convenient for users developing higher-level applications. The communication model between processing nodes is no longer limited to Hadoop's single data-shuffle pattern; users can name, materialize, and control the storage and partitioning of intermediate results. The programming model is therefore more flexible than Hadoop's.

However, because of the nature of RDDs, Spark is not suitable for applications that make asynchronous, fine-grained updates to state, such as the storage layer of a web service or an incremental web crawler and indexer; in other words, it is not a good fit for incrementally updated application models.

3.3.2 Fault Tolerance

For distributed data set computation, Spark achieves fault tolerance through checkpoints. There are two approaches: checkpointing the data and logging the updates. Users can choose which approach to use.

3.3.3 Usability

Spark improves usability by providing rich Scala, Java, and Python APIs and an interactive shell.

3.4 Spark vs. Hadoop vs. Storm

3.4.1 Similarities

1. All are open-source data-processing frameworks;

2. All can be used for real-time business intelligence and big-data analytics;

3. Because they are relatively simple to work with, they are commonly used for big-data processing;

4. All are JVM-based implementations, written in languages such as Java, Scala, and Clojure.

3.4.2 Differences

Data processing model

Hadoop MapReduce is best suited to batch processing. Applications that require real-time processing of smaller amounts of data need other open-source platforms such as Impala or Storm. Apache Spark is designed for more than plain data processing: it can use existing machine-learning libraries and process graphs, and because of its high performance it can be used for both batch and real-time processing. Spark lets you handle everything on a single platform rather than splitting tasks across platforms.

Micro-batching is a special kind of batch processing with a much smaller batch size. Micro-batching provides stateful computation, which makes windowing easier.

Apache Storm is a stream-processing architecture that can do micro-batching with Trident.

Spark is a batch-processing architecture that can do micro-batching with Spark Streaming (a windowed count in this style is sketched below).

The main difference between Spark and Storm is that Spark performs data-parallel computation while Storm performs task-parallel computation.
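
To make the windowing point concrete, here is a Spark Streaming sketch in Java (same assumptions as the earlier sketch: Spark 2.x, a TCP text source on localhost:9999): one-second micro-batches are aggregated over a sliding 30-second window that advances every 10 seconds. The window sizes are illustrative.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class WindowedWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));  // 1 s micro-batches

            JavaPairDStream<String, Integer> windowed = jssc.socketTextStream("localhost", 9999)
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKeyAndWindow(
                            Integer::sum,
                            Durations.seconds(30),   // window length: the last 30 seconds of micro-batches
                            Durations.seconds(10));  // slide interval: an updated result every 10 seconds
            windowed.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }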

Feature                Apache Storm/Trident                          Spark Streaming
Programming languages  Java, Clojure, Scala                          Java, Scala
Reliability            Supports "exactly once" processing; can       Supports only "exactly once"
                       also run in "at least once" and               processing.
                       "at most once" modes.
Stream source          Spout                                         HDFS
Stream primitives      Tuple, Partition                              DStream
Persistence            MapState                                      Per RDD
State management       Supported                                     Supported
Resource management    YARN, Mesos                                   YARN, Mesos
Provisioning           Apache Ambari                                 Basic monitoring using Ganglia
Messaging              ZeroMQ, Netty                                 Netty, Akka

In Spark streaming, if a worker node fails, the system can recompute from the remaining copy of the input data. But if the node running the network receiver fails, data that has not yet been replicated to other nodes may be lost. In short, only data backed by HDFS is safe.

In Storm/Trident, if a worker fails, Nimbus assigns the failed worker's tasks to other workers in the system. All tuples sent to the failed node time out and are therefore replayed automatically. In Storm, too, the delivery guarantee depends on having a safe data source.

Situation: Strictly low latency
Choice of framework: Storm can provide better latency with fewer restrictions than Spark Streaming.

Situation: Low development cost
Choice of framework: With Spark, the same code base can be used for batch processing and stream processing; with Storm, that is not possible.

Situation: Message delivery guarantee
Choice of framework: Both Apache Storm (Trident) and Spark Streaming offer an "exactly once" processing mode.

Situation: Fault tolerance
Choice of framework: Both frameworks are fault tolerant to roughly the same extent. In Apache Storm/Trident, if a process fails, the supervisor process restarts it automatically, since state management is handled through ZooKeeper. Spark restarts workers via the resource manager, which can be YARN, Mesos, or its standalone manager.

4. Summary

Spark is the Swiss Army knife of big-data analytics tools.

Storm excels at reliably processing unbounded streams of data in real time, while Hadoop is suited to batch processing.

Hadoop, Spark, and Storm are all open-source data-processing platforms; although their functionality overlaps, each has a different focus.

Hadoop is an open-source distributed data framework, used to store large data sets and to analyze and process data across clusters. Hadoop's MapReduce performs distributed batch computation, which is why Hadoop is used more as a data warehouse than as a data-analysis tool.

Spark does not have its own distributed storage system, so it relies on Hadoop's HDFS to store data.

Storm is a task-parallel, open-source distributed computing system. Storm has its own independent workflows in topologies, that is, directed acyclic graphs. Topologies run continuously until they are interrupted or the system shuts down. Storm does not run on a Hadoop cluster; it uses ZooKeeper to manage its processes. Storm can read and write files to HDFS.

Hadoop is popular for big-data processing, but Spark and Storm are even more popular.

 

 

5. Appendix

The English source text, which was translated above, follows.

Understanding the differences

1) Data processing models

Hadoop MapReduce is best suited for batch processing. For big data applications that require real-time processing, organizations must use other open source platforms like Impala or Storm. Apache Spark is designed to do more than plain data processing as it can make use of existing machine learning libraries and process graphs. Thanks to the high performance of Apache Spark, it can be used for both batch processing and real time processing. Spark provides an opportunity to use a single platform for everything rather than splitting the tasks on different open source platforms, avoiding the overhead of learning and maintaining different platforms.

Micro-batching is a special kind of batch processing wherein the batch size is orders of magnitude smaller. Windowing becomes easy with micro-batching as it offers stateful computation of data.

Storm is a stream processing framework that also does micro-batching (Trident).

Spark is a batch processing framework that also does micro-batching (Spark Streaming).

Apache Storm is a stream processing framework, which can do micro-batching using Trident (an abstraction on Storm to perform stateful stream processing in batches).

Spark is a framework to perform batch processing. It can also do micro-batching using Spark Streaming (an abstraction on Spark to perform stateful stream processing).

One key difference between these two frameworks is that Spark performs data-parallel computations while Storm performs task-parallel computations.

Feature                Apache Storm/Trident                          Spark Streaming
Programming languages  Java, Clojure, Scala                          Java, Scala
Reliability            Supports "exactly once" processing; can       Supports only "exactly once"
                       also run in "at least once" and               processing.
                       "at most once" modes.
Stream source          Spout                                         HDFS
Stream primitives      Tuple, Partition                              DStream
Persistence            MapState                                      Per RDD
State management       Supported                                     Supported
Resource management    YARN, Mesos                                   YARN, Mesos
Provisioning           Apache Ambari                                 Basic monitoring using Ganglia
Messaging              ZeroMQ, Netty                                 Netty, Akka

 

 

 

 

In Spark streaming, if a worker node fails, then the system can re-compute from the left-over copy of the input data. But if the node where the network receiver runs fails, then the data which is not yet replicated to other nodes might be lost. In short, only an HDFS-backed data source is safe.

In Apache Storm/Trident, if a worker fails, the nimbus assigns the worker's tasks to other nodes in the system. All tuples sent to the failed node will be timed out and hence replayed automatically. In Storm as well, the delivery guarantee depends on a safe data source.

Situation: Strictly low latency
Choice of framework: Storm can provide better latency with fewer restrictions than Spark Streaming.

Situation: Low development cost
Choice of framework: With Spark, the same code base can be used for batch processing and stream processing; with Storm, that is not possible.

Situation: Message delivery guarantee
Choice of framework: Both Apache Storm (Trident) and Spark Streaming offer an "exactly once" processing mode.

Situation: Fault tolerance
Choice of framework: Both frameworks are fault tolerant to roughly the same extent. In Apache Storm/Trident, if a process fails, the supervisor process restarts it automatically, since state management is handled through ZooKeeper. Spark restarts workers via the resource manager, which can be YARN, Mesos, or its standalone manager.

 

Spark is what you might call a Swiss Army knife of Big Data analytics tools.

Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

Hadoop, Spark, and Storm are some of the popular open source platforms for real time data processing. Each of these tools has some intersecting functionality. However, they have different roles to play.

Hadoop is an open source distributed processing framework. Hadoop is used for storing large data sets and running distributed analytics processes on various clusters. Hadoop MapReduce is limited to batch processing of one job at a time. This is the reason why these days Hadoop is being used extensively as a data warehousing tool and not as a data analysis tool.

Spark doesn't have its own distributed storage system. This is the reason why most big data projects install Apache Spark on Hadoop, so that advanced big data applications can be run on Spark using the data stored in the Hadoop Distributed File System.

Storm is a task parallel, open source distributed computing system. Storm has its own independent workflows in topologies, i.e. directed acyclic graphs. The topologies in Storm execute until there is some kind of disturbance or the system shuts down completely. Storm does not run on Hadoop clusters but uses ZooKeeper and its own minion workers to manage its processes. Storm can read and write files to HDFS.

Apache Hadoop is hot in the big data market but its cousins Spark and Storm are hotter.