The Spark Framework
Comparing Spark with Storm
For Storm:
1. Storm is recommended for scenarios that demand pure real-time processing and cannot tolerate more than one second of latency, for example real-time financial systems where trades must be executed and analyzed in real time.
2. Storm is also worth considering when the real-time computation requires a reliable transaction mechanism and strong reliability guarantees, i.e. the data must be processed with complete accuracy, not one record more and not one record less.
3. If you additionally need to adjust the parallelism of the real-time job dynamically between peak and off-peak hours, so as to make the most of the cluster (typically in small companies where cluster resources are tight), Storm is again a reasonable option.
4. If the big-data application is purely real-time computation, with no interactive SQL queries, complex transformation operators, or similar work in between, then Storm is the better choice.
For Spark Streaming:
1. If a real-time scenario satisfies none of the three Storm-oriented requirements above, i.e. it needs neither pure real time, nor a strong and reliable transaction mechanism, nor dynamic adjustment of parallelism, then Spark Streaming can be considered.
2. The most important reason to choose Spark Streaming is a macro-level view of the whole project: if, besides real-time computation, the project also includes offline batch processing, interactive queries, and similar workloads, and the real-time part itself may involve high-latency batch processing or interactive queries, then the Spark ecosystem should be the first choice. Offline batch jobs are developed with Spark Core, interactive queries with Spark SQL, and real-time computation with Spark Streaming; the three integrate seamlessly and give the system very high extensibility.
Strengths and weaknesses of Spark Streaming versus Storm
In fact, Spark Streaming is by no means simply better than Storm. Both frameworks are excellent in the real-time computing space; they just excel in different sub-scenarios.
Spark Streaming beats Storm only on throughput, and throughput is exactly the point that people who champion Spark Streaming and belittle Storm keep emphasizing. But is throughput really so important in every real-time scenario? Not necessarily. So claiming that Spark Streaming is stronger than Storm purely on the basis of throughput is not convincing.
On latency, Storm is far better than Spark Streaming: the former is truly real-time, while the latter is only near-real-time. Storm's transaction mechanism, robustness / fault tolerance, and dynamic adjustment of parallelism are also superior to Spark Streaming's.
Spark Streaming, however, has one thing Storm simply cannot match: it sits inside the Spark technology stack, so it integrates seamlessly with Spark Core and Spark SQL. That means the intermediate data produced by real-time processing can immediately, within the same program, be fed into delayed batch processing, interactive queries, and other operations. This greatly strengthens Spark Streaming's capabilities and appeal.
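To make the integration point concrete, here is a minimal sketch, run by piping a snippet into spark-shell against a local master; the page/hit data and names are made up for illustration. It simulates a micro-batch of intermediate results (the kind of RDD a DStream's foreachRDD would hand over) and queries it with Spark SQL inside the same program.

spark-shell --master local[2] <<'SCALA'
// Simulated micro-batch of intermediate results (hypothetical page-hit data),
// standing in for an RDD produced inside a DStream's foreachRDD
val batch = sc.parallelize(Seq(("pageA", 3), ("pageB", 7), ("pageA", 2)))
// Hand the RDD to Spark SQL without leaving the program
val hits = batch.toDF("page", "hits")
hits.createOrReplaceTempView("page_hits")
spark.sql("SELECT page, SUM(hits) AS total FROM page_hits GROUP BY page").show()
SCALA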
Download the Spark and Scala packages
Proceed as follows:
[hadoop@oversea-stable ~]$ wget http://mirrors.hust.edu.cn/apache/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
--2018-06-27 10:07:25--  http://mirrors.hust.edu.cn/apache/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
Resolving mirrors.hust.edu.cn (mirrors.hust.edu.cn)... 202.114.18.160
Connecting to mirrors.hust.edu.cn (mirrors.hust.edu.cn)|202.114.18.160|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 226128401 (216M) [application/octet-stream]
Saving to: ‘spark-2.3.0-bin-hadoop2.7.tgz’

100%[================================================================================================================>] 226,128,401 45.4KB/s   in 68m 12s

2018-06-27 11:15:38 (54.0 KB/s) - ‘spark-2.3.0-bin-hadoop2.7.tgz’ saved [226128401/226128401]

[hadoop@oversea-stable ~]$ wget https://scala-lang.org/files/archive/nightly/2.12.x/scala-2.12.5-bin-3995c7e.tgz
--2018-06-27 11:50:02--  https://scala-lang.org/files/archive/nightly/2.12.x/scala-2.12.5-bin-3995c7e.tgz
Resolving scala-lang.org (scala-lang.org)... 128.178.154.159
Connecting to scala-lang.org (scala-lang.org)|128.178.154.159|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20244926 (19M) [application/x-gzip]
Saving to: ‘scala-2.12.5-bin-3995c7e.tgz’

100%[================================================================================================================>] 20,244,926  516KB/s   in 4m 39s

2018-06-27 11:54:43 (70.8 KB/s) - ‘scala-2.12.5-bin-3995c7e.tgz’ saved [20244926/20244926]
Configure environment variables
Proceed as follows:
[hadoop@oversea-stable ~]$ tail -4 .bash_profile
export SCALA_HOME=/opt/scala
export SPARK_HOME=/opt/spark
PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH
export PATH
[hadoop@oversea-stable ~]$
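These variables only take effect in new login shells; to pick them up in the current session and confirm the paths resolve, something like the following can be used (the same .bash_profile entries also need to exist on the worker nodes):

source ~/.bash_profile
echo "$SCALA_HOME  $SPARK_HOME"   # both should print the /opt symlinks
which spark-shell                 # should resolve to /opt/spark/bin/spark-shell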
Configure and synchronize Scala
Proceed as follows:
[hadoop@oversea-stable ~]$ tar xfz scala-2.12.5-bin-3995c7e.tgz -C /opt/
[hadoop@oversea-stable opt]$ ln -s scala-2.12.5-bin-3995c7e scala
[hadoop@oversea-stable opt]$ for((i=67;i>=64;i--));do rsync -avzoptlg scala-2.12.5-bin-3995c7e 192.168.20.$i:/opt/ ; done
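rsync copies the Scala directory itself, but the /opt/scala symlink created above exists only on the master. A sketch of recreating it and sanity-checking the interpreter on each remote node (assuming passwordless SSH for the hadoop user):

for((i=67;i>=64;i--)); do
  ssh 192.168.20.$i 'ln -sfn /opt/scala-2.12.5-bin-3995c7e /opt/scala && /opt/scala/bin/scala -version'
done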
Configure and synchronize Spark
Proceed as follows:
[hadoop@oversea-stable ~]$ tar xfz spark-2.3.0-bin-hadoop2.7.tgz -C /opt/
[hadoop@oversea-stable ~]$ cd /opt/
[hadoop@oversea-stable opt]$ ln -s spark-2.3.0-bin-hadoop2.7 spark
[hadoop@oversea-stable opt]$ cd spark/conf
[hadoop@oversea-stable conf]$ pwd
/opt/spark/conf
[hadoop@oversea-stable conf]$ cp spark-env.sh{.template,}
[hadoop@oversea-stable conf]$ vim spark-env.sh
[hadoop@oversea-stable conf]$ tail -8 spark-env.sh
export SCALA_HOME=/opt/scala
export JAVA_HOME=/usr/java/latest
export SPARK_MASTER_IP=192.168.20.68
export SPARK_WORKER_MEMORY=1024m
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
export SPARK_LOCAL_IP=192.168.20.68    # change to each node's own IP
export SPARK_MASTER_HOST=192.168.20.68
[hadoop@oversea-stable conf]$
[hadoop@oversea-stable conf]$ cp slaves{.template,}
[hadoop@oversea-stable conf]$ vim slaves
[hadoop@oversea-stable conf]$ tail -3 slaves
open-stable
permission-stable
sp-stable
[hadoop@oversea-stable conf]$ cd /opt
[hadoop@oversea-stable opt]$ for((i=67;i>=64;i--));do rsync -avzoptlg spark-2.3.0-bin-hadoop2.7 192.168.20.$i:/opt/ ; done
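Because spark-env.sh was rsync'ed verbatim, every worker still has SPARK_LOCAL_IP=192.168.20.68; as the comment in the file says, it must be changed to each node's own IP, and the /opt/spark symlink also has to be recreated on the workers. A sketch of doing both from the master (assuming passwordless SSH and that each node's IP matches the loop variable):

for((i=67;i>=64;i--)); do
  ssh 192.168.20.$i "ln -sfn /opt/spark-2.3.0-bin-hadoop2.7 /opt/spark && \
    sed -i 's/^export SPARK_LOCAL_IP=.*/export SPARK_LOCAL_IP=192.168.20.$i/' /opt/spark/conf/spark-env.sh"
done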
Start Spark
Proceed as follows:
[hadoop@oversea-stable opt]$ cd spark
[hadoop@oversea-stable spark]$ sbin/start-slaves.sh
open-stable: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-open-stable.out
permission-stable: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-permission-stable.out
sp-stable: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-sp1-stable.out
[hadoop@oversea-stable spark]$ vim conf/slaves
[hadoop@oversea-stable spark]$ sbin/start-slaves.sh
open-stable: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-open-stable.out
permission-stable: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-permission-stable.out
[hadoop@oversea-stable spark]$
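Note that start-slaves.sh only brings up the Worker processes. If the Master is not already running, it can be started separately on the master node, or master and workers can be started together:

cd /opt/spark
sbin/start-master.sh   # Master only, on this node
# or, in one step:
sbin/start-all.sh      # Master here plus a Worker on every host in conf/slaves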
Verification
(1) Check the logs and confirm there are no errors
[hadoop@oversea-stable spark]$ cd logs
[hadoop@oversea-stable logs]$ ls
spark-hadoop-org.apache.spark.deploy.master.Master-1-oversea-stable.out
[hadoop@oversea-stable logs]$
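A quick way to scan all Spark logs for problems, rather than reading each file by hand:

grep -iE 'error|exception' /opt/spark/logs/*.out || echo "no errors found"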
(2) Check the process status on each server
[hadoop@oversea-stable logs]$ jps
12480 DFSZKFailoverController
27522 HMaster
6738 Master
7301 Jps
12123 NameNode
12588 ResourceManager
[hadoop@oversea-stable logs]$

[hadoop@open-stable logs]$ jps
15248 JournalNode
15366 NodeManager
16248 Jps
16169 Worker
15131 DataNode
18125 QuorumPeerMain
22781 HRegionServer
[hadoop@open-stable logs]$

[hadoop@permission-stable logs]$ jps
12800 QuorumPeerMain
24391 NodeManager
4647 Jps
24152 DataNode
4568 Worker
2236 HRegionServer
24269 JournalNode
[hadoop@permission-stable logs]$

[hadoop@sp1-stable logs]$ jps
7617 QuorumPeerMain
9233 Jps
21683 NodeManager
21540 JournalNode
28966 HRegionServer
21451 DataNode
8813 Worker
[hadoop@sp1-stable logs]$
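The same check can be scripted from the master instead of logging in to every machine, for example (a sketch, assuming passwordless SSH to the hosts listed in conf/slaves):

for h in open-stable permission-stable sp-stable; do
  echo "== $h =="
  ssh $h jps
done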
(3) Run spark-shell
[hadoop@oversea-stable logs]$ spark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark-2.3.0-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.9.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-06-27 15:15:49 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://oversea-stable:4040
Spark context available as 'sc' (master = local[*], app id = local-1530083761130).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_172)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
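Note that the session above started with master = local[*], so it did not actually use the standalone cluster. To exercise the cluster, point the shell (or spark-submit with the bundled SparkPi example) at the master URL; 7077 is the default standalone master port, and the examples jar name below assumes the layout of this 2.3.0 build:

spark-shell --master spark://192.168.20.68:7077

spark-submit --master spark://192.168.20.68:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100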
(4) View the Spark master's status in a web browser
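By default the standalone Master's web UI listens on port 8080 (each Worker serves its own UI on 8081, and a running application on 4040, as shown in the spark-shell output above). A quick reachability check from the command line:

curl -sI http://192.168.20.68:8080 | head -1   # expect an HTTP 200 response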