Getting Started with Spark Streaming
1. Overview
Spark Streaming is an extension of the Spark Core API that enables scalable, high-throughput, fault-tolerant stream processing of live data.
As the figure above shows, data can come from many sources such as Kafka, Flume, Twitter, HDFS/S3, and Kinesis (the last of these is used less often). The ingested data can be processed with complex algorithms expressed through high-level functions such as map and reduce, and once processed by the Spark Streaming framework the results can be pushed out to file systems, databases, or live dashboards. Beyond that, Spark's machine learning and graph processing algorithms can be applied directly to the data streams.
A simple working definition of Spark Streaming: data from different sources is processed by the Spark Streaming framework and the results are written out to an external file system (see the sketch after the feature list below).
Features:
Low latency
Efficient recovery from failures (fault-tolerant)
Able to run on hundreds or thousands of nodes
Able to combine batch processing, machine learning, graph computation, and other sub-frameworks with Spark Streaming
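As a minimal sketch of that source-to-sink flow (the hostname, port, and output path below are placeholder assumptions, not values from this article), the following job reads text from a socket, counts words per batch, and saves each batch's result to an external file system:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamToFileSystem {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the receiver, one for processing the batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamToFileSystem")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Source: a TCP socket (placeholder hostname/port)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Processing expressed with high-level functions (flatMap / map / reduceByKey)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Sink: write every batch to an external file system (placeholder path prefix)
    counts.saveAsTextFiles("hdfs:///tmp/wordcounts/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}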
2. Use cases:
Real-time monitoring of electronic devices
Real-time detection of financial fraud during transactions
Recommendations in e-commerce
3. Using Spark Streaming with the Rest of the Spark Ecosystem
Spark SQL, Spark Streaming, MLlib, and GraphX are all extensions built on top of Spark Core, so how do they interact with one another? (to be expanded later)
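One common interaction pattern, shown here only as a hedged sketch (the hostname, port, and table name are assumptions), is to hand each micro-batch RDD produced by Spark Streaming to Spark SQL inside foreachRDD:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWithSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("StreamingWithSql").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Each micro-batch is an ordinary RDD, so it can be registered as a table
    words.foreachRDD { rdd =>
      import spark.implicits._
      val df = rdd.toDF("word")
      df.createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

MLlib and GraphX can be plugged in the same way, because inside foreachRDD the data is just a regular RDD or DataFrame that any Spark library can consume.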
4. A Brief History of Spark
5. Getting Started with Spark Streaming via a Word Count Example
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *   `$ nc -lk 9999`
 * and then run the example
 *   `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

To submit it with spark-submit, use the following command (the comments at the top of the code explain the arguments):

./spark-submit --master local[2] \
  --class org.apache.spark.examples.streaming.NetworkWordCount \
  --name NetworkWordCount \
  /home/hadoop/app/spark/eaxmple/jars/spark-example_2.11-2.2.20.jar hadoop0000 9999
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.socketTextStream("hadoop000", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
To try the same logic interactively, just run ./spark-shell --master local[2] and paste the code above into the shell.
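Note that when running locally the master must be local[n] with n greater than 1 (local[2] in the commands above): the receiver occupies one thread full-time, so at least one more thread is needed to actually process the received batches; with local[1] the data would be received but never processed.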
6. How It Works
Coarse-grained view: Spark Streaming receives the live data stream and cuts it into small blocks according to the configured time interval (each small block is treated as an RDD). These blocks are then handed to the Spark engine for processing, and the results are returned batch by batch as well.
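To make the point that each block is just an RDD concrete, here is a small hedged sketch (it assumes ssc is the StreamingContext and the socket source from the earlier example):

// Each micro-batch arrives as an ordinary RDD that the Spark engine processes
val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { (rdd, time) =>
  // rdd is a plain RDD[String] holding the data received during this batch interval
  println(s"Batch at $time contains ${rdd.count()} records")
}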
Fine-grained view: the application holds two contexts, a SparkContext and a StreamingContext, and uses receivers to ingest data. Receivers run as tasks on the executors and pull in data; when an executor receives data it splits it by time interval and stores it in memory, and if multiple replicas are configured the blocks are copied to other executors as backups (replicate blocks). The receiver on the executor reports the block metadata back to the StreamingContext, and at every batch interval the StreamingContext asks the SparkContext to launch jobs and distribute them to the executors for execution.
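The replication step above is controlled by the storage level passed to the receiver. As a sketch (again assuming ssc from the earlier examples), a level whose name ends in _2 keeps a second copy of every received block on another executor:

import org.apache.spark.storage.StorageLevel

// The trailing "_2" requests two replicas, so received blocks are copied to
// another executor before jobs run on them, giving fault tolerance.
val replicatedLines = ssc.socketTextStream(
  "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)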