Getting Started with Spark Streaming
1. Overview
Spark Streaming is an extension of the Spark Core API that enables scalable, high-throughput, fault-tolerant stream processing of live data.
As the figure above shows, data can come from many sources such as Kafka, Flume, Twitter, HDFS/S3, and Kinesis (the last of these is used less often). The ingested data can be processed with complex algorithms expressed through high-level functions such as map and reduce, and once processed by the Spark Streaming framework the results can be pushed out to file systems, databases, or live dashboards. Beyond that, Spark's machine learning and graph processing algorithms can be applied directly to the data streams.
A simple working definition of Spark Streaming: data from different sources is processed by the Spark Streaming framework and the results are written out to an external file system (see the sketch after the feature list below).
Features:
Low latency
Efficient recovery from failures (fault-tolerant)
Able to run on hundreds or thousands of nodes
Able to combine batch processing, machine learning, graph computation, and other sub-frameworks with Spark Streaming
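As a minimal sketch of that source-to-sink flow (the hostname, port, and output path below are placeholder assumptions, not values from this article), the following job reads text from a socket, counts words per batch, and saves each batch's result to an external file system:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamToFileSystem {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the receiver, one for processing the batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamToFileSystem")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Source: a TCP socket (placeholder hostname/port)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Processing expressed with high-level functions (flatMap / map / reduceByKey)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Sink: write every batch to an external file system (placeholder path prefix)
    counts.saveAsTextFiles("hdfs:///tmp/wordcounts/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}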
2. Use cases:
Real-time monitoring of electronic devices
Real-time detection of financial fraud during transactions
Recommendations in e-commerce
3. Using Spark Streaming with the Rest of the Spark Ecosystem
Spark SQL, Spark Streaming, MLlib, and GraphX are all extensions built on top of Spark Core, so how do they interact with one another? (to be expanded later)
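One common interaction pattern, shown here only as a hedged sketch (the hostname, port, and table name are assumptions), is to hand each micro-batch RDD produced by Spark Streaming to Spark SQL inside foreachRDD:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWithSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("StreamingWithSql").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Each micro-batch is an ordinary RDD, so it can be registered as a table
    words.foreachRDD { rdd =>
      import spark.implicits._
      val df = rdd.toDF("word")
      df.createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

MLlib and GraphX can be plugged in the same way, because inside foreachRDD the data is just a regular RDD or DataFrame that any Spark library can consume.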
4. A Brief History of Spark
5. Getting Started with Spark Streaming via a Word Count Example
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *   `$ nc -lk 9999`
 * and then run the example
 *   `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

To submit it with spark-submit, use the following command (the comments at the top of the code explain the arguments):

./spark-submit --master local[2] \
  --class org.apache.spark.examples.streaming.NetworkWordCount \
  --name NetworkWordCount \
  /home/hadoop/app/spark/eaxmple/jars/spark-example_2.11-2.2.20.jar hadoop0000 9999
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.socketTextStream("hadoop000", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
To try the same logic interactively, just run ./spark-shell --master local[2] and paste the code above into the shell.
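Note that when running locally the master must be local[n] with n greater than 1 (local[2] in the commands above): the receiver occupies one thread full-time, so at least one more thread is needed to actually process the received batches; with local[1] the data would be received but never processed.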
6. How It Works
Coarse-grained view: Spark Streaming receives the live data stream and cuts it into small blocks according to the configured time interval (each small block is treated as an RDD). These blocks are then handed to the Spark engine for processing, and the results are returned batch by batch as well.
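To make the point that each block is just an RDD concrete, here is a small hedged sketch (it assumes ssc is the StreamingContext and the socket source from the earlier example):

// Each micro-batch arrives as an ordinary RDD that the Spark engine processes
val lines = ssc.socketTextStream("localhost", 9999)
lines.foreachRDD { (rdd, time) =>
  // rdd is a plain RDD[String] holding the data received during this batch interval
  println(s"Batch at $time contains ${rdd.count()} records")
}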
Fine-grained view: the application holds two contexts, a SparkContext and a StreamingContext, and uses receivers to ingest data. Receivers run as tasks on the executors and pull in data; when an executor receives data it splits it by time interval and stores it in memory, and if multiple replicas are configured the blocks are copied to other executors as backups (replicate blocks). The receiver on the executor reports the block metadata back to the StreamingContext, and at every batch interval the StreamingContext asks the SparkContext to launch jobs and distribute them to the executors for execution.
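The replication step above is controlled by the storage level passed to the receiver. As a sketch (again assuming ssc from the earlier examples), a level whose name ends in _2 keeps a second copy of every received block on another executor:

import org.apache.spark.storage.StorageLevel

// The trailing "_2" requests two replicas, so received blocks are copied to
// another executor before jobs run on them, giving fault tolerance.
val replicatedLines = ssc.socketTextStream(
  "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)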