Spark Study Notes (Part 2)

I. The Five Basic Properties of an RDD

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value pairs, such as groupByKey and join; [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of Doubles; [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that can be saved as SequenceFiles.
All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
// Any RDD of the right type automatically gains all of these operations through implicit conversions.
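
For instance (a minimal sketch with made-up values, assuming a local SparkContext), reduceByKey is not defined on RDD itself, yet it compiles on an RDD[(Int, Int)] because the implicit conversion to PairRDDFunctions kicks in:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("implicit-ops").setMaster("local[*]"))

// A plain RDD[(Int, Int)]; reduceByKey comes from PairRDDFunctions
// through the implicit conversion defined in the RDD companion object.
val pairs = sc.parallelize(Seq((1, 2), (1, 3), (2, 4)))
val summed = pairs.reduceByKey(_ + _)
summed.collect().foreach(println)   // prints (1,5) and (2,4)

sc.stop()
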
Internally, each RDD is characterized by five main properties:

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

    All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions.
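
To make the five properties concrete, here is a toy custom RDD (a sketch only; RangeRDD and RangePartition are hypothetical names). It overrides getPartitions and compute and leaves the other three properties at their defaults (no parent dependencies, no Partitioner, no preferred locations):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition covering the numbers [start, end)
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// Generates the numbers [0, n) spread across numSlices partitions
class RangeRDD(sc: SparkContext, n: Int, numSlices: Int) extends RDD[Int](sc, Nil) {
  override def getPartitions: Array[Partition] = {
    val step = math.ceil(n.toDouble / numSlices).toInt
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, i * step, math.min((i + 1) * step, n))
    }
  }

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}

Built-in RDDs such as those returned by sc.textFile implement these same methods; a custom RDD like this is only needed when reading from a new storage system.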

II. Spark Fundamentals

1. Writing a file to HDFS fails with an error if the file already exists.
2. A complete DAG only forms when an action is triggered; the job is then submitted to the cluster for execution.
3. Before a job is submitted to the cluster, some preparation takes place, all of it on the Driver side (see the sketch after this list). The preparation includes:

  • 01. Build the DAG (directed acyclic graph)
  • 02. Split the DAG into one or more stages
  • 03. Execute the job stage by stage; stages must be submitted in order, because a later stage depends on the results of the stages that ran before it
  • 04. A stage generates multiple tasks, which are then submitted to executors. The number of Tasks a stage generates equals the number of partitions of that stage's RDD [to fully exploit parallelism]
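
The following sketch ties items 1-3 above together (paths and names are placeholders, assuming a local run). The reduceByKey introduces a shuffle, so the DAG is cut into two stages, each with one task per partition, and nothing executes until the final action:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("stages").setMaster("local[*]"))

// minPartitions = 4, so each stage runs (at least) 4 tasks
val counts = sc.textFile("hdfs://…/input", 4)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // shuffle: the stage boundary is here

// toDebugString prints the lineage; the ShuffledRDD marks the stage split
println(counts.toDebugString)

// Nothing has run yet. This action builds the DAG on the Driver, splits it
// into stages, and submits them in order. It errors out if the output
// path already exists (item 1 above).
counts.saveAsTextFile("hdfs://…/output")

sc.stop()
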

4. Dependencies exist between RDDs, and they come in two kinds (see the sketch after this list):

  • 01. Narrow dependency: involves no shuffle
  • 02. Wide dependency: involves a shuffle
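
For example (assuming an existing SparkContext sc):

val nums  = sc.parallelize(1 to 10, 2)
val pairs = nums.map(n => (n % 3, n)) // narrow: each output partition reads exactly one parent partition
val byKey = pairs.groupByKey()        // wide: output partitions read many parent partitions via a shuffle
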

5. Broadcast variables

  • 01. Broadcast a variable from the Driver to all of its Executors. This reduces memory consumption on the worker nodes as well as network bandwidth [if every Executor process needs its own copy of a shared variable, a single worker may end up holding several copies of the value, which drives up network traffic]. Note that if you create a variable on the Driver and use it inside a function passed to an RDD method, that function executes in the Executors' Tasks, so the Driver ships the Driver-defined variable to every single Task.
  • 02. Spark provides a template that can be extended to serve as a broadcast variable.
  • 03. A minimal example follows this list:
 
  
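A minimal sketch (the lookup table and all names are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast").setMaster("local[*]"))

// Without broadcasting, this Map would be serialized into every Task's closure.
val provinceNames = Map("010" -> "Beijing", "021" -> "Shanghai")

// Broadcast it once so each Executor keeps a single read-only copy.
val bc = sc.broadcast(provinceNames)

val codes = sc.parallelize(Seq("010", "021", "010"))
val named = codes.map(code => bc.value.getOrElse(code, "unknown"))
named.collect().foreach(println)

sc.stop()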

6. Collecting the results to the Driver and then writing them to MySQL is not the best choice, for two reasons:
01. it wastes network bandwidth, and
02. the rows cannot be written to the database in parallel.
7. result.foreach(data => {
// Do not create a JDBC connection for every single record written; that consumes far too many resources
// val conn: Connection = DriverManager.getConnection("……")
})
9.正确的操做应该是,一次拿出来一个分区,分区用迭代器引用。即每写一个分区,才会建立一个链接,这样会大大减小资源的消耗。
result.foreachPartition(part => {
  dataToMysql(part)
})

import java.sql.{Connection, DriverManager}

def dataToMysql(part: Iterator[(String, Int)]): Unit = {
  // Pass the JDBC URL, then the user name and password
  val conn: Connection = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/……characterEncoding=utf8", "root", "")
  val prepareStatement = conn.prepareStatement(
    "INSERT INTO access_log (province, counts) VALUES (?, ?)")

  // Write the data
  part.foreach(data => {
    prepareStatement.setString(1, data._1)
    prepareStatement.setInt(2, data._2) // counts is an Int, so use setInt
    prepareStatement.executeUpdate()
  })
  prepareStatement.close()
  conn.close()
}
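
A further refinement worth considering (a sketch, not from the original notes): batch the inserts inside each partition, so the database sees one round trip per batch instead of one per row.

import java.sql.DriverManager

def dataToMysqlBatched(part: Iterator[(String, Int)]): Unit = {
  val conn = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/……characterEncoding=utf8", "root", "")
  val ps = conn.prepareStatement(
    "INSERT INTO access_log (province, counts) VALUES (?, ?)")
  part.foreach { case (province, counts) =>
    ps.setString(1, province)
    ps.setInt(2, counts)
    ps.addBatch()   // queue the row instead of executing it immediately
  }
  ps.executeBatch() // flush every queued row in one round trip
  ps.close()
  conn.close()
}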
