SparkSession - The Entry Point to Spark SQL

Date: 2019-11-20

Translated from: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-SparkSession.html

Overview

SparkSession is the entry point to Spark SQL. When you write a Spark SQL application with Dataset or DataFrame, the first object you create is a SparkSession.

Note: In Spark 2.0, SparkSession merged SQLContext and HiveContext.

You create a SparkSession instance with SparkSession.builder and stop it with the stop method.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .appName("My Spark Application")  // optional and will be autogenerated if not specified
  .master("local[*]")               // avoid hardcoding the deployment environment
  .enableHiveSupport()              // self-explanatory, isn't it?
  .config("spark.sql.warehouse.dir", "target/spark-warehouse")
  .getOrCreate

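A single Spark application can use multiple SparkSessions, which lets you keep separate sets of relational entities isolated from one another per SparkSession (see the catalog attribute).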

scala> spark.catalog.listTables.show
+------------------+--------+-----------+---------+-----------+
|              name|database|description|tableType|isTemporary|
+------------------+--------+-----------+---------+-----------+
|my_permanent_table| default|       null|  MANAGED|      false|
|              strs|    null|       null|TEMPORARY|       true|
+------------------+--------+-----------+---------+-----------+
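The isolation between sessions can be sketched as follows. This is a minimal illustration, assuming a local master and no Hive support; the application name is hypothetical, and the view name `strs` is borrowed from the listing above. A temporary view registered in one session should not appear in the catalog of a session created with newSession, although both share one SparkContext.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("session-isolation-sketch") // hypothetical application name
  .master("local[*]")                  // local mode, for illustration only
  .getOrCreate()

// A second session: same SparkContext, separate temporary views
val other = spark.newSession()

// Register a temporary view in the first session only
spark.range(3).createOrReplaceTempView("strs")

// Look the view up in each session's catalog
val inFirst  = spark.catalog.listTables().collect().map(_.name).contains("strs")
val inSecond = other.catalog.listTables().collect().map(_.name).contains("strs")

// Both sessions are backed by the same SparkContext instance
val sameContext = spark.sparkContext eq other.sparkContext

println(s"first=$inFirst second=$inSecond sameContext=$sameContext")

spark.stop()
```

Note that stop shuts down the underlying SparkContext, so it ends every session that shares it.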

Internally, a SparkSession holds a SparkContext, a SharedState, and a SessionState. The following table outlines the role of each:

Name	Type	Description
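sparkContext	SparkContext	The main entry point to Spark functionality. Use the SparkContext to create RDDs, accumulators, and broadcast variables on the cluster
existingSharedState	Option[SharedState]	An internal class that holds the state shared across sessions
parentSessionState	Option[SessionState]	The parent session's state, which is copied into the new session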
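As a rough sketch of the sparkContext row above, the following reaches the SparkContext through the session and creates an RDD, a broadcast variable, and an accumulator. The application name, the accumulator name, and the sample values are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("spark-context-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()

val sc = spark.sparkContext

val rdd    = sc.parallelize(1 to 5)    // RDD[Int]
val factor = sc.broadcast(10)          // read-only value shipped to executors
val evens  = sc.longAccumulator("evens") // counter updated by tasks

// Use the broadcast value inside a transformation
val scaled = rdd.map(n => n * factor.value).collect()

// Update the accumulator inside an action
rdd.foreach(n => if (n % 2 == 0) evens.add(1))

println(scaled.mkString(", "))
println(evens.sum)

spark.stop()
```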

The following table lists the methods of SparkSession, which cover creating Datasets, DataFrames, streaming queries, and so on.

Method	Description
builder	"Opens" a builder to get or create a SparkSession instance
version	Returns the current version of Spark.
implicits	Use import spark.implicits._ to import the implicit conversions and create Datasets from (almost arbitrary) Scala objects.
emptyDataset[T]	Creates an empty Dataset[T].
range	Creates a Dataset[Long].
sql	Executes a SQL query (and returns a DataFrame).
udf	Access to user-defined functions (UDFs).
table	Creates a DataFrame from a table.
catalog	Access to the catalog of the entities of structured queries
read	Access to DataFrameReader to read a DataFrame from external files and storage systems.
conf	Access to the current runtime configuration.
readStream	Access to DataStreamReader to read streaming datasets.
streams	Access to StreamingQueryManager to manage structured streaming queries.
newSession	Creates a new SparkSession.
stop	Stops the SparkSession.
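A few of the methods above can be exercised together as in this sketch. The view name `words`, the UDF name `shout`, and the application name are hypothetical, chosen only for the example; a local master is assumed.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("session-methods-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._ // enables toDS() and as[String] below

println(spark.version)

val numbers = spark.range(5)          // values 0 to 4
val empty   = spark.emptyDataset[String]
val words   = Seq("a", "b", "c").toDS() // Dataset[String] via implicits

// sql and table both work against the session's catalog
words.createOrReplaceTempView("words")
val fromSql   = spark.sql("SELECT value FROM words WHERE value <> 'a'")
val fromTable = spark.table("words")

// udf: register a function, then call it from SQL
spark.udf.register("shout", (s: String) => s.toUpperCase)
val shouted = spark.sql("SELECT shout(value) AS v FROM words").as[String].collect().sorted

// conf: change a runtime setting for this session
spark.conf.set("spark.sql.shuffle.partitions", "4")
val partitions = spark.conf.get("spark.sql.shuffle.partitions")

val counts = (numbers.count(), empty.count(), fromSql.count(), fromTable.count())
println(counts)

spark.stop()
```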

Builder

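Builder is the constructor helper for SparkSession. Through the Builder you can supply configuration options before the session is created.
The Builder offers the following methods: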

Method	Description
getOrCreate	Gets an existing SparkSession or creates a new one
enableHiveSupport	Enables Hive support
appName	Sets the name of the application
config	Sets a configuration option

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .appName("My Spark Application")  // optional and will be autogenerated if not specified
  .master("local[*]")               // avoid hardcoding the deployment environment
  .enableHiveSupport()              // self-explanatory, isn't it?
  .getOrCreate
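The "get or create" behaviour can be illustrated with a small sketch, assuming a fresh JVM with no session running yet. The application name and the chosen configuration value are illustrative. The second call to getOrCreate is expected to return the session built by the first call rather than a new one.

```scala
import org.apache.spark.sql.SparkSession

val first = SparkSession.builder
  .appName("builder-sketch") // hypothetical application name
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

// The option passed to the builder is visible in the runtime configuration
val partitions = first.conf.get("spark.sql.shuffle.partitions")

// A session already exists, so the builder hands it back
val second = SparkSession.builder.getOrCreate()
val reused = first eq second

println(s"partitions=$partitions reused=$reused")

first.stop()
```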

SharedState

SharedState is an internal class used by SparkSession that holds the state shared across all active sessions. The following table describes its properties.

Name	Type	Description
cacheManager	CacheManager	A support class for SQLContext that automatically caches query results so that later queries can reuse them during execution
externalCatalog	ExternalCatalog	Holds the catalog of the external system
globalTempViewManager	GlobalTempViewManager	A thread-safe class that manages global temp views and provides atomic operations such as create, update, and remove for them
jarClassLoader	NonClosableMutableURLClassLoader	Loads the jars added by the user
listener	SQLListener	A listener class for SQL execution events
sparkContext	SparkContext	The core entry point to Spark
warehousePath	String	The warehouse location. It can be set with spark.sql.warehouse.dir or with hive.metastore.warehouse.dir in hive-site.xml; the Spark setting overrides the Hive one
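One visible effect of this shared state is that global temp views are reachable from every session. The sketch below assumes a local master and the default global temp database name `global_temp`; the view name `shared_numbers` and the application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("shared-state-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()

val other = spark.newSession()

// Register a global temp view from the first session
spark.range(3).createOrReplaceGlobalTempView("shared_numbers")

// Query it from the second session through the global temp database
val seenFromOther = other.sql("SELECT * FROM global_temp.shared_numbers").count()
println(seenFromOther)

// Clean up the view before shutting down
spark.catalog.dropGlobalTempView("shared_numbers")
spark.stop()
```

By contrast, a view created with createOrReplaceTempView stays local to the session that registered it.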

SharedState takes a SparkContext as its constructor argument. If hive-site.xml can be found on the CLASSPATH, SharedState adds it to the Hadoop configuration of that SparkContext.

Set log4j.logger.org.apache.spark.sql.internal.SharedState=INFO to see the corresponding log messages.
