Spark SQL Programming: Datasets
Author: Yin Zhengjie
Copyright notice: This is an original work. Reproduction is prohibited; violators will be held legally liable.
1. Creating a Dataset
Tip: A Dataset is a strongly typed collection of data, so the corresponding type information must be supplied. A concrete example follows.

scala> case class Person(name: String, age: Long)   // define a case class
defined class Person

scala> val caseClassDS = Seq(Person("YinZhengjie", 18)).toDS()   // create a Dataset
caseClassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> caseClassDS.show   // note that the Dataset methods are used much like the DataFrame methods
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
+-----------+---+

scala> caseClassDS.createTempView("person")

scala> spark.sql("select * from person").show
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
+-----------+---+

scala>
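The spark-shell pre-imports the implicits that make toDS() available; a standalone application must import them explicitly from its own SparkSession. Below is a minimal standalone sketch of the same steps. The object name CreateDataSetDemo, the appName, and the local[*] master are illustrative assumptions, not part of the original transcript.

import org.apache.spark.sql.SparkSession

object CreateDataSetDemo {
  // Case classes used with Datasets should be defined outside the method,
  // so Spark can derive an Encoder for them.
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateDataSetDemo")   // hypothetical app name
      .master("local[*]")             // assumes a local run
      .getOrCreate()
    // The spark-shell does this automatically; a standalone app must do it
    // by hand to bring toDS() and the implicit Encoders into scope.
    import spark.implicits._

    val caseClassDS = Seq(Person("YinZhengjie", 18)).toDS()
    caseClassDS.show()

    caseClassDS.createTempView("person")
    spark.sql("select * from person").show()

    spark.stop()
  }
}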
2. Converting an RDD to a Dataset
scala> case class Person(name: String, age: Long)   // define a case class
defined class Person

scala> val listRDD = sc.makeRDD(List(("YinZhengjie",18),("Jason Yin",20),("Danny",28)))   // create an RDD of tuples
listRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[84] at makeRDD at <console>:27

scala> val mapRDD = listRDD.map( t => { Person(t._1, t._2) })   // use the map operator to turn each element of listRDD into a Person
mapRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[102] at map at <console>:30

scala> val ds = mapRDD.toDS   // convert the RDD to a Dataset
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> ds.show
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
|  Jason Yin| 20|
|      Danny| 28|
+-----------+---+

scala>
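The same conversion as a standalone sketch: in an application, the shell's sc corresponds to spark.sparkContext, and import spark.implicits._ is again required for toDS. The object name RddToDataSetDemo is hypothetical.

import org.apache.spark.sql.SparkSession

object RddToDataSetDemo {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDataSetDemo")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // In a standalone app, the shell's sc is spark.sparkContext.
    val listRDD = spark.sparkContext.makeRDD(
      List(("YinZhengjie", 18), ("Jason Yin", 20), ("Danny", 28)))

    // Pattern matching reads more clearly than t._1 / t._2.
    val ds = listRDD.map { case (name, age) => Person(name, age) }.toDS()
    ds.show()

    spark.stop()
  }
}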
3. Converting a Dataset to an RDD
scala> ds.show   // inspect the Dataset's data
+-----------+---+
|       name|age|
+-----------+---+
|YinZhengjie| 18|
|  Jason Yin| 20|
|      Danny| 28|
+-----------+---+

scala> ds
res6: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

scala> ds.rdd   // convert the Dataset to an RDD
res7: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[26] at rdd at <console>:29

scala> res7.collect   // inspect the RDD's data
res8: Array[Person] = Array(Person(YinZhengjie,18), Person(Jason Yin,20), Person(Danny,28))

scala>
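For completeness, a standalone sketch of the round trip (object and app names again hypothetical). Calling .rdd hands back the Dataset's underlying RDD[Person]: Spark deserializes its internal binary rows back into JVM objects, so the strong element type is preserved.

import org.apache.spark.sql.SparkSession

object DataSetToRddDemo {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataSetToRddDemo")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Person("YinZhengjie", 18),
                 Person("Jason Yin", 20),
                 Person("Danny", 28)).toDS()

    // .rdd returns RDD[Person]; each element is a fully typed case class.
    val personRDD = ds.rdd
    personRDD.collect().foreach(println)   // Person(YinZhengjie,18) ...

    spark.stop()
  }
}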