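Today I submitted my code in Spark on YARN cluster mode and ran into quite a few problems, so I am writing them down here.
The code ran fine when submitted with --master yarn-client, but with --master yarn-cluster it kept failing, and each time with a different error.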
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1405)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2284)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
    ...
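This error usually prompts you to check whether the jar has been deployed to every slave. With yarn-cluster, however, the submitted jar is normally distributed to the nodes automatically (Spark uploads it to a staging directory on HDFS and YARN localizes it for each container). Copying the jar to every slave machine by hand had no effect: the same error was still reported.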
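I started searching Google and Stack Overflow. The causes suggested for this error are all over the place: some blame the class loader, others blame UDFs. One answer caught my attention: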
"If your code references Java code, it is best to put the jar built from that code in the $SPARK_HOME/jars directory, so that the jar is guaranteed to be on the classpath."
I placed the jar as that answer suggested and reran the job. Watching the logs in the YARN web UI, this error was gone, but the job still failed, now with a different error:
java.io.FileNotFoundException: File does not exist: hdfs://master:9000/xxx/xxxx/xxxx/application_1495996836198_0003/__spark_libs__1200479165381142167.zip
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    ...
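This error was very familiar. When creating the SparkSession in my code, I had set master to the URL of the Spark standalone master. So when the job was submitted to YARN, it ended up following the setting in the code and starting in standalone mode. That conflict causes confusion and produces all sorts of baffling bugs.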
Modifying the code and repackaging fixed it.
Solution:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // .master("spark://master:7077") // master setting commented out; it is supplied by spark-submit instead
  .appName("xxxxxxx")
  .getOrCreate()
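With master no longer set in the code, the deploy mode is decided entirely at submit time. The following is only a sketch of a small helper that prints the spark-submit command for either mode so it can be inspected before running. The class name `com.example.Main` and the jar name `app.jar` are hypothetical placeholders, not taken from the original job. It uses `--master yarn --deploy-mode <mode>`, which is the Spark 2.x spelling of `--master yarn-client` / `--master yarn-cluster`.

```shell
#!/bin/sh
# Hypothetical placeholders: override with your own main class and jar.
APP_CLASS="${APP_CLASS:-com.example.Main}"
APP_JAR="${APP_JAR:-app.jar}"

# Print the spark-submit command for the given deploy mode.
# Accepts only "client" or "cluster"; anything else is rejected.
submit_cmd() {
  case "$1" in
    client|cluster) ;;
    *) echo "usage: submit_cmd client|cluster" >&2; return 1 ;;
  esac
  printf 'spark-submit --master yarn --deploy-mode %s --class %s %s\n' \
    "$1" "$APP_CLASS" "$APP_JAR"
}
```

Calling `submit_cmd cluster` prints the command line for cluster mode. Because the same jar is used for both modes, switching between client and cluster needs no code change or repackaging.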
Along the way I hit many other bugs as well, for example a deserialization failure:
SerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:85)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Or a class cast error like this one:
org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: scala.Tuple2 cannot be cast to com.xxx.xxxxx.ResultMerge
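All of these errors disappeared once the master setting was commented out.
With so many different exceptions appearing in turn, it is easy to get confused.
Fortunately the last one, java.io.FileNotFoundException, was an error I recognized, and that is what finally led to the fix.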
Another error I ran into is the following:
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1729427003-192.168.1.219-1527744820505:blk_1073742492_1669; getBlockSize()=24; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[192.168.1.219:50010,DS-e478076c-c3aa-4870-adce-7ffd6a49efe4,DISK], DatanodeInfoWithStorage[192.168.1.21:50010,DS-af806575-7404-45fd-bae0-0fcc59de7598,DISK]]}
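This happens when you operate on an HDFS file that is still being written. It typically shows up when a file written by Flume was not closed properly, or when an HDFS restart left files in a bad state.
You can use the following command to see which files are OPENFORWRITE or MISSING: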
hadoop fsck / -openforwrite | egrep "MISSING|OPENFORWRITE"
The command above identifies the specific files; deleting them resolves the error.
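To turn the fsck report into a plain list of paths to review before deleting anything, a filter along these lines may help. It is a sketch that assumes each OPENFORWRITE line begins with the file path as its first whitespace-separated field; the exact layout can vary between Hadoop versions, and paths containing spaces would be cut short. The function name `open_for_write_paths` is made up for this example.

```shell
#!/bin/sh
# Read fsck output on stdin and print the path of every file reported
# as OPENFORWRITE. Only lines whose first field starts with "/" are
# treated as file entries; headers and summary lines are skipped.
open_for_write_paths() {
  awk '/OPENFORWRITE/ && $1 ~ /^\// { print $1 }'
}

# Usage (requires a running cluster):
#   hadoop fsck / -openforwrite | open_for_write_paths
```

Reviewing the printed list first makes it less likely that a file still being written by a healthy process is deleted by mistake.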