1. Problem Background
1. The cloud host is a Linux machine running Hadoop in pseudo-distributed mode
Public IP: 139.198.18.xxx
Private IP: 192.168.137.2
Hostname: hadoop001
2. The local core-site.xml is configured as follows:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop001:9001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>hdfs://hadoop001:9001/hadoop/tmp</value>
  </property>
</configuration>
3. The local hdfs-site.xml is configured as follows:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
4. The cloud host's /etc/hosts is configured as follows:
[hadoop@hadoop001 ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
# hostname loopback address
192.168.137.2 hadoop001
The cloud host maps its private IP to the hostname hadoop001.
5. The local hosts file is configured as follows:
139.198.18.XXX hadoop001
The local machine maps the public IP to the hostname hadoop001.
2. Problem Symptoms
1. HDFS starts on the cloud host, jps shows all processes running normally, and operating on HDFS files through the shell works fine
2. The 50070 web management UI is also accessible from a browser
3. Operating on the remote HDFS from the local machine with the Java API also works; the URI uses the public IP (via the hadoop001 mapping), and the code is as follows:
val uri = new URI("hdfs://hadoop001:9001")
val fs = FileSystem.get(uri, conf)
val listfiles = fs.listFiles(new Path("/data"), true)
while (listfiles.hasNext) {
  val nextfile = listfiles.next()
  println("get file path:" + nextfile.getPath().toString())
}

------------------------------ Run result ---------------------------------
get file path:hdfs://hadoop001:9001/data/infos.txt
4. However, when reading a file on HDFS from the local machine with Spark SQL and converting it to a DataFrame:
object SparkSQLApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSQLApp").master("local[2]").getOrCreate()
    val info = spark.sparkContext.textFile("/data/infos.txt")
    import spark.implicits._
    val infoDF = info.map(_.split(",")).map(x => Info(x(0).toInt, x(1), x(2).toInt)).toDF()
    infoDF.show()
    spark.stop()
  }
  case class Info(id: Int, name: String, age: Int)
}
the following error occurs (a Spark-free reproduction of the same failure is sketched after the log):
....
19/02/23 16:07:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/02/23 16:07:00 INFO HadoopRDD: Input split: hdfs://hadoop001:9001/data/infos.txt:0+17
19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    ...
19/02/23 16:07:21 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null
No live nodes contain current block
Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]
Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK].
Will get new block locations from namenode and retry...
19/02/23 16:07:21 WARN DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 272.617680460432 msec.
19/02/23 16:07:42 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    ...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
    ...
19/02/23 16:08:12 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection timed out: no further information
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    ...
19/02/23 16:08:12 INFO DFSClient: Could not obtain BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 from any node: java.io.IOException: No live nodes contain block BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 after checking nodes = [DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]], ignoredNodes = null
No live nodes contain current block
Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]
Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK].
Will get new block locations from namenode and retry...
19/02/23 16:08:12 WARN DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 11918.913311370841 msec.
19/02/23 16:08:45 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    ...
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
No live nodes contain current block
Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]
Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK].
Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
No live nodes contain current block
Block locations: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK]
Dead nodes: DatanodeInfoWithStorage[192.168.137.2:50010,DS-fb2e7244-165e-41a5-80fc-4bb90ae2c8cd,DISK].
Throwing a BlockMissingException
19/02/23 16:08:45 WARN DFSClient: DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
    ...
19/02/23 16:08:45 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:648)
    ...
19/02/23 16:08:45 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
19/02/23 16:08:45 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/02/23 16:08:45 INFO TaskSchedulerImpl: Cancelling stage 0
19/02/23 16:08:45 INFO DAGScheduler: ResultStage 0 (show at SparkSQLApp.scala:30) failed in 105.618 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
    ...
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1358284489-192.168.137.2-1550394746448:blk_1073741840_1016 file=/data/infos.txt
    at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1001)
    ...
3. Problem Analysis
1. Shell operations on the cluster itself work, which rules out problems with the cluster setup or processes not being started
2. No firewall is configured on the cloud host, which rules out a firewall that was left on
3. The cloud server's firewall does have the DataNode data-transfer port open (50010 by default)
4. I set up another VM that is on the same LAN as my local machine; operating on that VM's HDFS from the local machine works fine, which basically confirms that the problem comes from the internal/external network split
5. According to the documentation, HDFS directory and file names are stored on the NameNode, and these operations do not need to talk to the DataNodes. Since creating directories and files works, communication between the local machine and the remote NameNode is fine; the problem is therefore most likely in the communication between the local machine and the remote DataNode (see the verification sketch below)
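To verify this, we can ask the NameNode which DataNode addresses it hands out for the blocks of the test file, and then try to open a TCP connection to each of them from the local machine. A minimal sketch (my addition, assuming the /data/infos.txt test file and a 5-second connect timeout); in this environment it reports 192.168.137.2:50010 as unreachable:

import java.net.{InetSocketAddress, Socket, URI}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocationCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs://hadoop001:9001"), new Configuration())
    val path = new Path("/data/infos.txt")
    val status = fs.getFileStatus(path)
    // The "ip:port" pairs the NameNode returns for each block of the file
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    for (loc <- locations; name <- loc.getNames) {
      val Array(host, port) = name.split(":")
      val socket = new Socket()
      val reachable =
        try { socket.connect(new InetSocketAddress(host, port.toInt), 5000); true }
        catch { case _: Exception => false }
        finally { socket.close() }
      println(s"DataNode $name reachable from the client: $reachable")
    }
    fs.close()
  }
}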
4. Problem Hypothesis
Because the local development machine and the cloud host are not on the same LAN, and the Hadoop configuration uses the private IP for communication between nodes, the client can still reach the NameNode, and the NameNode replies with the address of the machine that holds the data so the client can contact the data-transfer service. But since the NameNode and the DataNode talk to each other over the private network, the address returned is the DataNode's private IP, and the client cannot reach the DataNode at that address when it actually reads or writes data.
Let's look at part of the error output:
19/02/23 16:07:21 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
...
19/02/23 16:07:42 WARN DFSClient: Failed to connect to /192.168.137.2:50010 for block, add to deadNodes and continue....
The error shows that the client cannot connect to 192.168.137.2:50010, i.e. the DataNode's address; from outside the cloud network the DataNode is only reachable at 139.198.18.XXX:50010.
To let the development machine reach HDFS, we can access HDFS by hostname and have the NameNode return the DataNodes' hostnames instead of their IPs.
5. Solution
1. Attempt 1:
Map the DataNode's public IP to its hostname in the development machine's hosts file (already configured above), and add the following to the code that talks to HDFS:
val conf = new Configuration()
conf.set("dfs.client.use.datanode.hostname", "true")
Same error as before.
2. Attempt 2:
val spark = SparkSession
  .builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  .config("dfs.client.use.datanode.hostname", "true")
  .getOrCreate()
Same error as before.
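A likely explanation for attempts 1 and 2 (my note, not verified in the original post) is that the property never reaches the Hadoop Configuration that Spark actually uses for reading: SparkSession.config sets a plain Spark property, and Spark only forwards properties into its Hadoop Configuration when they carry the spark.hadoop. prefix, or when they are set on sparkContext.hadoopConfiguration directly. A sketch of both variants:

import org.apache.spark.sql.SparkSession

// Variant A: properties prefixed with "spark.hadoop." are copied into the Hadoop Configuration
val spark = SparkSession.builder()
  .appName("SparkSQLApp")
  .master("local[2]")
  .config("spark.hadoop.dfs.client.use.datanode.hostname", "true")
  .getOrCreate()

// Variant B: set the option directly on the Hadoop Configuration used by textFile()/DataFrame readers
spark.sparkContext.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")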
3. Attempt 3:
Add the following to the local (client-side) hdfs-site.xml:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
It runs successfully.
Further reading also recommends adding the dfs.datanode.use.datanode.hostname property to hdfs-site.xml, so that DataNode-to-DataNode communication also goes through hostnames:
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>
This makes changing private IPs much simpler and more convenient, and it also makes data exchange between specific DataNodes easier. The side effect is that when DNS resolution fails the whole Hadoop cluster stops working properly, so name resolution must be reliable.
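Since both properties make everything depend on hostname resolution, it is worth checking from each machine (the client and the cluster nodes) that the hostnames actually resolve to the intended addresses. A small, hypothetical check; the host list is an assumption for this single-node setup:

import java.net.{InetAddress, UnknownHostException}

object ResolveCheck {
  def main(args: Array[String]): Unit = {
    // In this pseudo-distributed setup there is only one hostname to check;
    // on the local development machine it should resolve to the public IP (139.198.18.xxx),
    // on the cloud host itself to the private IP (192.168.137.2)
    val hosts = Seq("hadoop001")
    for (h <- hosts) {
      try println(s"$h -> ${InetAddress.getByName(h).getHostAddress}")
      catch { case _: UnknownHostException => println(s"$h does not resolve") }
    }
  }
}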