Fixing datanodes that shut down automatically after starting in a Hadoop cluster

Date: 2019-11-17


ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /var/lib/hadoop-0.20/cache/hdfs/dfs/data: namenode namespaceID = 240012870; datanode namespaceID = 1462711424
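To check quickly which directory and which two IDs are involved, the message above can be picked apart with a regular expression. This is only a minimal sketch: the function name parse_namespace_mismatch is made up for illustration, and the pattern assumes the log line has exactly the shape shown above.

```python
import re

# Pattern for the "Incompatible namespaceIDs" message quoted above.
# Assumption: the wording and punctuation match that log line exactly.
PATTERN = re.compile(
    r"Incompatible namespaceIDs in (?P<data_dir>\S+): "
    r"namenode namespaceID = (?P<namenode_id>\d+); "
    r"datanode namespaceID = (?P<datanode_id>\d+)"
)


def parse_namespace_mismatch(log_line):
    """Return (data_dir, namenode_id, datanode_id), or None if the line does not match."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    return (
        match.group("data_dir"),
        int(match.group("namenode_id")),
        int(match.group("datanode_id")),
    )
```

The namenode ID returned here is the value the datanode would need in order to start again.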

Problem: the namespaceID on the namenode does not match the namespaceID on the datanode.

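Cause: every namenode format creates a new namespaceID, while tmp/dfs/data still holds the ID from the previous format. Formatting the namenode clears the namenode's data but not the datanode's data, so the namespaceID on the namenode no longer matches the one on the datanode, and the datanode fails to start.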

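First solution:

(1) Stop the cluster services.

(2) On the problematic datanode, delete the data directory, i.e. the directory configured as dfs.data.dir in hdfs-site.xml; on this machine it is /var/lib/hadoop-0.20/cache/hdfs/dfs/data/. (Note: at the time, we ran this step on all datanode and namenode nodes. In case the deletion does not fix the problem, you can save a copy of the data directory first.)

(3) Format the namenode.

(4) Restart the cluster.

Problem solved.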
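The backup mentioned in step (2) could look roughly like the sketch below. The function name backup_data_dir and the choice of backup location are assumptions for illustration, and the copy should only be taken while the cluster is stopped.

```python
import shutil
from pathlib import Path


def backup_data_dir(data_dir, backup_dir):
    """Copy the datanode data directory to backup_dir before it is deleted."""
    src = Path(data_dir)
    dst = Path(backup_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"data directory not found: {src}")
    if dst.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(f"backup target already exists: {dst}")
    shutil.copytree(src, dst)
    return dst
```

For the machine described above, data_dir would be /var/lib/hadoop-0.20/cache/hdfs/dfs/data; the backup path is up to you.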

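A side effect of this method is that all data on HDFS is lost. If HDFS holds important data, this method is not recommended; try the second method from the linked page instead.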

Two solutions are given below; I used the second one.

Workaround 1: Start from scratch

I can testify that the following steps solve this error, but the side effects won't make you happy (me neither). The crude workaround I have found is to:

1. stop the cluster

2. delete the data directory on the problematic datanode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data

3. reformat the namenode (NOTE: all HDFS data is lost during this process!)

4. restart the cluster

When deleting all the HDFS data and starting from scratch does not sound like a good idea (it might be ok during the initial setup/testing), you might give the second approach a try.

Workaround 2: Updating namespaceID of problematic datanodes

Big thanks to Jared Stehler for the following suggestion. I have not tested it myself yet, but feel free to try it out and send me your feedback. This workaround is "minimally invasive" as you only have to edit one file on the problematic datanodes:

1. stop the datanode

2. edit the value of namespaceID in <dfs.data.dir>/current/VERSION to match the value of the current namenode

3. restart the datanode

If you followed the instructions in my tutorials, the full path of the relevant file is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION (background: dfs.data.dir is by default set to ${hadoop.tmp.dir}/dfs/data, and we set hadoop.tmp.dir to /usr/local/hadoop-datastore/hadoop-hadoop).
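The edit in step 2 can also be scripted. The following is a minimal sketch, not a tested procedure: the function name set_namespace_id is hypothetical, it assumes VERSION is a plain key=value text file with exactly one namespaceID line, and it should only be run while the datanode is stopped.

```python
import re
from pathlib import Path


def set_namespace_id(version_path, new_id):
    """Rewrite the namespaceID line of a VERSION file, leaving all other lines untouched."""
    path = Path(version_path)
    text = path.read_text()
    updated, count = re.subn(
        r"(?m)^namespaceID=.*$", f"namespaceID={int(new_id)}", text
    )
    if count != 1:
        raise ValueError(f"expected exactly one namespaceID line, found {count}")
    # Keep a copy of the original next to it (VERSION.bak) before overwriting.
    path.with_name(path.name + ".bak").write_text(text)
    path.write_text(updated)
```

A call would pass the full path of the VERSION file and the namenode's namespaceID, for example the value reported in the datanode's error message.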

If you wonder what the contents of VERSION look like, here's one of mine:

#contents of <dfs.data.dir>/current/VERSION

namespaceID=393514426

storageID=DS-1706792599-10.10.10.1-50010-1204306713481

cTime=1215607609074

storageType=DATA_NODE

layoutVersion=-13
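Since the file is just key=value lines, it is easy to read programmatically, for example to compare the namespaceID on several nodes. A small sketch, with read_version_file as a made-up helper name and the assumption that lines starting with # are comments:

```python
def read_version_file(path):
    """Parse a VERSION file into a dict of string keys and string values."""
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comment lines such as "#contents of ...".
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key] = value
    return props
```

With the file shown above, read_version_file(path)["namespaceID"] gives the string "393514426". As far as I know the namenode keeps a VERSION file of the same shape under <dfs.name.dir>/current/, so the two values can be compared directly, but check the location on your own installation.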

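Cause: every namenode format creates a new namespaceID, while tmp/dfs/data still holds the ID from the previous format. Formatting the namenode clears the namenode's data but not the datanode's data, which makes startup fail. The fix is to clear all directories under tmp before each format.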