In the world of NoSQL


As the subject says, I recently needed to restart the NameNode while decommissioning a node.

After the restart, it crashed immediately with the following error message:

2014-08-16 11:52:52,528 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3176 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReport from 1.2.3.4:38596: error: java.lang.AssertionError:  SafeMode: Inconsistent filesystem state: SafeMode data: blockTotal=1044077 blockSafe=1044078; BlockManager data: active=1044088
java.lang.AssertionError:  SafeMode: Inconsistent filesystem state: SafeMode data: blockTotal=1044077 blockSafe=1044078; BlockManager data: active=1044088
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.doConsistencyCheck(FSNamesystem.java:4286)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.isOn(FSNamesystem.java:3960)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.checkMode(FSNamesystem.java:4091)
... cut ...
2014-08-16 11:52:52,749 FATAL org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
java.lang.AssertionError:  SafeMode: Inconsistent filesystem state: SafeMode data: blockTotal=1044077 blockSafe=1044078; BlockManager data: active=1044088
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.doConsistencyCheck(FSNamesystem.java:4286)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.isOn(FSNamesystem.java:3960)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem$SafeModeInfo.access$1400(FSNamesystem.java:3852)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.isInSafeMode(FSNamesystem.java:4380)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3060)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3032)
        at java.lang.Thread.run(Thread.java:722)
2014-08-16 11:52:52,752 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

I think the most interesting part of the above exception is:
Inconsistent filesystem state: SafeMode data: blockTotal=1044077 blockSafe=1044078; BlockManager data: active=1044088

blockSafe (1044078) and the BlockManager's active count (1044088) are both higher than blockTotal (1044077), by 1 and 11 blocks respectively, which looks suspicious.
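To make the comparison less error-prone when scanning a long NameNode log, the counters can be pulled out of the message with a small script. This is only a sketch: the regular expression is written against the exact message format quoted above, and the function names (parse_safemode_counters, describe_mismatch) are made up for this illustration, not part of Hadoop.

```python
import re

# Matches the three counters in the AssertionError message quoted in this post.
# The pattern is an assumption based on that one message format.
_COUNTERS = re.compile(
    r"blockTotal=(?P<total>\d+)\s+blockSafe=(?P<safe>\d+);"
    r"\s*BlockManager data:\s*active=(?P<active>\d+)"
)


def parse_safemode_counters(line):
    """Return {'total': ..., 'safe': ..., 'active': ...} or None if no match."""
    match = _COUNTERS.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def describe_mismatch(counters):
    """Report how far blockSafe and active are from blockTotal."""
    total = counters["total"]
    return {
        "safe_minus_total": counters["safe"] - total,
        "active_minus_total": counters["active"] - total,
    }


line = (
    "SafeMode: Inconsistent filesystem state: SafeMode data: "
    "blockTotal=1044077 blockSafe=1044078; BlockManager data: active=1044088"
)
print(describe_mismatch(parse_safemode_counters(line)))
# prints {'safe_minus_total': 1, 'active_minus_total': 11}
```

Positive differences mean the NameNode has been told about more blocks than it thinks the filesystem contains.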

My next step was to stop all DataNodes and then try to start the cluster again. The same crash happened every time I retried.

After a few attempts I noticed that one DataNode was also crashing together with the NameNode. Its log contained:

2014-08-16 11:50:38,779 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
    Two block files with the same block id exist on disk:  /data/1/.../blk_1234 and /data/2/.../blk_1234
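A warning like this can be cross-checked by scanning the DataNode's data directories for block files that share an ID. The following is a rough sketch, not a Hadoop tool: it assumes block files are named blk_<id> (with the accompanying .meta files ignored), and the function name find_duplicate_blocks and the example directories are placeholders for illustration.

```python
import os
import re
from collections import defaultdict

# Block data files only; names such as blk_1234_1001.meta are skipped on purpose.
_BLOCK_FILE = re.compile(r"^blk_(-?\d+)$")


def find_duplicate_blocks(data_dirs):
    """Map block id -> sorted list of paths, for ids found more than once."""
    seen = defaultdict(list)
    for root_dir in data_dirs:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                match = _BLOCK_FILE.match(name)
                if match:
                    seen[int(match.group(1))].append(os.path.join(dirpath, name))
    return {bid: sorted(paths) for bid, paths in seen.items() if len(paths) > 1}


# Example call (directories are placeholders, adjust to your dfs data dirs):
# for block_id, paths in find_duplicate_blocks(["/data/1", "/data/2"]).items():
#     print(block_id, paths)
```

The script only reads the directory tree. Which copy of a duplicated block is the valid one is a separate question that it does not try to answer.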

The solution was to start the NameNode and all DataNodes except the one that was crashing.

After waiting about 10-15 minutes I started the last DataNode, and it came back up without any problems. It seems that the NameNode needed all the DataNodes without duplicate blocks to settle before the DataNode with the duplicate blocks could be started.
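Instead of waiting a fixed 10-15 minutes, one option is to poll the NameNode until it reports that safe mode is off before starting the last DataNode. This is a sketch under an assumption I did not verify in this incident: that leaving safe mode is a good enough signal that the other DataNodes have settled. It shells out to hdfs dfsadmin -safemode get and looks for the text "Safe mode is OFF". The function names and the timeout values are arbitrary choices for the example.

```python
import subprocess
import time


def safemode_status():
    """Return the output of 'hdfs dfsadmin -safemode get' as text."""
    result = subprocess.run(
        ["hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def wait_for_safemode_off(get_status=safemode_status, timeout_s=1800,
                          poll_s=30, sleep=time.sleep):
    """Poll until safe mode is reported OFF; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if "Safe mode is OFF" in get_status():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Usage idea: start the NameNode and the healthy DataNodes first, then
# if wait_for_safemode_off():
#     ...start the remaining DataNode by whatever means you normally use...
```

The get_status and sleep parameters exist only so the loop can be exercised without a running cluster.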

August 16, 2014 · Uncategorized
