
If you have limited connectivity between nodes, for example half your DataNodes connected to one switch and the other half to another, it's wise to configure rack awareness so that the NameNode knows about this.

To configure this, set the property topology.script.file.name to point to a script that takes node names as arguments and prints each node's rack location to standard output, space-separated, in the same order. The script needs to handle both hostnames and IP addresses.

To configure the NameNode, add the following to /etc/hadoop/conf/hdfs-site.xml:

  <property>
    <name>topology.script.file.name</name>
    <value>/etc/hadoop/conf/topology.sh</value>
  </property>

My topology script looks like this:

#!/bin/bash

HADOOP_CONF=/etc/hadoop/conf

# Build a lookup table from topology.data (one "hostname ip rack" line
# per node), keyed on both hostname and IP address.
# Note: associative arrays require bash >= 4.
declare -A topology
while read host ip rack; do
  topology[$host]=$rack
  topology[$ip]=$rack
done < ${HADOOP_CONF}/topology.data

# Print the rack for each argument, space separated, falling back to
# /default/rack for nodes missing from topology.data.
while [ -n "$1" ]; do
  echo -n "${topology[$1]:=/default/rack} "
  shift
done

And my topology.data contains:

node1.domain.com 1.2.3.1 /dc-se/rack-1
node2.domain.com 1.2.3.2 /dc-se/rack-2
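
As a quick sanity check, the script can be run by hand with a mix of hostnames and IP addresses. With the topology.data above, something like this should print the two racks, plus /default/rack for any unknown node (the output is space separated, without a trailing newline):

$ /etc/hadoop/conf/topology.sh node1.domain.com 1.2.3.2 unknown-host
/dc-se/rack-1 /dc-se/rack-2 /default/rack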


§590 · August 16, 2014 · Hadoop


I got the following exception when starting the DataNode after it had terminated due to a disk failure (the server had not been rebooted in between):

2013-10-11 11:24:02,122 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:718)
	at org.apache.hadoop.ipc.Server.bind(Server.java:403)
	at org.apache.hadoop.ipc.Server.bind(Server.java:375)
	at org.apache.hadoop.hdfs.net.TcpPeerServer.<init>(TcpPeerServer.java:106)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.initDataXceiver(DataNode.java:555)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:741)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:344)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1795)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1728)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1751)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1904)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1925)
2013-10-11 11:24:02,126 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2013-10-11 11:24:02,128 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hbase10.network.com/1.2.3.4
************************************************************/

When an application crashes it can leave a lingering socket behind, and to rebind to that port early the new process needs to set the SO_REUSEADDR socket option when binding. The HDFS DataNode doesn't do that, and I didn't want to restart the HBase RegionServer (which was holding the port through a connection it hadn't realized was dead).
The solution was to bind to the port with an application that does set SO_REUSEADDR and then stop that application again; I used netcat for that:

[root@hbase10 ~]#  nc -l 50010
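
Put together, the whole dance looked roughly like this (a sketch; the service name here assumes the CDH init script and may differ on your installation):

netstat -anp | grep ':50010'          # see what is still holding the port
nc -l 50010 &                         # grab the port; nc binds it with SO_REUSEADDR
kill %1                               # release it again
service hadoop-hdfs-datanode start    # start the DataNode back up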


§577 · October 11, 2013 · Hadoop


When upgrading from CDH3 to CDH4 I ran into the following problem when attempting to start the NameNode again:

2013-09-23 22:53:42,859 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /dfs/nn/in_use.lock acquired by nodename 32133@hbase1.network.com
2013-09-23 22:53:42,903 INFO org.apache.hadoop.hdfs.server.namenode.NNStorage: Using clusterid: CID-9ab09a80-a367-42d4-8693-6905b9c5a605
2013-09-23 22:53:42,913 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /dfs/nn/current
2013-09-23 22:53:42,928 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loading image file /dfs/nn/current/fsimage using no compression
2013-09-23 22:53:42,928 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files = 183625
2013-09-23 22:53:44,280 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Number of files under construction = 0
2013-09-23 22:53:44,282 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.lang.AssertionError: Should have reached the end of image file /dfs/nn/current/fsimage
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Loader.load(FSImageFormat.java:235)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:786)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:692)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:647)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:349)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:261)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:639)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:476)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:437)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:613)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:598)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1169)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1233)
2013-09-23 22:53:44,286 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2013-09-23 22:53:44,288 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hbase1.network.com/1.2.3.4
************************************************************/

Luckily I came across a post on the cdh-user mailing list by Bob Copeland, which read:

I instrumented the code around the exception and found that the loader had read all but 16 bytes of the file, and the remaining 16 bytes are all zeroes. So chopping off the last 16 bytes of padding was a suitable workaround, i.e.:

fsimage=/var/lib/hadoop/dfs/name/current/fsimage
cp $fsimage{,~}
size=$(stat -c %s $fsimage)
dd if=$fsimage~ of=$fsimage bs=$[size-16] count=1

Is this a known issue? I did all these tests in a scratch cdh3u5 VM and can replicate at will if needed.

-Bob

That workaround solved my problem as well.
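
For anyone else hitting this: before chopping anything off, it's worth confirming that the trailing 16 bytes really are zero padding. A small sketch (the fsimage path is the one from my NameNode log above, so adjust it to your own name directory, and back the file up first):

fsimage=/dfs/nn/current/fsimage        # path taken from the NameNode log above
tail -c 16 "$fsimage" | od -An -tx1    # should print nothing but zero bytes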

Ref: http://grokbase.com/p/cloudera/cdh-user/12ckdj9m47/cdh4-fsimage-upgrade-failure-workaround


§558 · October 1, 2013 · Hadoop


I ran out of disk space on the server running the NameNode, HBase Master, an HBase RegionServer and a DataNode, and during the subsequent restarts the HBase Master wouldn't start.
During log splitting it died with the following error:

2013-07-02 19:52:12,269 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.lang.AssertionError
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader$WALReaderFSDataInputStream.getPos(SequenceFileLogReader.java:112)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1491)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:57)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:158)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:648)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:834)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:750)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:283)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:202)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:275)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:205)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:408)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:301)
2013-07-02 19:52:12,271 INFO org.apache.hadoop.hbase.master.HMaster: Aborting

I found two ways to get it to start up again. The first one I tried was to move the log-splitting directory out of the way in HDFS with the following command (doing this is strongly discouraged):

$ hadoop fs -mv /hbase/.logs/hbase1.domain.com,60020,1367325077343-splitting /user/hdfs

After some help from #hbase on irc.freenode.net I moved it back and instead started the HBase Master with Java assertions disabled, which solved the issue.

To disable assertions in the JVM, make sure that the -da (or -disableassertions) flag is passed to java when it is invoked.

I did this by editing /etc/hbase/conf/hbase-env.sh and adding -da to the HBASE_MASTER_OPTS environment variable.
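
For reference, a minimal sketch of that change, assuming a CDH-style hbase-env.sh (the exact contents of the file will differ between installs):

# /etc/hbase/conf/hbase-env.sh
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -da"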




HBase crashed for me last night, due to the extra leap second that was inserted (2012-06-30 23:59:60).

 

When I attempted to restart HBase, it just didn't start. I found this resource with a tip that might get it back up (although I found it after rebooting my servers, so I didn't try it):
http://pedroalves-bi.blogspot.com/2012/07/java-leap-second-bug-how-to-fix-your.html

 

All Java processes (including all HDFS-related ones) were using 100% CPU, together with ksoftirqd. I turned off ntpd autostart (chkconfig ntpd off), rebooted the servers, and then started my HBase cluster back up. This solved the issue.
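
In concrete terms, that amounted to roughly the following on each CentOS server:

chkconfig ntpd off   # keep ntpd from starting again at boot
reboot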


§163 · July 1, 2012 · Hadoop, HBase


I've previously found a great add-on to Hadoop Streaming called "hadoop-hbase-streaming", which lets you use an HBase table as the input or output format for your Hadoop Streaming MapReduce jobs, but it hasn't been working since a recent API change.

The error was:

Error: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io.RowResult

I just found a fork of it on GitHub by David Maust that has been updated for newer versions of HBase.

You can find the fork here:
https://github.com/dmaust/hadoop-hbase-streaming
And the original branch here:
https://github.com/wanpark/hadoop-hbase-streaming


§98 · May 2, 2012 · Hadoop, HBase


I increased the number of map tasks in Hadoop to 64 per TaskTracker, and the TaskTracker started to crash every time I launched a MapReduce job.

 

The errors were:

java.lang.OutOfMemoryError: unable to create new native thread

And:

org.apache.hadoop.mapred.DefaultTaskController: Unexpected error launching task JVM java.io.IOException: Cannot run program "bash" (in directory "/data/1/mapred/local/taskTracker/hdfs/jobcache/job_201110201642_0001/attempt_201110201642_0001_m_000031_0/work"): error=11, Resource temporarily unavailable.

 

Googling for this problem turned up the following suggestions:

  1. Increase the heap size for the TaskTracker. I did this by changing HADOOP_HEAPSIZE to 4096 in /etc/hadoop/conf/hadoop-env.sh to test. This did not solve it.
  2. Increase the heap size for the spawned child tasks, by adding -Xmx1024m to mapred.map.child.java.opts in mapred-site.xml. This did not solve it.
  3. Make sure that the limit of open files is not reached. I had already done this by adding "mapred - nofile 65536" to /etc/security/limits.conf. This did not solve it.

I decided to sudo to the mapred user and check the ulimits again; what stood out as off was:

max user processes              (-u) 1024
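
(For reference, a sketch of that check; ulimit is a shell builtin, so it has to be run from a shell started as the mapred user.)

sudo -u mapred bash -c 'ulimit -a'   # all limits for the mapred user
sudo -u mapred bash -c 'ulimit -u'   # just the max user processes value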

 

Adding the following to /etc/security/limits.conf and restarting the TaskTracker solved it:

mapred - nproc 8192

 

Apparently CentOS limits the number of processes for regular users to 1024 by default.


§50 · October 20, 2011 · Hadoop