What if your HBase seems to be working but is still not working!

Posted by Jagadeesan A S on May 9, 2017 in Blog

HBase is a robust NoSQL database for handling large volumes of data. It is built on the Hadoop platform, so it is preferred by those who use Hadoop extensively. In terms of the CAP theorem, HBase favors Consistency and Partition-Tolerance (the C and P), and it suits data that can be represented as key-value pairs.

We employ HBase as a data store for several of our clients who handle streaming data. Its distributed master-slave architecture suits these workloads well.

Having worked on HBase, we have encountered several scenarios in which HBase reported problems. Each case was interesting, and with our knowledge of the technology, experience, data from the log files and sometimes plain common sense, we were able to get out of the problem comfortably. Recently, however, we encountered a unique case in our HBase distribution: HBase seemed to be working but was actually not working!

I know that this sounds plain stupid, but it is exactly what I have written. We are working on an interesting real-time Internet of Things (IoT) project. We were using CDH 5.11 as the Hadoop distribution, and the status of HBase on the Cloudera Manager console showed green/amber. However, we were unable to run Impala queries on HBase tables and get data. Restarting HBase and the cluster did not help, as nothing appeared to be wrong. The stdout and stderr logs of the HBase Master instance did not indicate an error either. The HBase logs did indicate some problems, though: it seemed that the master was unable to initialize. Closer inspection of the error messages revealed a problem when HBase was trying to split the log files.

A few of the recommended solutions were as follows:

  1. Increase the configuration parameter for xxxxxxxx from the default value of 300000 ms
  2. Check ZooKeeper for errors
  3. Stop the HBase service, restart all RegionServers first and then start the Master

Lastly, another gem of a solution:

  4. Delete HBase and reinstall the service!

None of the above worked (we did not dare try the fourth one!), though I’m sure they might have worked in a few cases.

On a closer look, we realised that the error pointed to a “-splitting” file on one of the RegionServers. We checked the Write Ahead Logs (WALs) and noticed that for each of the RegionServers, there was indeed a -splitting file under the directory “/hbase/WALs”.

When none of our approaches worked, we decided to try deleting the -splitting files. Since these are part of the WAL, we had to be careful, or else we would have lost data. The steps we followed were:

  • Stop the data processing on HBase
  • Bring down HBase
  • Manually delete the -splitting files from the RegionServers
  • Restart HBase
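
The manual deletion step can be sketched roughly as below. This is only an illustration under assumptions, not the exact procedure we ran: the helper names (find_splitting_paths, list_wal_root, removal_commands) are hypothetical, the listing format assumed is the usual output of `hdfs dfs -ls`, and the delete commands are only built as strings so that they can be reviewed before anyone runs them.

```python
import subprocess
from typing import List

WAL_ROOT = "/hbase/WALs"  # WAL directory named in this post


def find_splitting_paths(ls_output: str) -> List[str]:
    """Return the paths ending in '-splitting' from `hdfs dfs -ls` output."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Entry lines end with the path; blank lines and the
        # 'Found N items' header do not, so they are skipped.
        if fields and fields[-1].startswith("/") and fields[-1].endswith("-splitting"):
            paths.append(fields[-1])
    return paths


def list_wal_root() -> str:
    """Run `hdfs dfs -ls` on the WAL root (needs the hdfs CLI on PATH)."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", WAL_ROOT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def removal_commands(paths: List[str]) -> List[str]:
    """Build delete commands as strings so they can be reviewed first."""
    return [f"hdfs dfs -rm -r {path}" for path in paths]
```

With HBase stopped, something like `removal_commands(find_splitting_paths(list_wal_root()))` would give the list of commands to review. Moving the paths aside with `hdfs dfs -mv` instead of removing them is a more cautious variant, though that is our suggestion here and not what the steps above describe.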

And voila! HBase was not only up and running but also started accepting queries and processing them.

Why does splitting happen?

Before understanding the split policy of HBase, it is essential to know how the HBase write path works.

HBase Write Path

Source: http://bit.ly/2qUJrY8

 

As per the split policy of HBase, a region is split when the total data size of one of its stores (each store corresponds to a column family) grows beyond the configured “hbase.hregion.max.filesize”, which has a default value of 10 GB.
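
The size check described above can be sketched as follows. This is a simplified model for illustration, not HBase's actual code: the function name region_should_split and the constants are hypothetical, and real split policies may apply additional rules on top of this threshold.

```python
from typing import Iterable

GIB = 1024 ** 3
# Default of hbase.hregion.max.filesize as stated in this post (10 GB).
DEFAULT_MAX_FILESIZE = 10 * GIB


def region_should_split(store_sizes_bytes: Iterable[int],
                        max_filesize: int = DEFAULT_MAX_FILESIZE) -> bool:
    """True if any store (one per column family) exceeds the size limit."""
    return any(size > max_filesize for size in store_sizes_bytes)
```

For example, a region with two column families whose stores hold 2 GiB and 11 GiB would qualify for a split under this model, because one store alone exceeds the limit.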

HBase Log Splitting

Source: http://bit.ly/2qM7kEK

 

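If the cluster goes down because of some issue, the HBase services may fail to come back up on restart. This is because the temporary -splitting files are not flushed out while HBase is trying to restart. These files come from WAL log splitting, which the master performs while recovering a failed RegionServer and which is separate from the region splitting described above. Hence, you need to delete those corrupted split files manually.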
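
To make the idea of log splitting concrete, here is a toy sketch of what the step does conceptually: the edits in one RegionServer's WAL are grouped by region so that each region can replay its own edits. The WalEdit type and split_log function are hypothetical names for illustration and do not reflect HBase's internal classes.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class WalEdit(NamedTuple):
    region: str    # region the edit belongs to
    sequence: int  # order in which the edit was written
    payload: str   # stand-in for the actual cell data


def split_log(edits: List[WalEdit]) -> Dict[str, List[WalEdit]]:
    """Group one server's WAL edits by region, keeping write order."""
    by_region: Dict[str, List[WalEdit]] = defaultdict(list)
    for edit in sorted(edits, key=lambda e: e.sequence):
        by_region[edit.region].append(edit)
    return dict(by_region)
```

If this grouping step cannot finish, the edits stay in their -splitting location, which matches the state we found under /hbase/WALs.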

Deleting anything from the HBase system files is a tad risky.

A good understanding of the architecture and its various components is essential before trying this out. Our team was glad that we figured out a way of taming an unexpected behavior of HBase. And our client was happy that we resolved it!