Things to do to get best performance from HBase
This HBase performance tuning article is the result of several implementations we have carried out over the years. We can now confidently list a few points which, if implemented correctly, can make HBase very efficient and work as per your requirements.
Any discussion of Big Data invariably leads to Hadoop. However, another approach to managing Big Data effectively is to use NoSQL databases, which are becoming mainstream. A variety of NoSQL databases are widely used today, among them Cassandra, MongoDB, CouchDB, Neo4j, Accumulo, MarkLogic, Redis and HBase.
HBase holds a very important place in the list of NoSQL databases as it works on Hadoop by using HDFS as the underlying data store.
HBase is used widely for a variety of use cases. Though HBase and other NoSQL databases are being deployed in production, SQL is still the preferred way for end users to access data. The use of SQL layers such as Hive, Impala or, more recently, Phoenix for querying HBase is therefore inevitable.
Over the last couple of years, we have experienced a variety of scenarios in querying HBase databases using SQL engines, and every time the question that comes to the fore is what can be done to improve the performance of an HBase database. Having done several implementations, we can confidently list a few points which, if implemented correctly, can make HBase very efficient and work as per your requirements:
- Define and use the Rowkey in the right manner
- Use Compression of data
- Use Bloom Filters
A) Define and use Rowkey in the right manner
Why is Rowkey important – For any HBase database, the rowkey forms the crux of storing and accessing data. HBase is optimized for reads when data is queried on the basis of the row key: query predicates are applied to the row key as start and stop keys. Query engines such as Impala, Hive and BigInsights that read data from HBase translate predicates into efficient row key lookups when operators such as =, < or BETWEEN are applied against the row key. Reading data from HBase based on the rowkey is somewhat similar to using an index to read an RDBMS table: if you use the index in your query, you read only the required records; otherwise you scan the entire table, which is not efficient at all.
Use Case – In one of our implementations, an HBase table contained 50 million records and a query was written to retrieve 10 of them. We were using Impala external tables to read HBase. One would expect this query to be very fast, but it was taking a long time. We noticed that the query used a LIKE condition against the rowkey, so the predicate was not getting pushed down to the rowkey. HBase was scanning millions of records and then returning 10. Changing LIKE to BETWEEN did the magic: the query execution time came down from 2 minutes to 1.5 seconds.
Let us get down to details as to why this happened.
Our HBase table was a time series table with “Parameter”, “Value” and “Timestamp” as columns. The rowkey was a concatenation of “Parameter-Timestamp”. An external table was created in Impala against this HBase table with columns “Rowkey”, “Parameter”, “Value” and “Timestamp”. Rowkey was defined as a string column. The objective of the query was to find the sum of values for a specific parameter (say P11) for a specific range of timestamps.
Both of the queries below resulted in a full HBase table scan and performed inefficiently.
SELECT SUM(VALUE) FROM TABLE WHERE PARAMETER = 'P11' AND TIMESTAMP BETWEEN 'T1' AND 'T2'; (rowkey not used)
SELECT SUM(VALUE) FROM TABLE WHERE ROWKEY LIKE 'P11%' AND PARAMETER = 'P11' AND TIMESTAMP BETWEEN 'T1' AND 'T2';
The correct query, which makes use of the rowkey lookup, is:
SELECT SUM(VALUE) FROM TABLE WHERE ROWKEY BETWEEN 'P10' AND 'P12' AND PARAMETER = 'P11' AND TIMESTAMP BETWEEN 'T1' AND 'T2';
Similarly, comparing the rowkey against a non-constant value will result in a scan of the entire table.
Summary – If you plan to scan the entire HBase table, or the majority of it, you are probably using HBase for the wrong purpose. HBase is most efficient at reading a single row or a range of rows.
If your queries do not filter on the rowkey column, your rowkey design may be wrong. The row key should be designed to contain the information you need to find specific subsets of data.
When creating external tables in Hive / Impala against HBase tables, map the HBase rowkey to a string column in Hive / Impala. If this is not done, the rowkey is not used in the query and the entire table is scanned.
Carefully evaluate your query to check whether it will result in rowkey predicates. Using LIKE against the rowkey column does not produce a rowkey predicate, and the entire table is scanned.
The obvious question for a developer is: ‘How do I know whether I am writing the correct query against a table?’ This is easily answered by using the “Explain Plan” feature in Hive / Impala to evaluate query behavior.
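The mechanics behind this can be sketched in a few lines of Python. This is not HBase client code; the rowkeys, helper name and data below are invented for illustration, mirroring the “Parameter-Timestamp” key design above. It shows why a BETWEEN predicate can be pushed down as a bounded start/stop key scan over lexicographically sorted keys, while a LIKE predicate gives the engine no bounds and every row must be visited:

```python
import bisect

# Sorted rowkeys, as HBase stores them lexicographically.
rowkeys = sorted(f"P{p:02d}-T{t}" for p in range(10, 14) for t in range(1, 4))

def range_scan(start_key, stop_key):
    """Read only the slice between the start and stop keys (a rowkey lookup)."""
    lo = bisect.bisect_left(rowkeys, start_key)
    hi = bisect.bisect_right(rowkeys, stop_key)
    return rowkeys[lo:hi]

# ROWKEY BETWEEN 'P10' AND 'P12' becomes a bounded range scan:
subset = range_scan("P10", "P12")

# ROWKEY LIKE 'P11%' has no start/stop keys, so every row is examined:
like_subset = [k for k in rowkeys if k.startswith("P11")]

print(len(rowkeys), len(subset), len(like_subset))  # → 12 6 3
```

The range scan touches only the slice of keys between the bounds, whereas the LIKE filter walks all 12 keys even though it keeps only 3.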
B) Use compression for HBase tables
Why is compression important – Disk performance is almost always the bottleneck in Hadoop clusters, because Hadoop and HBase jobs are data intensive, making data reads the bottleneck in the overall application. With compression, the data occupies less space on disk, so reads operate on a smaller, compressed data size, which increases read performance. On the other hand, compression brings the need to uncompress the data after reading, which increases CPU load.
Three compression algorithms are commonly applied to HBase data: LZO, GZIP and Snappy. GZIP uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZIP is often a good choice for cold data, which is accessed infrequently; Snappy or LZO are a better choice for hot data, which is accessed frequently.
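The ratio-versus-CPU trade-off can be demonstrated with a rough stand-in. Snappy and LZO have no counterparts in Python’s standard library, so the sketch below uses zlib at its fastest level to loosely play the role of a fast, lower-ratio codec, and zlib at its highest level to play the role of GZIP; the sample data and exact ratios are illustrative only:

```python
import zlib

# Repetitive sample data, loosely resembling time series rows.
data = b"".join(b"P11,%06d,42.5\n" % t for t in range(10_000))

fast = zlib.compress(data, level=1)   # fast codec: lower CPU, lower ratio
best = zlib.compress(data, level=9)   # slow codec: higher CPU, higher ratio

# Both shrink the data, so fewer bytes are read from disk; the
# higher-ratio codec shrinks it further at extra CPU cost.
print(len(data), len(fast), len(best))
```

Because time series rows repeat heavily, both codecs shrink the data dramatically, which is exactly why reads against a compressed HBase table touch far fewer disk blocks.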
Use Case – We benchmarked compression performance using Snappy on airline traffic data. About 5 million records from this dataset were loaded into HBase. We created two tables: one without compression and a second with Snappy compression. An external table in Impala was created on top of each HBase table, and a set of queries was executed against both tables.
We executed following queries on the data:
Query 1 – select dayofmonth, avg(deptime) from airline1 group by dayofmonth;
Query 2 – select DayofMonth, sum(deptime) from airline2 where DayofMonth in (12, 18) and month in (1, 4, 6, 9) group by DayofMonth;
All other parameters of the system (number of nodes, memory, CPU, etc.) remained unchanged.
Results – The following readings were obtained after 7 distinct executions of each query.
A typical question is whether compression can be applied after data has already been loaded into HBase.
It can: alter the table, and the next time major compaction takes place the new compression will be applied. A major compaction cleans up StoreFiles and ensures that all data owned by a region server is local to that server.
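In HBase shell terms this looks roughly like the following (the table name ‘mytable’ and column family ‘cf’ are placeholders; the rewrite of existing StoreFiles happens during the compaction):

```
alter 'mytable', {NAME => 'cf', COMPRESSION => 'SNAPPY'}
major_compact 'mytable'
```

Data written before the alter remains in its old format until the major compaction rewrites it.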
We observed that Snappy compression increased read performance by 14% for the first query and 8% for the second, an overall performance gain of about 11%.
After testing on about 10 GB of test data, we enabled compression on a streaming-data HBase table in production and observed a performance gain in line with the experimental results above.
C) Use Bloom Filters
What is a Bloom Filter – An HBase Bloom Filter is an efficient mechanism to test whether a StoreFile contains a specific row or row-column cell. Without a Bloom Filter, the only way to decide whether a row key is contained in a StoreFile is to check the StoreFile’s block index, which stores the start row key of each block in the StoreFile. Bloom Filters provide an in-memory structure that reduces disk reads to only the files likely to contain the row. In short, a Bloom Filter can be considered an in-memory probabilistic index for deciding whether a row is likely to be found in a particular StoreFile.
If your application regularly modifies all or a majority of the rows of an HBase table, most StoreFiles will contain a piece of the row you are searching for, so Bloom Filters may not help much. For time series data where only a few records are updated at a time, or when data is updated in batches, each row ends up in a separate StoreFile. In this case, Bloom Filters improve HBase read performance considerably by discarding StoreFiles that do not contain the row being searched.
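As a toy illustration of the underlying idea (this is not HBase’s actual implementation; the sizes, hash scheme and sample keys are arbitrary), a Bloom Filter can be sketched in a few lines of Python:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several bit positions from salted digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(b"%d:%s" % (i, key)).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# One filter per (hypothetical) StoreFile: skip files whose filter says "no".
store_file_keys = [b"P11-T1", b"P11-T2", b"P11-T3"]
bloom = BloomFilter()
for k in store_file_keys:
    bloom.add(k)

print(bloom.might_contain(b"P11-T2"))   # True: the row is in this StoreFile
print(bloom.might_contain(b"P99-T9"))   # almost certainly False: skip the file
```

A “no” answer is always correct, so whole StoreFiles can be skipped without touching disk; a “yes” answer merely means the file must still be read, with a small, tunable false-positive rate.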
Use Case – We used airline traffic data for our Bloom Filter experiments. About 5 million records from this dataset were loaded into an HBase table. The results were as follows:
After testing the above settings on about 10 GB of test data, we implemented the same in a streaming-data HBase database and observed a performance gain in line with the experimental results above.
As the above shows, the performance of HBase can be improved considerably by implementing these points related to Rowkey design, Compression and Bloom Filters.