Bloom filters for HBase

About Kuldeep Deshpande
An HBase Bloom filter is an efficient mechanism for testing whether a StoreFile contains a specific row or row-column cell.
Without Bloom filters, the only way to find a row key in a StoreFile is to check the StoreFile's block index, which stores the start row key of each block in the StoreFile. Bloom filters provide an in-memory structure that reduces disk reads to only the files likely to contain that row. A Bloom filter can report false positives but never false negatives, so any file it rules out can be skipped safely. In short, it can be thought of as an in-memory index for finding a row in a StoreFile.
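To make the idea concrete, here is a minimal sketch of a Bloom filter in Python. It is not HBase's implementation: the class name, the bit-array size, the number of hashes, and the use of SHA-256 are all assumptions chosen for readability. It only illustrates the "definitely absent" versus "possibly present" answer that the filter gives for a row key.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array probed at k hash-derived positions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not HBase defaults.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive k positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per (hypothetical) StoreFile, built from the row keys it holds.
store_file_rows = ["row-0001", "row-0002", "row-0003"]
bloom = BloomFilter()
for row in store_file_rows:
    bloom.add(row)
```

A read for a given row key would call `might_contain` first and open the file only when the answer is `True`. Every key that was added always returns `True`; a key that was never added usually returns `False`, with a small chance of a false positive.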
If your application regularly modifies all or most of the rows in an HBase table, most StoreFiles will contain a piece of the row you are searching for, so Bloom filters may not help much. With time-series data, where only a few records are updated at a time or updates arrive in batches, each row is written to a separate StoreFile.
In this case, a Bloom filter can improve HBase read performance considerably, because it lets HBase discard the StoreFiles that do not contain the row being searched for.
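The difference between the two write patterns can be sketched with a small simulation. This is a simplified model under stated assumptions: StoreFiles are represented as plain Python sets of row keys, a set lookup stands in for an ideal membership filter (a real Bloom filter would add occasional false positives), and the helper name `files_to_read`, the row keys, and the file counts are all hypothetical.

```python
def files_to_read(store_files, row_key):
    """Return the indexes of store files that actually hold row_key.

    A set lookup stands in for an ideal membership filter here; a real
    Bloom filter would also return occasional false positives.
    """
    return [i for i, rows in enumerate(store_files) if row_key in rows]


all_rows = [f"row-{i:04d}" for i in range(100)]

# Pattern A: every flush rewrites every row, so each file holds every row.
update_all = [set(all_rows) for _ in range(5)]

# Pattern B: time-series style, each flush holds a disjoint batch of rows.
time_series = [set(all_rows[i * 20:(i + 1) * 20]) for i in range(5)]

# Number of files a read for one row must touch under each pattern.
print(len(files_to_read(update_all, "row-0042")))   # no file can be skipped
print(len(files_to_read(time_series, "row-0042")))  # only one file is needed
```

In the first pattern the filter cannot rule out any file, so it adds overhead without saving reads. In the second, four of the five files are skipped for this lookup.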
We used airline traffic data for our Bloom filter experiments. About 5 million records from this dataset were loaded into an HBase table. The results are as follows: