In this article, we will discuss Bloom filters.
An HBase Bloom filter is an efficient mechanism for testing whether a StoreFile contains a specific row or row-column cell. Without a Bloom filter, the only way to decide whether a row key is contained in a StoreFile is to check the StoreFile’s block index, which stores the start row key of each block in the StoreFile. Bloom filters provide an in-memory structure that reduces disk reads to only the files likely to contain that row. In short, a Bloom filter can be thought of as an in-memory index that tells us whether a row is possibly present in a particular StoreFile or definitely absent from it.
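To make the "possibly present or definitely absent" behaviour concrete, here is a minimal sketch of a Bloom filter in Python. It is not HBase's implementation: the class name `BloomFilter`, the default sizes, and the SHA-256 based double hashing are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus k derived hash positions.

    Illustrative only; HBase's own Bloom filter code differs.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive k positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# One filter per (hypothetical) StoreFile, filled with its row keys.
store_file_filter = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    store_file_filter.add(row_key)

print(store_file_filter.might_contain("row-002"))  # True: added keys are never missed
```

A `False` answer lets a read skip the file entirely. A `True` answer can occasionally be a false positive, so the file still has to be checked through its block index.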
If your application regularly modifies all or most of the rows in HBase, most StoreFiles will hold a piece of the row you are searching for, so Bloom filters may not help much. With time series data, where only a few records are updated at a time, or when data is updated in batches, a given row is written to only one StoreFile (or a few). In this case, Bloom filters significantly improve HBase read performance by discarding the StoreFiles that do not contain the row being searched for.
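The difference between the two write patterns can be sketched with a small simulation. The layout below (10 files, 1000 rows, the helper `files_containing`) is hypothetical, and plain Python sets stand in for an ideal Bloom filter with no false positives, so the numbers show the best case rather than measured HBase behaviour.

```python
def files_containing(row_key, store_files):
    """Return the indices of the store files that actually hold row_key.

    A Bloom filter can only rule out the other files, so this is the
    minimum set of files a read has to touch.
    """
    return [i for i, rows in enumerate(store_files) if row_key in rows]


# Hypothetical layout: 10 store files, 1000 distinct rows.
NUM_FILES = 10
NUM_ROWS = 1000

# Pattern A: every flush touches all rows, so every file holds every row.
update_all = [{f"row{r}" for r in range(NUM_ROWS)} for _ in range(NUM_FILES)]

# Pattern B: batch or time-series writes, each row lands in exactly one file.
batched = [
    {f"row{r}" for r in range(f * 100, (f + 1) * 100)}
    for f in range(NUM_FILES)
]

hits_all = files_containing("row42", update_all)
hits_batched = files_containing("row42", batched)

print(len(hits_all))      # 10: no file can be skipped
print(len(hits_batched))  # 1: nine of the ten files can be skipped
```

In pattern A a filter has nothing to discard, while in pattern B it can rule out every file except the one holding the row.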
We used airline traffic data for our Bloom filter experiments. About 5 million records from this dataset were loaded into an HBase table. The results were as follows:
After testing the above settings on about 10 GB of test data, we applied the same settings to our HBase database for streaming data. The performance gain we observed there was in line with the experimental results above.