Enemy #1 for Hadoop adoption – Bad Data Quality

“Fast is fine but accuracy is everything”  – Wyatt Earp

What Wyatt Earp said about gunfights in the Wild West applies surprisingly well to the Big Data / Hadoop / IoT movement, which is equally wild! In its 2015 Big Data market update, Forbes confirmed that data quality is the biggest problem in developing big data applications.

In fact, poor data quality is the biggest challenge to Hadoop adoption at production level.

Initial enthusiasm in enterprises about Big Data / Hadoop / IoT / M2M etc. fizzles out quickly as stakeholders get frustrated with the poor quality of the data and of the resulting analytics.

In my last 3 implementations of streaming / IoT data on the HBase platform, I observed a consistent pattern of data quality issues. We implemented a data quality measurement framework for streaming data, which I will describe in this blog. This framework is built upon the dimensions of data quality defined by DAMA (Data Management Association).

1. Timeliness – For streaming data, timeliness is the most significant dimension. A particular sensor may stop sending data for a few hours and then suddenly send a backlog of data, which results in late arrival of data. Another case we observed is 'out of sequence data'. In one IoT implementation, sensors stopped transmitting data for a few minutes; once the issue was resolved, the latest data was sent first and the older data was sent afterwards.

Timeliness of streaming data can be measured as follows:

Timeliness = (Number of records arriving within a given time limit, e.g. 1 minute) / (Total number of records generated in a given timeframe)

Timeliness is, most of the time, a post facto measurement, since the total number of records generated can be measured only after those records are actually loaded into the database. We measure timeliness by periodically running a script that compares the timestamp of record generation at the sensor end with the timestamp at which the record was loaded into the database.
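A minimal Python sketch of such a comparison is shown here. It assumes each record carries a generation timestamp from the sensor and a load timestamp from the database; the `timeliness` function, the tuple layout and the sample data are illustrative assumptions, not the script used in these projects.

```python
from datetime import datetime, timedelta


def timeliness(records, limit=timedelta(minutes=1)):
    """Fraction of records loaded within `limit` of being generated.

    `records` is a list of (generated_at, loaded_at) datetime pairs.
    Returns None when there are no records to measure.
    """
    if not records:
        return None
    on_time = sum(
        1 for generated_at, loaded_at in records
        if loaded_at - generated_at <= limit
    )
    return on_time / len(records)


# Hypothetical sample: five records, one of which arrives three hours late.
t0 = datetime(2016, 1, 1, 12, 0, 0)
records = [
    (t0, t0 + timedelta(seconds=5)),
    (t0 + timedelta(seconds=10), t0 + timedelta(seconds=40)),
    (t0 + timedelta(seconds=20), t0 + timedelta(hours=3)),
    (t0 + timedelta(seconds=30), t0 + timedelta(seconds=50)),
    (t0 + timedelta(seconds=40), t0 + timedelta(seconds=45)),
]
print(timeliness(records))  # 0.8
```

The 1 minute limit is only the example value from the text; in practice it would be set per implementation.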

2. Completeness – Data completeness can be defined as the availability of the mandatory elements of data. Some IoT systems transfer data in the form of text files that contain data accumulated at the sensor end over a short period (say 1 minute). To ensure completeness of data, a header and a footer are added to each file. Incomplete transmission results in a file arriving on the server without a header or footer, which indicates that the data in that file may be incomplete. We have measured completeness for streaming data as follows:

Completeness = (Number of files received complete on the server) / (Total number of files received on the server)
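The header and footer check can be sketched in Python as follows. The marker strings `HDR` and `FTR`, the file names and the in-memory representation of files as lists of lines are all assumptions made for illustration; the actual file format is not described here.

```python
def is_complete(lines, header="HDR", footer="FTR"):
    """A file counts as complete if it starts with the header line
    and ends with the footer line (marker values are hypothetical)."""
    return len(lines) >= 2 and lines[0] == header and lines[-1] == footer


def completeness(files):
    """Fraction of received files that have both header and footer.

    `files` maps a file name to its list of lines.
    Returns None when no files were received.
    """
    if not files:
        return None
    complete = sum(1 for lines in files.values() if is_complete(lines))
    return complete / len(files)


# Hypothetical sample: two of the four files were cut off in transit.
files = {
    "sensor1_1200.txt": ["HDR", "70.1", "70.3", "FTR"],
    "sensor1_1201.txt": ["HDR", "70.2"],   # footer missing
    "sensor2_1200.txt": ["70.0", "FTR"],   # header missing
    "sensor2_1201.txt": ["HDR", "FTR"],    # complete, empty payload
}
print(completeness(files))  # 0.5
```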

3. Integrity – Data that is missing important linkages to other elements of data (mainly the master data) is said to have integrity issues. In multiple IoT implementations, we observed that the master data about the devices has to be configured on the server first, and only then can the devices start sending data. In a telecom IoT project, a handset had to be registered before it could send streaming data to the server for tracking user experience and network performance. As usual, there were misses in configuring the device master data, and devices started sending data before configuration was done. This resulted in an "orphan sensor data" condition. We measured data integrity as

Integrity = 1 – (Number of orphan / un-configured devices) / (Total number of devices in the installation)
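As a rough Python sketch, orphan devices can be found by comparing the set of device IDs seen in the stream with the set configured in the master data. The device IDs are made up, and counting the total as every device that is either configured or sending is an assumption about what "devices in the installation" covers.

```python
def integrity(configured, sending):
    """1 - (orphan devices / total devices).

    `configured` is the set of device IDs present in the master data,
    `sending` is the set of device IDs seen in the incoming stream.
    Assumption: the installation total is the union of both sets.
    """
    total = configured | sending
    if not total:
        return None
    orphans = sending - configured  # sending data but never configured
    return 1 - len(orphans) / len(total)


# Hypothetical sample: dev-4 started sending before it was registered.
configured = {"dev-1", "dev-2", "dev-3"}
sending = {"dev-1", "dev-2", "dev-3", "dev-4"}
print(integrity(configured, sending))  # 0.75
```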

4. Accuracy and Consistency – Traditional data quality literature defines accuracy and consistency as two separate dimensions. However, in streaming / sensor / IoT data, the two are closely interrelated. A sensor that records the temperature of a device as 70 degrees and 1 second later registers 125 degrees is most likely faulty. Such accuracy and consistency rules cannot be generalized; they need to be captured as business rules by the domain experts. Records are then flagged as inaccurate based on the rules captured. Accuracy can then be defined as:

Accuracy = (Number of records not flagged as inaccurate as per business rules) / (Total number of records)
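To make the idea concrete, here is a Python sketch of one such business rule, modelled on the temperature example: a reading is flagged when it changes faster than a maximum rate relative to the last unflagged reading. The threshold of 20 degrees per second and the function names are hypothetical; real rules would come from the domain experts.

```python
def flag_inaccurate(readings, max_rate=20.0):
    """Flag readings that jump faster than `max_rate` degrees per second.

    `readings` is a time-ordered list of (seconds, temperature) pairs.
    Each reading is compared with the last unflagged one, so a single
    spike does not also flag the good reading that follows it.
    Limitation: the first reading is always trusted.
    """
    flags = []
    last_good = None
    for ts, value in readings:
        bad = False
        if last_good is not None:
            dt = ts - last_good[0]
            if dt > 0 and abs(value - last_good[1]) / dt > max_rate:
                bad = True
        flags.append(bad)
        if not bad:
            last_good = (ts, value)
    return flags


def accuracy(flags):
    """Fraction of records not flagged as inaccurate."""
    if not flags:
        return None
    return flags.count(False) / len(flags)


# Hypothetical sample: 70 degrees, then 125 one second later.
readings = [(0, 70.0), (1, 125.0), (2, 71.0), (3, 70.5)]
flags = flag_inaccurate(readings)
print(flags)            # [False, True, False, False]
print(accuracy(flags))  # 0.75
```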

We are piloting this data quality framework for streaming data at multiple implementations. We are also developing a combined DQ-Index, which differs for each implementation depending on the importance of each dimension of data quality. Let us stay alert to the devils of bad data quality!
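Since the DQ-Index is still under development, its formula is not given here. Purely as an illustration of how per-implementation importance could enter, the sketch below uses a weighted average of the four dimension scores; the weighting scheme, the weights and the scores are assumptions, not the actual index.

```python
def dq_index(scores, weights):
    """Weighted average of dimension scores (each between 0 and 1).

    `scores` and `weights` are dicts keyed by dimension name.
    This weighting scheme is an illustrative assumption only.
    """
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(scores[dim] * weights[dim] for dim in scores)
    return weighted / total_weight


# Hypothetical scores and weights for an implementation
# where timeliness matters most.
scores = {
    "timeliness": 0.8,
    "completeness": 0.5,
    "integrity": 0.75,
    "accuracy": 0.75,
}
weights = {
    "timeliness": 4,
    "completeness": 2,
    "integrity": 2,
    "accuracy": 2,
}
print(round(dq_index(scores, weights), 3))  # 0.72
```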




Kuldeep Deshpande