Enemy #1 for Hadoop adoption – Bad Data Quality
“Fast is fine but accuracy is everything” – Wyatt Earp
What Wyatt Earp said about gunfights in the Wild West applies, strangely enough, to the Big Data / Hadoop / IoT movement, which is equally wild! In its 2015 Big Data market update, Forbes reported that data quality is the biggest problem in developing big data applications.
In fact, Poor Data Quality is the biggest challenge to Hadoop adoption at Production level.
Initial enthusiasm in enterprises about Big Data / Hadoop / IoT / M2M etc. fizzles out quickly as stakeholders get frustrated with the poor quality of the data and of the resulting analytics.
In my last 3 implementations of streaming data and IoT on the HBase platform, I observed a consistent pattern of data quality issues. We implemented a data quality measurement framework for streaming data, which I will describe in this blog. The framework is built on the dimensions of data quality defined by DAMA (Data Management Association).
1. Timeliness –
In the case of streaming data, timeliness assumes the highest significance. A particular sensor may stop sending data for a few hours and then suddenly send a backlog of data. This results in late arrival of data. Another case we observed is 'out of sequence data'. In one IoT implementation, sensors stopped transmitting data for a few minutes; once the issue was resolved, the latest data was sent first and the older data was sent afterwards.
Timeliness of streaming data can be measured as follows:
Timeliness = (Number of records arriving within a given time limit, e.g. 1 minute) / (Total number of records generated in a given timeframe)
Timeliness is usually a post facto measurement, since the total number of records generated can be counted only after those records have actually been loaded into the database. We measure timeliness by periodically running a script that compares the timestamp at which each record was generated at the sensor end with the timestamp at which it was loaded into the database.
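A minimal sketch of such a periodic check is shown here in Python. It assumes each record is available as a pair of timestamps (generated at the sensor, loaded into the database); the function names and the 1-minute default are illustrative, not taken from our implementation. It also counts 'out of sequence' records, meaning records loaded after a record that was generated later.

```python
from datetime import datetime, timedelta


def timeliness(records, limit=timedelta(minutes=1)):
    """Fraction of records loaded within `limit` of being generated.

    `records` is an iterable of (generated_at, loaded_at) datetime pairs.
    Returns None when there are no records to measure.
    """
    total = 0
    on_time = 0
    for generated_at, loaded_at in records:
        total += 1
        if loaded_at - generated_at <= limit:
            on_time += 1
    return on_time / total if total else None


def out_of_sequence_count(records):
    """Count records loaded after a record with a later generation time."""
    count = 0
    latest_generated = None
    # Walk the records in the order in which they reached the database.
    for generated_at, _loaded_at in sorted(records, key=lambda r: r[1]):
        if latest_generated is not None and generated_at < latest_generated:
            count += 1
        else:
            latest_generated = generated_at
    return count


t0 = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    (t0, t0 + timedelta(seconds=10)),
    (t0 + timedelta(seconds=1), t0 + timedelta(seconds=20)),
    (t0 + timedelta(seconds=2), t0 + timedelta(hours=3)),  # backlog record
    (t0 + timedelta(seconds=3), t0 + timedelta(seconds=30)),
]
print(timeliness(sample), out_of_sequence_count(sample))
```

In this sample, three of the four records arrive within a minute, and the backlog record is loaded after a record that was generated later.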
2. Completeness –
Data completeness can be defined as the availability of the mandatory elements of data. Some IoT systems transfer data in the form of text files that contain the data accumulated at the sensor end over a short period (say 1 minute). To ensure completeness of data, a header and a footer are added to each file. Incomplete transmittal results in files arriving at the server without a header or a footer, which indicates that the data in the file in question may be incomplete. We have measured completeness for streaming data as follows:
Completeness = (Number of files received complete on the server) / (Total number of files received on the server)
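The check can be sketched in a few lines of Python. The header and footer markers used here ("HDR" and "FTR") are hypothetical, since the actual markers depend on the file format of each implementation; the sketch only illustrates the ratio above.

```python
HEADER_PREFIX = "HDR"  # hypothetical header marker
FOOTER_PREFIX = "FTR"  # hypothetical footer marker


def is_complete(text):
    """A file counts as complete if it starts with a header and ends with a footer."""
    lines = [line for line in text.splitlines() if line.strip()]
    return (
        len(lines) >= 2
        and lines[0].startswith(HEADER_PREFIX)
        and lines[-1].startswith(FOOTER_PREFIX)
    )


def completeness(file_contents):
    """Fraction of received files that are complete, or None if none were received."""
    contents = list(file_contents)
    if not contents:
        return None
    return sum(is_complete(text) for text in contents) / len(contents)


received = [
    "HDR|sensor-1\n21.5\n21.7\nFTR|2",  # complete
    "HDR|sensor-2\n22.0",               # footer missing
    "21.0\nFTR|1",                      # header missing
    "HDR|sensor-3\nFTR|0",              # complete, no readings
]
print(completeness(received))
```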
3. Integrity –
Data that is missing important linkages to other elements of data (mainly the master data) has integrity issues. In multiple IoT implementations, we observed that the master data about the devices has to be configured on the server first, and only then can the devices start sending data. In a telecom IoT project, a handset had to be registered before it could send streaming data to the server for tracking user experience and network performance. As usual, there were lapses in configuring the device master data, and devices started sending data even before configuration was done. This resulted in an "orphan sensor data" condition. We measured this Data Integrity as:
Integrity = 1 - (Number of orphan or un-configured devices) / (Total number of devices in the installation)
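As a rough illustration, the measure can be computed from two sets of device IDs: those configured in the master data and those actually seen sending data. The sketch below assumes that the total number of devices in the installation is the union of the two sets; the function name and the sample IDs are made up for the example.

```python
def integrity(configured_ids, sending_ids):
    """Return (integrity score, set of orphan device IDs).

    Orphans are devices that send data without being configured in the
    master data. The total is assumed to be every device that is either
    configured or sending. The score is None when there are no devices.
    """
    configured = set(configured_ids)
    sending = set(sending_ids)
    orphans = sending - configured
    total = configured | sending
    if not total:
        return None, orphans
    return 1 - len(orphans) / len(total), orphans


score, orphans = integrity(
    configured_ids=["dev-a", "dev-b", "dev-c"],
    sending_ids=["dev-b", "dev-c", "dev-d"],
)
print(score, orphans)
```

Returning the orphan IDs along with the score makes it easy to hand the list to whoever maintains the device master data.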
4. Accuracy and Consistency –
Traditional data quality literature defines accuracy and consistency as two separate dimensions. However, in streaming / sensor / IoT data, the two are closely interrelated. A sensor that records the temperature of a device as 70 degrees and one second later registers it as 125 degrees is most likely faulty. Accuracy and consistency rules of this kind cannot be generalized; they need to be captured as business rules by the domain experts. Records are then flagged as inaccurate based on these rules. Accuracy can be defined as:
Accuracy = (Number of records not flagged as inaccurate as per business rules) / (Total number of records)
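To make this concrete, here is a small Python sketch built around one hypothetical business rule: a reading is flagged when it changes faster than a maximum rate relative to the last accepted reading. The threshold of 10 degrees per second is an assumption for illustration only; real rules have to come from the domain experts.

```python
def flag_inaccurate(readings, max_change_per_second=10.0):
    """Flag readings that change implausibly fast.

    `readings` is a list of (seconds, value) pairs for one sensor, sorted
    by time. Each reading is compared with the last reading that was NOT
    flagged, so a single spike does not cause the next good reading to be
    flagged as well. Returns a list of booleans (True = inaccurate).
    """
    flags = []
    last_good = None
    for seconds, value in readings:
        flagged = False
        if last_good is not None:
            elapsed = seconds - last_good[0]
            if elapsed > 0 and abs(value - last_good[1]) / elapsed > max_change_per_second:
                flagged = True
        flags.append(flagged)
        if not flagged:
            last_good = (seconds, value)
    return flags


def accuracy(flags):
    """Fraction of records not flagged as inaccurate, or None if there are none."""
    return flags.count(False) / len(flags) if flags else None


temperature = [(0, 70.0), (1, 125.0), (2, 71.0), (3, 72.0)]
flags = flag_inaccurate(temperature)
print(flags, accuracy(flags))
```

Only the 125-degree spike is flagged in this sample; the readings that follow it are compared with the last accepted value of 70 degrees.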
We are piloting this data quality framework for streaming data at multiple implementations. We are also developing a combined DQ-Index, which differs for each implementation depending on the importance of each dimension of data quality. Let us stay awake to the devils of bad data quality!