Testing Methodology for Big Data Projects
Exponential growth in data generation will push organizations to make Big Data the de facto technology in their strategy for deriving value from data. For the same reason, 2018 will be a year of rapid Big Data adoption. Everyone wants to catch that bus! The catch is that the process for testing Big Data applications before deployment is still not fully developed, and that can make it a bumpy ride.
To address the lack of documentation and processes for testing Big Data applications, I tried to define a testing process based on my experience. I have had the good fortune to work on some unique Big Data projects at Ellicium. To keep things easy to follow, I will use the example of a testing process I created while working on an IoT-based project for one of the biggest telecom companies in India.
Our client was struggling to find a tool that could perform a detailed analysis of a huge volume of streaming data. Most of this data is generated when customers use their mobile phones on the network. The client’s aim was to use this analysis to give the marketing department the intelligence to up its marketing game and to improve customers’ experience of using the network.
Following is the high-level architecture of that application:
As you can see in the architecture, there was a variety of independent processes. Hence, it was important to have a separate methodology for testing each process. So, let’s see how I did it!
First and foremost, I categorized the processes based on their behavior. The image below shows the clusters we created.
Based on this categorization and sequence, I created a testing flow. The exact steps we followed are described below.
Step 1: Data Staging Validation
When: Before data lands in Hadoop
Where: Local File System
- Compare the number of files – We used shell scripts to compare the file counts in the source and destination folders by timestamp. The timestamp can be extracted from the filenames, as files were named after their creation time.
- Validate the number of records – We used shell scripts to validate the number of records before and after processing.
- Identify outdated data – We used shell scripts to check whether any outdated data was coming in from the sources. This can happen when a user switches off their internet connection for some time: once the connection is back, the device starts sending previously generated data to the data server. In such scenarios, we identified these files and moved them out of the current process so that they did not unnecessarily consume processing time and space.
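The project used shell scripts for these checks. The snippet below is only a minimal Python sketch of the same idea, not the original scripts. It assumes a hypothetical filename layout of `<prefix>_<YYYYMMDDHHMMSS>.csv`, groups files by creation hour, and uses an illustrative 24-hour cut-off for "outdated" data; the function names and the cut-off are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed (hypothetical) filename layout: <prefix>_<YYYYMMDDHHMMSS>.csv
TS_FORMAT = "%Y%m%d%H%M%S"


def creation_time(filename):
    """Parse the creation timestamp embedded in a filename."""
    stem = filename.rsplit(".", 1)[0]
    return datetime.strptime(stem.rsplit("_", 1)[1], TS_FORMAT)


def files_per_hour(filenames):
    """Count files per creation hour."""
    return Counter(creation_time(f).strftime("%Y-%m-%d %H") for f in filenames)


def count_mismatches(source_files, dest_files):
    """Return {hour: (source_count, dest_count)} for hours whose counts differ."""
    src, dst = files_per_hour(source_files), files_per_hour(dest_files)
    hours = sorted(set(src) | set(dst))
    return {h: (src[h], dst[h]) for h in hours if src[h] != dst[h]}


def split_outdated(filenames, now, max_age=timedelta(hours=24)):
    """Separate files older than max_age from files in the current window."""
    current, outdated = [], []
    for f in filenames:
        (outdated if now - creation_time(f) > max_age else current).append(f)
    return current, outdated
```

In practice the two filename lists would come from listing the source and destination folders, and the files returned as outdated would be moved aside before processing starts.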
Step 2: Data Processing Validation
When: After processing or aggregating data stored in Hadoop
- I fetched a few samples from the aggregated table using a Hive query and compared them with the raw data stored in the raw data table.
- I wrote a shell script that compares the number of records in the source files for a certain period with the data stored in HBase, using a Hive query written in the same shell script.
- There was a step where we split the CSV files for location network vertically, based on row key, as Flume could take only 30 columns at a time whereas we had more than 55. This makes data testing crucial, as it is not only about horizontal data testing but about vertical testing too. I wrote a Hive script that periodically checks, before every aggregation, that there are no NULL columns in HBase.
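To make the vertical check concrete, here is a small in-memory Python sketch. It is not the Hive script from the project: it assumes the row key is the first column, reuses the 30-column limit mentioned above, and treats `None`, an empty string, or the literal `"NULL"` as a missing value. All function names are hypothetical.

```python
MAX_COLUMNS = 30  # per-file column limit described in the article


def split_vertically(header, rows, max_columns=MAX_COLUMNS):
    """Split a table into column chunks, repeating the row key (column 0) in each."""
    step = max_columns - 1  # one slot is reserved for the row key
    chunks = []
    for start in range(1, len(header), step):
        cols = [0] + list(range(start, min(start + step, len(header))))
        chunk_header = [header[c] for c in cols]
        chunk_rows = [[row[c] for c in cols] for row in rows]
        chunks.append((chunk_header, chunk_rows))
    return chunks


def merge_chunks(chunks):
    """Rejoin chunks on the row key, as the target store would see them."""
    merged = {}
    for header, rows in chunks:
        for row in rows:
            merged.setdefault(row[0], {}).update(zip(header[1:], row[1:]))
    return merged


def find_null_columns(merged, expected_columns):
    """Return {row_key: [missing or empty column names]} for incomplete rows."""
    problems = {}
    for key, values in merged.items():
        missing = [c for c in expected_columns if values.get(c) in (None, "", "NULL")]
        if missing:
            problems[key] = missing
    return problems
```

If one chunk of a row fails to load, `find_null_columns` reports every column of that chunk as missing for that row key, which is the kind of vertical gap the periodic check is meant to catch before aggregation.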
Step 3: Reports Validation
- Data Validation: I tested the data in the PHP-based report for different combinations of time periods. The crux of data testing here is that if I did not find data in the report for a specific period, I had to reverse engineer the whole process until I found the stage at which the data was lost. Data can go missing when Impala fetches data for a report, during Hive aggregation, while Flume loads data, or when data pre-processing is done by the scripts. This is strictly a manual effort and requires access to the whole system, and patience.
This was the whole process we followed for testing. Since then, we have used this framework for other Big Data projects.
Though this solved the problem for us, some challenges remain when testing Big Data applications.
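The backward trace itself was manual, but the reasoning can be sketched in a few lines of Python. The example assumes you have already collected, by hand or with queries, a record count per stage for the period in question; the stage labels and the `locate_loss` helper are hypothetical names chosen for illustration.

```python
# Pipeline stages in processing order, as named in the article.
STAGES = [
    "pre-processing scripts",
    "flume load",
    "hive aggregation",
    "impala report query",
]


def locate_loss(record_counts):
    """Walk back from the report and return the stage where the records disappear.

    record_counts maps stage name -> number of records seen for the period.
    Returns None when every stage still has data.
    """
    culprit = None
    for stage in reversed(STAGES):
        if record_counts.get(stage, 0) > 0:
            break  # this stage still has data, so the loss happened after it
        culprit = stage
    return culprit
```

For example, if the pre-processing and Flume stages both show records for the period but Hive aggregation and the Impala query show none, `locate_loss` points at Hive aggregation as the place to investigate first.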
Big Data Testing Challenges
- Expensive to create a test environment
It is a bit expensive to create a test environment for Big Data projects. Given the risks of data handling, it is not advisable to test in the development environment. Hence, separate machines for testing become unavoidable. The cost varies with the volume of data.
- Difficult to obtain right/meaningful testing data
This one can be a real showstopper. At the same time, it is vital to get it right. With the variety and volume of data, it becomes difficult to obtain meaningful test data. To solve this, make sure to find a sample of data that covers all negative test cases. It can be a time-consuming task, but it needs to be done!
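One possible way to build such a sample is a greedy pick: describe each negative case as a small check, then keep choosing the record that covers the most still-uncovered cases. The Python sketch below is an illustration under that assumption, not the approach used in the project; the `pick_covering_sample` name and the record fields in the usage example are hypothetical.

```python
def pick_covering_sample(records, checks):
    """Greedily pick records until every negative case found in the data is covered.

    checks maps a case name -> predicate(record) that is True when the
    record exhibits that negative case.
    """
    # For each record, the set of negative cases it exhibits.
    hits = [{name for name, check in checks.items() if check(r)} for r in records]
    uncovered = set().union(*hits)
    sample = []
    while uncovered:
        # Choose the record covering the most cases that are still uncovered.
        best = max(range(len(records)), key=lambda i: len(hits[i] & uncovered))
        sample.append(records[best])
        uncovered -= hits[best]
    return sample


# Usage with made-up records and checks (field names are illustrative only).
records = [
    {"msisdn": "", "bytes": 10},
    {"msisdn": "9100000001", "bytes": -5},
    {"msisdn": "", "bytes": -1},
    {"msisdn": "9100000002", "bytes": 20},
]
checks = {
    "missing_msisdn": lambda r: r["msisdn"] == "",
    "negative_bytes": lambda r: r["bytes"] < 0,
}
sample = pick_covering_sample(records, checks)
```

A negative case that never occurs in the available data cannot be covered this way, so comparing the covered cases against the full list of checks shows which cases still need hand-crafted records.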
I hope this article helps those who are looking for a testing methodology for Big Data applications.
Have you followed some other process? I would love to hear from you. Do share your story with me here: firstname.lastname@example.org