Re-platforming to Hadoop helped a leading US University reduce data management costs by 50%

Industry

Education

Our Expertise

Big Data, Platform Reengineering

Key Benefits

Easier data processing due to simplified code base
50% cost saving in data management
Robust and scalable data management platform

Technologies Used

The Solution

Ellicium Big Data architects chose Hadoop to be the new platform for the Particle Data Sets, due to the volume of the data and need for scalability. To achieve cost efficiency, Cloudera’s Distribution of Hadoop (CDH) – Express Version was finalised. Data had to be analysed using SQL queries, so Spark was chosen as the processing tool. Cluster sizing for the CDH cluster was done using Ellicum’s Cluster Sizing Tool. Ellicium provisioned an on-premise Hadoop cluster and loaded the particle datasets into the cluster.

Analysis of the data was done and the existing code base was also reverse engineered. Complex SQL server based scientific processing queries were migrated to Python and Spark on Hadoop. This included parsing binary data using PySpark and stored in a managed Hive table with more than billion rows.

Summary

Our client is a leading private research university in the U.S. It has revolutionized higher education by integrating teaching and research. The university was looking for an alternate solution for better managing their humungous data. The solution had to be scalable and cost effective.

The Astrophysics department of the university was conducting research on large datasets of particle physics data. Data was held on SQL Server and Python programs and SQL queries were used to process the data. Data was also made available to other universities and government institutions for their analysis.

Ellicium facilitated adoption of Hadoop and Spark for data storage and processing, thereby reducing data management costs by over 50%.

Challenges

Challenges with previous approach

Managing several Terabytes of data on a single SQL Server database was becoming unmanageable
Data had to be split into snapshots but that made the data processing logic complex
Getting a single consolidated view of the data was becoming difficult

The Results

About 50% reduction in the data management costs
Hadoop made the new platform robust and scalable
Using Cloudera’s Express version made the solution cost effective
Managing the processing engine became easy since logic related to snapshots was eliminated