The Big Data Success Story
Re-platforming to Hadoop helped a leading US university reduce data management costs by 50%
Big Data, Platform Reengineering
- Easier data processing due to simplified code base
- 50% cost saving in data management
- Robust and scalable data management platform
Our client is a leading private research university in the U.S. It has revolutionized higher education by integrating teaching and research. The university was looking for an alternative solution to better manage its very large volume of data. The solution had to be scalable and cost-effective.
The Astrophysics department of the university was conducting research on large datasets of particle physics data. The data was held on SQL Server, and Python programs and SQL queries were used to process it. The data was also made available to other universities and government institutions for their own analysis.
Ellicium facilitated adoption of Hadoop and Spark for data storage and processing, thereby reducing data management costs by over 50%.
Challenges with previous approach
- Several terabytes of data on a single SQL Server database were becoming unmanageable
- Data had to be split into snapshots, which made the data processing logic complex
- Getting a single consolidated view of the data was becoming difficult
Ellicium Big Data architects chose Hadoop as the new platform for the particle datasets because of the volume of the data and the need for scalability. To achieve cost efficiency, the Express version of Cloudera’s Distribution of Hadoop (CDH) was selected. Data had to be analysed using SQL queries, so Spark was chosen as the processing tool. The CDH cluster was sized using Ellicium’s Cluster Sizing Tool. Ellicium provisioned an on-premises Hadoop cluster and loaded the particle datasets into it.
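To give a feel for what cluster sizing involves, here is a minimal back-of-the-envelope sketch. It is not Ellicium’s Cluster Sizing Tool, whose method is not described here. It assumes HDFS's default replication factor of 3, and the temporary-space reserve and per-node disk capacity are hypothetical placeholder values.

```python
import math


def estimate_data_nodes(raw_tb, replication=3, temp_overhead=0.25,
                        disk_per_node_tb=24.0):
    """Roughly estimate how many HDFS data nodes are needed for raw_tb of data.

    All defaults except the replication factor are illustrative assumptions.
    """
    # HDFS stores every block `replication` times across the cluster.
    replicated_tb = raw_tb * replication
    # Keep a fraction of each node's disk free for intermediate and temporary files.
    usable_per_node_tb = disk_per_node_tb * (1 - temp_overhead)
    # Round up: a partially filled node is still a node.
    return math.ceil(replicated_tb / usable_per_node_tb)
```

For example, `estimate_data_nodes(10)` sizes a cluster for 10 TB of raw data under these assumptions. A real sizing exercise would also account for data growth, compression, and the CPU and memory needs of the Spark jobs.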
The data was analysed and the existing code base was reverse engineered. Complex SQL Server based scientific processing queries were migrated to Python and Spark on Hadoop. This included parsing binary data with PySpark and storing it in a managed Hive table with more than a billion rows.
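The binary parsing step might look roughly like the sketch below. The actual record layout and table schema are not given here, so the fixed-width format (one 64-bit integer and two doubles), the column names, and the function names are all hypothetical. The PySpark part is wrapped in a function so that the parser can be tried on its own without a cluster.

```python
import struct

# Hypothetical layout: little-endian int64 event id followed by two float64 values.
RECORD = struct.Struct("<qdd")


def parse_record(raw):
    """Decode one fixed-width binary record into a (event_id, energy, momentum) tuple."""
    return RECORD.unpack(raw)


def load_into_hive(path, table_name):
    """Parse fixed-width binary files with PySpark and append them to a Hive table.

    Requires PySpark and a Hive-enabled Spark installation; imported lazily so
    the parser above works without them.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # binaryRecords splits each file into byte strings of exactly RECORD.size bytes.
    rows = spark.sparkContext.binaryRecords(path, RECORD.size).map(parse_record)
    df = spark.createDataFrame(rows, schema=["event_id", "energy", "momentum"])
    # With no explicit path, saveAsTable writes a managed table.
    df.write.mode("append").saveAsTable(table_name)
```

Once the data is in a Hive table, the same SparkSession can run SQL over it with `spark.sql(...)`, which matches the requirement that the data be analysed using SQL queries.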
- About 50% reduction in data management costs
- Hadoop made the new platform robust and scalable
- Using Cloudera’s Express version made the solution cost-effective
- Managing the processing engine became easier because the snapshot-related logic was eliminated