The Data Lake And Cloud Success Story
Data Lake Enables Seamless Analytics of Legacy Data for a Healthcare Organization
Data Lake and Cloud
- The enterprise data lake enabled downstream users to capitalize on the value in data by bringing together internal and external datasets in a single place
- Easy and fast access to a variety of data
- Capability to ingest and process any type of data: structured, semi-structured, and unstructured
Our client is one of the largest health care organizations in the United States. It has a few nationally recognized academic hospitals. The organization is working to modernize its IT systems to accommodate its growing needs and save costs.
The organization had crucial, confidential data stored in silos across 55 data sources, such as IBM DashDB, Oracle DB, and SQL Server, in structured, semi-structured, and unstructured formats. It was critical to bring all this data into one place so that data consumers and stakeholders across the organization could gain better insights, make better decisions, and run analytics.
Ellicium implemented the data lake on Azure Cloud after analyzing the customer's business requirements and pain points.
The existing systems posed the following challenges:
- Complex data structures in legacy source systems made self-service reporting impossible
- Variety of data sources
- High data volume and complex business logic
The solution required us to create a data ecosystem (a data lake and a data pipeline) that would serve as a central location giving business users and service groups access to core data domains and to critical data about patients and clinical trial results. Based on the business and technical requirements, we envisioned a cloud-based platform that would accommodate the types of data and compute most relevant to the business. The platform comprises a data pipeline that ingests and transforms data into the data lake, which acts as the central repository; discovery through a data catalog; and data access methods that support disparate needs, integrated with a cloud computing environment for applications and analytics.
Microsoft Azure was chosen as the cloud provider, and the solution was delivered using the agile methodology.
At a high level, the main tasks consisted of:
- a) Building and supporting the Azure data lake platform
- b) Master data management, metadata management, and data quality
- c) Data modelling
- d) ETL ingestion pipeline using Talend
- e) Ingesting the cleansed, validated, transformed, and normalised data into the data lake
- f) Data warehouse creation on Snowflake
- g) Security implementation
- h) Continuous integration/continuous delivery (CI/CD) pipeline
- i) Azure cost optimization
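As an illustration only, the cleanse, validate, and normalise steps behind tasks d) and e) can be sketched in plain Python. The project built its pipeline in Talend, so this is not the actual implementation. The field names, the required-field rule, and the MM/DD/YYYY source date format are all hypothetical assumptions made for the example.

```python
from datetime import datetime

# Hypothetical required fields; the real schema is not given in the case study.
REQUIRED_FIELDS = ("patient_id", "admitted_on")


def cleanse(record):
    """Lower-case and underscore the keys, trim strings, turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        name = key.strip().lower().replace(" ", "_")
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[name] = value
    return cleaned


def validate(record):
    """Return a list of problems; an empty list means the record passed."""
    return [f"missing {field}" for field in REQUIRED_FIELDS if record.get(field) is None]


def normalise(record):
    """Convert the admission date from MM/DD/YYYY (assumed source format) to ISO 8601."""
    out = dict(record)
    out["admitted_on"] = (
        datetime.strptime(record["admitted_on"], "%m/%d/%Y").date().isoformat()
    )
    return out


def run_pipeline(raw_records):
    """Split raw records into lake-ready rows and rejected rows with reasons."""
    accepted, rejected = [], []
    for raw in raw_records:
        record = cleanse(raw)
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(normalise(record))
    return accepted, rejected


if __name__ == "__main__":
    sample = [
        {"Patient ID": " P001 ", "Admitted On": "03/15/2021"},
        {"Patient ID": "", "Admitted On": "04/01/2021"},
    ]
    accepted, rejected = run_pipeline(sample)
    print(accepted)
    print(rejected)
```

Keeping rejected records alongside the reason for rejection, rather than dropping them silently, is one way to feed the data quality work listed in task b).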
- An enterprise data lake was created and made available 24/7 to downstream users.
- This let them access enterprise-wide data in a single place rather than depending on various internal and external systems to provide it.
- Downstream users can perform analysis at two levels: through data cataloging services on Azure, and through the data warehouse on Snowflake for faster analytics.