Recently we worked on an interesting project for one of our clients, a leading specialty insurance company. The client needed curated information for better decision-making and analytics, drawn from large amounts of policy and claims data.
This business-critical data was stored across multiple source systems, which made extracting actionable business insights a complex and time-consuming process. In such cases, it becomes imperative for businesses to have a homogeneous data system that serves as a single source of truth. We proposed creating a data lake to act as that single source of truth.
Using Sqoop, we moved all the data from our client's sources into Hadoop Hive tables. We then validated the ingested data with Spark, applying a set of business and validation rules. Handling data at such a large scale, we faced many challenges, especially while loading the data into the data lake.
- We had more than 200 Hive tables, each with 500+ columns.
- Multiple tables needed to be read and loaded into a single table.
- The volume of data we were fetching was huge, on the order of tens of terabytes.
- Processing and analyzing the data involved complex queries: each query involved 15-20 table joins, a couple of sub-queries, and business logic.
Impact of these Challenges:
- We observed increased data loading time due to the vast amounts of data generated by the client; queries used to take 2-3 hours to execute.
- The increased processing time to load data into tables sometimes caused queries to fail with timeout errors.
- Managing the data was of the utmost priority, which meant checking and rechecking every join. An error in a single join could cause a query to get stuck and become extremely difficult to debug.
Data processing was taking an excessive amount of time, and with stringent timelines we could not afford to spend any more time on it. We needed to simplify things and, at the same time, speed up the data processing tasks. We had multiple technical discussions and tried a variety of approaches; some worked, some did not.
I would like to walk you through a couple of things that helped us achieve our goals.
Things that worked for us were –
1) Using Spark Engine –
Processing and loading the large quantity of data took a long time. Loading the data into Hive using the MapReduce engine proved very slow and tedious, so we decided to switch to the Spark engine. Using the Spark engine cut our loading time from a few hours to 20-40 minutes.
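For Hive deployments that support Hive-on-Spark, switching engines can be as simple as a session-level setting. A minimal sketch (the executor values shown are illustrative, not the ones we used):

```sql
-- Switch Hive's execution engine from MapReduce to Spark for this session.
SET hive.execution.engine=spark;

-- Optionally tune Spark executor resources for heavy joins (illustrative values).
SET spark.executor.memory=8g;
SET spark.executor.cores=4;
```

The same settings can be made cluster-wide in hive-site.xml instead of per session.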
2) Using Intermediate stages –
The data loaded into the Hive target tables was populated from multiple source tables, which took an immense amount of time, so we introduced intermediate (staging) tables. Staging the heavy joins first simplifies loading data into the target tables.
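The idea can be sketched in HiveQL as follows. The table and column names here (`policies`, `claims`, `staging_policy_claims`, `curated_policy_claims`) are hypothetical, for illustration only:

```sql
-- Stage the expensive multi-table join once...
CREATE TABLE staging_policy_claims AS
SELECT p.policy_id, p.effective_date, c.claim_id, c.claim_amount
FROM policies p
JOIN claims c ON c.policy_id = p.policy_id;

-- ...then load the target from the pre-joined staging table,
-- instead of repeating the join in every target load.
INSERT OVERWRITE TABLE curated_policy_claims
SELECT policy_id, effective_date, claim_id, SUM(claim_amount)
FROM staging_policy_claims
GROUP BY policy_id, effective_date, claim_id;
```

Besides speed, this also makes debugging easier: if a load fails, you can inspect the staging table instead of re-running a 15-20 table join.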
3) Selecting only required columns –
Our source tables had numerous columns, many exceeding 500. This forced us to be more selective and concise when loading data from source to target Hive tables. We used sub-queries to pick only the required and essential columns from the source table to load into the Hive table.
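A minimal sketch of that column pruning, again with hypothetical table and column names:

```sql
-- source_policies has 500+ columns; only the three needed downstream are carried forward.
INSERT OVERWRITE TABLE target_policies
SELECT t.policy_id, t.policy_status, t.premium
FROM (
  SELECT policy_id, policy_status, premium
  FROM source_policies
) t;
```

With a columnar format such as Parquet, projecting only the needed columns also means the engine reads far less data from disk.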
4) Using an optimized storage format –
We used the Parquet file format to store data in tables. Parquet stores nested data structures in a flat columnar format, which in our case made storage and query performance roughly 3x more efficient.
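In Hive, storing a table as Parquet is a one-line change at table creation time. A sketch, with a hypothetical table name and Snappy compression shown as one common choice:

```sql
-- Create a Parquet-backed copy of an existing table, compressed with Snappy.
CREATE TABLE claims_parquet
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS
SELECT * FROM claims;
```

Because Parquet is columnar, queries that touch a handful of columns out of 500+ skip the rest entirely, which compounds with the column-pruning technique above.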
This was an interesting use case to tackle. It underlines the importance of building a robust data lake to get actionable insights on the go. There will always be roadblocks and obstacles during the implementation of any kind of solution, but the key lies in:
- Keeping your priorities clear on what to achieve
- Team discussions
- Practically trying out various approaches rather than relying on pure theory
- Some things from the old school of thought always help