How we solved the challenge of dynamic Hadoop configuration in Talend
We recently helped a leading telecom company build a data lake in Hadoop to support analytics. The use case involved migrating terabytes of data from an Oracle database to Hive using Talend. It involved complicated business rules, as data was sourced from varied systems such as Oracle, ERP applications, and flat files. The objective was to perform the migration using a robust and configurable tool.
Our Talend job design
Our Talend jobs were designed in the following way:
A joblet picked up dynamic parameters from configuration tables. These included all the custom values that would have to be changed between environments.
A parent job (a Talend Standard Job) picked up data from the required sources and wrote it to an intermediate file on HDFS.
A child job (a Spark Batch Job) was then triggered to copy the data from that intermediate file to a Hive managed table in Parquet format.
The challenge of Hadoop Cluster Configuration In Talend
In this entire migration process, the challenge was to customize the cluster configuration. In the absence of such custom configuration, Talend uses its own default Hadoop configuration. This causes problems when used with Hadoop clusters that have been set up differently from what Talend expects.
Hadoop parameters typically differ between environments such as development, pre-production, and production. Examples include keytabs, Kerberos settings, the resource manager, high-availability properties, and failover nodes. Hence, when code moves across environments, these parameters have to be changed manually.
However, we were looking for a solution that would allow the code to be migrated across environments without changing the parameters manually.
The way we temporarily achieved this was by passing the job a .jar file containing the XML configuration files from the cluster, which held the required settings. However, this jar file had to be changed, and the job rebuilt, every time the configuration changed or the job was migrated to a different environment.
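To illustrate the workaround, the jar was simply an archive of the cluster's client configuration files. This is a sketch; the exact set of files depends on the cluster, and the file names below are typical Hadoop client configs assumed for the example:

```shell
# Bundle the cluster's client configuration files into a jar
# that the Talend job loads on its classpath.
# File names are standard Hadoop client configs (assumed for this example).
jar cf hadoop-conf.jar core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml hive-site.xml
```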
This was not suitable for our use case, so we considered different approaches to overcome the problem.
Selecting the best approach to get over the roadblock
After considering many scenarios, we zeroed in on the top three approaches. As explained below, we tried all three.
Approach 1: Changing the Custom Configuration
We considered an approach where different .jar files were used depending on the environment. We created two context groups and could pass --context=Prod or --context=Dev depending on the .jar file we wanted to load. This approach partially worked, but the child job was unable to receive the context group's value dynamically: a dynamic context can be passed to standard jobs, but not to Big Data jobs.
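For standard jobs, the launcher script of a built Talend job accepts the context name, and individual context parameters, on the command line. A hedged sketch (the job name and parameter name below are hypothetical):

```shell
# Run the built standard job with the Prod context group
./LoadToHDFS_run.sh --context=Prod

# Individual context parameters can also be overridden at launch
./LoadToHDFS_run.sh --context=Dev --context_param hive_db=dev_stage
```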
Furthermore, it would involve a rebuild every time we had to change the configuration.
Approach 2: Manually Adding Custom Properties
The cluster configuration contained a total of 226 properties that needed to be customized for the Talend job to migrate data successfully. In this approach, we manually added these properties to the Hadoop components in the Talend job. We were able to change the properties related to High Availability (HA) and Key Management Servers (KMS), which involved 17 properties in total.
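As an example of the kind of properties involved, the HA and KMS settings looked roughly like this in hdfs-site.xml and core-site.xml. The property names follow standard Hadoop conventions; the nameservice ID, hosts, and ports are placeholders:

```xml
<!-- HDFS NameNode high availability (values are placeholders) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- Key Management Server for HDFS encryption (core-site.xml) -->
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://http@kms.example.com:9600/kms</value>
</property>
```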
This fixed the parent job, and we were able to write the file to HDFS. However, the child job failed even after adding all 226 properties! We even discussed this with Talend support, and they were unable to provide a reason or a solution. Hence, this approach was also discarded.
Approach 3: Removing child job and creating an external table
After considering the above two approaches, we went back to the drawing board and put together an innovative approach. It took a lot of interesting discussions with our Big Data experts to find a way that fit the bill.
In this approach, the HDFS location of the intermediate file was used as the directory for an external Hive table. Moving the data from the HDFS file to a managed Hive table was no longer required, so we discarded the Big Data child job. This worked! With this approach, we were able to reduce the build size by 90%, since the additional Hive and Spark libraries were no longer required. It also improved performance, since the second step was eliminated. Even though we did not manage to make the job completely dynamic, we met our functional objective.
Below is the diagram explaining the approach:
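The external-table step can be sketched in HiveQL. This is a minimal illustration, not our exact DDL: the table and column names, the HDFS path, and the assumption that the intermediate file is pipe-delimited text are all placeholders.

```sql
-- Point an external table at the directory where the parent job
-- already wrote the intermediate file; no second copy step is needed.
CREATE EXTERNAL TABLE IF NOT EXISTS stage_customer (
  customer_id   BIGINT,
  customer_name STRING,
  created_at    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/landing/stage_customer';
```

A side benefit of external tables is that dropping the table removes only the Hive metadata, leaving the intermediate file on HDFS untouched.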
Though dynamic configuration is available in Talend, it does not work as expected. Talend needs to make the Hadoop components more configurable for Big Data jobs. That said, we would like to mention that Talend is moving in the right direction in the Big Data landscape.