How I saved 200 hours of Ellicium’s Recruitment Team Using Hadoop!

Posted by Chaitali Sonparote on June 5, 2018 in Blog

Ellicium is growing at a fast pace. As in every growing organization, the talent hunt has become the highest-priority task.

When I joined Ellicium, it was still in its initial phase, and we were finding our place in the Big Data market. I have witnessed, and been part of, an Ellicium team with the zeal and passion to make a dent in the universe of Big Data. After delivering some unique projects, Ellicium’s team size has increased 300% along the way. One of the biggest challenges in this growth story has been finding the right candidate for each position. Many times, I have seen the recruitment team struggle to shortlist candidates from a sea of applications.


Recently we had several urgent open positions that we wanted to fill through a recruitment drive. To our surprise, we received a few hundred resumes. The typical manual approach to shortlisting them would have cost us more than 200 human hours, and we wanted to find the right candidates fast. While the human resources team was mulling over how to pull off this gigantic task, I stepped in!

In one of our Big Data implementations, a client had a similar requirement: they were struggling to quickly find business-critical data in a huge amount of unstructured data stored as files. For that implementation, I used Apache Tika, Apache Solr and Cloudera Search, together with HDFS from the Hadoop framework. I reused the same stack for resume shortlisting.

Below is the architecture diagram of our approach:

[Diagram: Resume Sorting Automation]

Details of Resume Shortlisting Using Big Data Technologies:

Our recruitment team had stored the received resumes on local machines. First, I migrated all these files from the local machines to HDFS. Then I used Apache Tika to convert the resumes into text files and to extract metadata from them. This is a vital preparatory stage, because the information gathered here becomes the input for the rest of the process.
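
The two preparatory steps can be sketched as plain command lines. This is only a minimal illustration, not the exact setup we ran: it assumes the `hdfs` client is on the PATH, that a Tika application jar is available under the hypothetical name `tika-app.jar`, and that Tika reads the local copy of each resume. The target directory `/data/resumes` used in the note below is a made-up path as well.

```python
from pathlib import Path


def hdfs_put_command(local_file, hdfs_dir):
    """Build the argv that copies one local file into an HDFS directory."""
    return ["hdfs", "dfs", "-put", str(local_file), hdfs_dir]


def tika_commands(local_file, tika_jar="tika-app.jar"):
    """Build the argv lists for Tika's text and JSON-metadata output modes."""
    # tika_jar is a hypothetical file name; point it at your own Tika app jar.
    base = ["java", "-jar", tika_jar]
    return {
        "text": base + ["--text", str(local_file)],
        "metadata": base + ["--json", str(local_file)],
    }


def ingest_plan(resume_dir, hdfs_dir):
    """List the commands for every file found directly under resume_dir."""
    plan = []
    for path in sorted(Path(resume_dir).iterdir()):
        if path.is_file():
            plan.append(hdfs_put_command(path, hdfs_dir))
            plan.extend(tika_commands(path).values())
    return plan
```

Each entry returned by `ingest_plan("resumes", "/data/resumes")` is an argument list that could be handed to `subprocess.run`; the sketch only builds the commands and does not execute anything.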

Once I had the metadata for all files, I used Apache Solr to index them, as shown in the diagram. Apache Solr offers neat and flexible search features. For indexing, the required parameters from the extracted metadata have to be passed to Apache Solr. These parameters include the file name, file id, file size, author, creation date and last-modified date.
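
To make the indexing step concrete, here is a hedged sketch of how one resume's metadata and text could be packaged for Solr's JSON update handler. The field names (`id`, `file_name`, `file_size`, `author`, `created`, `last_modified`, `content`) are illustrative assumptions, not our actual schema, and the Solr URL and collection name used in the note below are hypothetical too.

```python
import json
import urllib.request

# Hypothetical schema field names; a real Solr schema may use different ones.
INDEXED_FIELDS = ("id", "file_name", "file_size", "author", "created", "last_modified")


def to_solr_doc(metadata, text):
    """Keep only the indexed metadata fields and attach the extracted text."""
    doc = {field: metadata[field] for field in INDEXED_FIELDS if field in metadata}
    doc["content"] = text
    return doc


def build_update_request(solr_url, collection, docs):
    """Prepare (but do not send) a JSON update request for a Solr collection."""
    return urllib.request.Request(
        url=f"{solr_url}/{collection}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of `build_update_request("http://localhost:8983/solr", "resumes", docs)` to `urllib.request.urlopen` would send the documents; the sketch stops before any network call.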

All this speeds up file search for a specific query. To make search friendlier for non-technical users, we used Cloudera Search. It comes with a more presentable GUI, where you can easily enter your query and get the results. In our case, it helped us shortlist resumes with the required skill sets. For example, let’s say I want to find resumes with Java skills. I just type the keyword into the Cloudera Search text box and get all the relevant resumes, ready for direct download.

Picture 1: All Resumes


Picture 2: Resume Shortlisted For Java
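
Since Cloudera Search is built on Solr, a keyword search like the Java example above boils down to a Solr query. The sketch below shows roughly what such a request could look like; it assumes the extracted resume text lives in a hypothetical `content` field, and the collection name in the note is made up. Keywords are used as-is, without escaping Solr's special characters.

```python
from urllib.parse import urlencode


def build_search_url(solr_url, collection, keywords, rows=50):
    """Build a Solr /select URL matching resumes that contain every keyword."""
    # "content", "id" and "file_name" are assumed field names.
    query = " AND ".join(f"content:{word}" for word in keywords)
    params = {"q": query, "fl": "id,file_name", "rows": rows, "wt": "json"}
    return f"{solr_url}/{collection}/select?{urlencode(params)}"
```

For instance, `build_search_url("http://localhost:8983/solr", "resumes", ["java"])` produces a URL that asks Solr for the ids and file names of resumes mentioning Java.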

Going one step further, we used the HUE GUI to create dashboards for analytics on the received resumes. For example, suppose I want to know how many of the received resumes mention Python as a skill. I can use the pie chart utility in HUE to visualize it.
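
HUE builds such charts through its GUI, so no code is needed for them. Still, the numbers behind a skills pie chart can be thought of as Solr facet counts. The sketch below is an assumption about how that could be expressed, not what HUE does internally: it counts resumes per skill with `facet.query` and converts the counts into shares of all resumes. The `content` field name is hypothetical.

```python
from urllib.parse import urlencode


def build_skill_facet_url(solr_url, collection, skills):
    """Build a Solr URL that counts matching resumes per skill, returning no rows."""
    params = [("q", "*:*"), ("rows", 0), ("wt", "json"), ("facet", "true")]
    # One facet.query per skill; "content" is an assumed field name.
    params += [("facet.query", f"content:{skill}") for skill in skills]
    return f"{solr_url}/{collection}/select?{urlencode(params)}"


def skill_shares(response):
    """Turn a parsed Solr JSON response into {skill: fraction of all resumes}."""
    total = response["response"]["numFound"]
    counts = response["facet_counts"]["facet_queries"]
    return {
        query.split(":", 1)[1]: (count / total if total else 0.0)
        for query, count in counts.items()
    }
```

The shares returned by `skill_shares` are the slice sizes a pie chart would show. Because one resume can mention several skills, the slices need not add up to 1.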

Compared with the manual method, resume shortlisting became a one-click task with this approach. We saved almost 200 human working hours. Above all, it made our recruitment process fast! New Ellicians will join us soon.

Given the exponential growth of unstructured data, this approach has many possible applications. Whether it is searching inventory documents in the retail industry or finding the medical history of a specific patient in a huge number of reports, this approach is a life saver!

This Resume Shortlisting implementation will soon be a part of Ellicium’s Gadfly platform. ‘Gadfly’ is an analytics platform for unstructured data.