Hadoop Security: Prime Areas To Focus On – Part 2

This is a continuation of my previous article, ‘Hadoop Security: Prime Areas To Focus On’. If you have not read it yet, please go through it first for continuity.

Read it here: http://bit.ly/2uESmCM

In my previous article, I shared our experiences implementing Hadoop security in areas such as user authentication, data encryption and masking. In this article I cover the other areas of Hadoop security that we addressed while implementing it for our customer: user authorization and data governance.

Phase 3 of Hadoop Security Measure 

The borders have been secured with authentication so that only the right people can enter the application, and data has been encrypted and masked wherever required so that nothing is out in the open. Now it is time to manage authorization levels so that users can view and access what they are authorized to, and nothing else.

User authorization is concerned with who or what has access to, or control over, a given resource or service. Hadoop merges the capabilities of multiple, previously separate IT systems into an enterprise data hub that stores and works on all data within an organization, so it requires multiple authorization controls with varying granularities.

We performed the following steps for user authorization:

  1. Tying all users to groups that we had already created in AD.
  2. Providing role-based access control for data access and ingestion, such as batch and interactive SQL queries.


Within user authorization, we carried out the following activities:

  1. The existing files and directories were each assigned an owner and the relevant group. Each assignment carries a basic set of permissions: read, write and execute. For directories, the execute permission determines whether child files and directories can be accessed.
  2. Extended Access Control Lists (ACLs) were also set on HDFS to provide fine-grained control, so that we could set different permissions for specific users or groups on the same file or directory.
  3. We use Apache HBase extensively in our application, so there was a strong requirement to control who can query data through HBase. We set up authorizations for the various operations (READ, WRITE, CREATE, ADMIN) per column family. These authorizations were granted at group level.
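The three activities above map onto a handful of standard HDFS and HBase shell commands. The sketch below only builds the command strings so that they can be reviewed before anyone runs them. The paths, owner, group, table and column family names are hypothetical placeholders, not values from our project.

```python
"""Build (not run) the HDFS and HBase shell commands for the activities above.

All paths, users, groups, tables and column families are hypothetical.
"""


def hdfs_ownership_cmds(path, owner, group, mode):
    """Activity 1: assign an owner and group, then set the basic permission bits."""
    return [
        f"hdfs dfs -chown -R {owner}:{group} {path}",
        f"hdfs dfs -chmod -R {mode} {path}",
    ]


def hdfs_acl_cmd(path, kind, name, perms):
    """Activity 2: add one extended ACL entry such as group:technology:r-x."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    # Each position must be its own flag (r, w, x) or a dash.
    if len(perms) != 3 or any(c not in (flag, "-") for c, flag in zip(perms, "rwx")):
        raise ValueError("perms must look like 'r-x'")
    return f"hdfs dfs -setfacl -m {kind}:{name}:{perms} {path}"


def hbase_group_grant(group, actions, table, column_family):
    """Activity 3: HBase shell grant for a group, scoped to one column family."""
    letters = {"READ": "R", "WRITE": "W", "CREATE": "C", "ADMIN": "A"}
    perms = "".join(letters[action] for action in actions)
    # In the HBase shell, a leading '@' marks the grantee as a group.
    return f"grant '@{group}', '{perms}', '{table}', '{column_family}'"


if __name__ == "__main__":
    for cmd in hdfs_ownership_cmds("/data/customer", "etl", "batch", "750"):
        print(cmd)
    print(hdfs_acl_cmd("/data/customer", "group", "technology", "r-x"))
    print(hbase_group_grant("technology", ["READ"], "customer", "details"))
```

Note that `hdfs dfs -setfacl` only works when ACL support is enabled on the NameNode through the `dfs.namenode.acls.enabled` property.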


We chose Apache Sentry to configure role-based access control, so that the various roles and permissions are managed from one central system.

Apache Sentry is a granular, role-based authorization module for Hadoop. It allows you to define authorization rules to validate a user or application’s access requests for Hadoop resources. Sentry is highly modular and can support authorization for a wide variety of components in Hadoop. Sentry relies on underlying authentication systems, such as Kerberos or LDAP, to identify the user. It also uses the group mapping mechanism configured in Hadoop to ensure that Sentry sees the same group mapping as other components of the Hadoop ecosystem.

We created multiple groups in AD, for example Management, Technology, Batch and a couple of admin groups. We then created roles in Sentry such as ‘Auditor’, ‘Read-Only’ and ‘Cluster Administrator’, based on the kinds of access that were required.

These roles were then assigned to the groups. For example, the Auditor role was assigned to the Management group, which consisted of project managers and architects; the Read-Only role was assigned to the Technology group; and the admin roles were assigned to the admin groups.
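The group-to-role mapping described above can be sketched as a small data model. The role and group names follow the article, but the database names, the server name, the admin group name and the privileges attached to each role are assumptions made purely for illustration. The generated statements use the Sentry SQL syntax that is run through Hive or Impala.

```python
"""Sketch of the group -> role -> privilege mapping described above.

Privileges, database names, 'server1' and 'hadoop_admins' are hypothetical.
"""

# role -> list of (action, scope, object) privileges (all assumed)
ROLE_PRIVILEGES = {
    "auditor": [("SELECT", "DATABASE", "audit_db")],
    "read_only": [("SELECT", "DATABASE", "customer_db")],
    "cluster_administrator": [("ALL", "SERVER", "server1")],
}

# AD group -> roles assigned to it
GROUP_ROLES = {
    "management": ["auditor"],
    "technology": ["read_only"],
    "hadoop_admins": ["cluster_administrator"],
}


def sentry_statements(role_privileges, group_roles):
    """Return the SQL statements that would create and assign the roles."""
    statements = []
    for role, privileges in role_privileges.items():
        statements.append(f"CREATE ROLE {role};")
        for action, scope, obj in privileges:
            statements.append(f"GRANT {action} ON {scope} {obj} TO ROLE {role};")
    for group, roles in group_roles.items():
        for role in roles:
            statements.append(f"GRANT ROLE {role} TO GROUP {group};")
    return statements


def privileges_for(user_groups, role_privileges, group_roles):
    """Resolve a user's effective privileges purely from group membership."""
    effective = set()
    for group in user_groups:
        for role in group_roles.get(group, []):
            effective.update(role_privileges[role])
    return effective


if __name__ == "__main__":
    for statement in sentry_statements(ROLE_PRIVILEGES, GROUP_ROLES):
        print(statement)
    print(privileges_for(["technology"], ROLE_PRIVILEGES, GROUP_ROLES))
```

Because privileges hang off roles and roles hang off groups, onboarding a new user only means adding that user to the right AD group; nothing in the role definitions changes.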

The biggest advantage of role-based access control (RBAC) is that it makes routine changes, such as adding or removing users and granting or revoking rights, easy to manage. We also found that configuring RBAC through Sentry has a big benefit: you do not have to repeat the configuration for each Hadoop component, because Sentry propagates it across the various Hadoop components.


Phase 4 of Hadoop Security Measure

Suddenly one day the customer noticed some data discrepancies and started asking questions like:

What happened to my customer details data? Why is it showing ambiguous data?

Which tables store customer data?

Which sources feed into the customer data?

Which users accessed the files that feed customer data?

Which users queried or modified the tables containing customer data?

What did the user ‘cust_rep1’ do on Saturday the 29th of July?

What operations did the user perform?

There were many more questions like these.


We started wondering why the customer was asking us these questions. We are not employed in that organization, nor are we supposed to know the answers.

It is true that we are not the right people to answer these questions, and the customer should have this information themselves. But the customer is using the application we built for data storage, processing and access, so the important question was: had we empowered the application to answer all these questions for the customer?

The simple answer was ‘NO’: data governance was a major area missing from the application. Fortunately, we discovered this in time, before the application was rolled out to all users in the organization.

So we decided to use Cloudera Navigator and, based on the customer requirements, introduced the following data governance features in our application:

a) Data Auditing

Since the data was very sensitive, the customer wanted to know exactly who was accessing it (Tom, John, Lucy?), what data they were accessing, and how they were using it. The idea is to ensure that the correct governance measures are in place to protect sensitive data, to take proactive or reactive measures based on data usage patterns, and to trace any malicious data modification or access back to its root. We did not have to do much configuration, as Cloudera Navigator already captures audit data for various Hadoop components such as HDFS, Hive, Impala and HBase. We did, however, have to create new roles specifically for data auditing, and these were given to the users at the highest level. Viewing details about an entity in Navigator is as simple as searching for that entity and clicking on it. We also used Navigator to generate a couple of scheduled reports for the customer.
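To make the auditing question concrete, here is a minimal sketch of how ‘what did cust_rep1 do on Saturday the 29th of July?’ can be answered once audit events are available as records. The record layout, the sample events and the year 2017 are all assumptions for illustration; this does not reproduce Navigator's actual export format or API.

```python
"""Filter audit events by user and day (hypothetical record layout)."""
from collections import Counter
from datetime import date, datetime

# Sample events; field names and values are invented for this sketch.
AUDIT_EVENTS = [
    {"time": "2017-07-29T10:15:00", "user": "cust_rep1", "service": "hive",
     "operation": "QUERY", "resource": "customer_db.customer_details"},
    {"time": "2017-07-29T10:42:00", "user": "cust_rep1", "service": "hdfs",
     "operation": "READ", "resource": "/data/customer/part-0000"},
    {"time": "2017-07-29T11:05:00", "user": "tom", "service": "impala",
     "operation": "QUERY", "resource": "customer_db.customer_details"},
    {"time": "2017-07-30T09:00:00", "user": "cust_rep1", "service": "hive",
     "operation": "QUERY", "resource": "customer_db.orders"},
]


def activity_for(events, user, day):
    """Return the events performed by `user` on the calendar date `day`."""
    return [
        event for event in events
        if event["user"] == user
        and datetime.fromisoformat(event["time"]).date() == day
    ]


def operations_summary(events):
    """Count events per (service, operation) pair."""
    return Counter((event["service"], event["operation"]) for event in events)


if __name__ == "__main__":
    saturday = date(2017, 7, 29)  # assumed year; the article gives only day and month
    for event in activity_for(AUDIT_EVENTS, "cust_rep1", saturday):
        print(event["time"], event["service"], event["operation"], event["resource"])
```

The same two functions also cover ‘which users queried the customer tables?’ by filtering on `resource` instead of `user`.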

b) Data Lineage

Data lineage answers the question ‘which sources feed the customers table?’. We used Cloudera Navigator for this, since it provides lineage as built-in functionality for various Hadoop components such as HDFS, Sqoop, Spark and Impala. Both forward and backward lineage are available, right up to the column level, which is amazing. We were able to view lineage for tables as well as queries without any coding or configuration, which was a huge value add for the customer because no technical expertise is needed.
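Conceptually, lineage is a directed graph of ‘feeds into’ relationships, and backward or forward lineage is a traversal of that graph in one direction or the other. The sketch below illustrates the idea with a hand-written edge list; the entity names are hypothetical, and this is not how Navigator stores or exposes lineage.

```python
"""Backward and forward lineage as graph traversal (hypothetical entities)."""
from collections import deque

# Each edge reads "source feeds into target".
LINEAGE_EDGES = [
    ("crm_export.csv", "staging.customer_raw"),
    ("billing_db.accounts", "staging.customer_raw"),
    ("staging.customer_raw", "warehouse.customer"),
    ("warehouse.customer", "reports.customer_summary"),
]


def _adjacency(edges, reverse):
    """Build an adjacency map, flipping edges when walking upstream."""
    graph = {}
    for source, target in edges:
        start, end = (target, source) if reverse else (source, target)
        graph.setdefault(start, []).append(end)
    return graph


def trace(edges, entity, direction):
    """Return every entity upstream ('backward') or downstream ('forward')."""
    if direction not in ("backward", "forward"):
        raise ValueError("direction must be 'backward' or 'forward'")
    graph = _adjacency(edges, reverse=(direction == "backward"))
    seen, queue = set(), deque([entity])
    while queue:  # breadth-first walk over the lineage graph
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


if __name__ == "__main__":
    print(sorted(trace(LINEAGE_EDGES, "warehouse.customer", "backward")))
    print(sorted(trace(LINEAGE_EDGES, "warehouse.customer", "forward")))
```

Backward tracing from the customer table lists everything that feeds it, which is exactly the customer's question; forward tracing shows which downstream reports would be affected by bad data.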

c) Searchable Metadata

To answer questions like ‘which tables store customer data?’, you need to either open the design document or go through the individual tables in Hive. The simplest option, if it were possible, would be to ask Google.

You might be wondering where Google comes into the picture. It is not exactly Google, but Navigator provides tag-based search that finds entities matching a search keyword. We scheduled multiple meetings with the business users to understand the business language as well as some of the functionality, and we took help from the technical teams who managed the various source systems. Based on all this information, we created a simple tagging document and used it to tag entities on the big data platform with meaningful business acronyms. This enabled a simple yet effective metadata search in the application using Navigator.
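The tagging document is essentially a mapping from entities to business terms, and keyword search is a lookup in the inverse of that mapping. The sketch below shows the idea with invented entities and tags; it is an illustration of the concept, not Navigator's search implementation.

```python
"""Tag-based metadata search over a tagging document (hypothetical content)."""

# Entity -> business tags, as it might appear in a tagging document.
ENTITY_TAGS = {
    "warehouse.customer": {"customer", "pii"},
    "warehouse.customer_address": {"customer", "address"},
    "warehouse.orders": {"sales"},
    "/data/landing/crm": {"customer", "source"},
}


def build_tag_index(entity_tags):
    """Invert the tagging document into tag -> set of entities."""
    index = {}
    for entity, tags in entity_tags.items():
        for tag in tags:
            index.setdefault(tag.lower(), set()).add(entity)
    return index


def search(index, keyword):
    """Return the entities tagged with `keyword`, ignoring case."""
    return sorted(index.get(keyword.strip().lower(), set()))


if __name__ == "__main__":
    tag_index = build_tag_index(ENTITY_TAGS)
    print(search(tag_index, "customer"))
```

The quality of the search depends entirely on the tags, which is why the meetings with business users to agree on vocabulary mattered more than any tooling.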


It was a wonderful experience, with lots of learnings from implementing this project. Most importantly, the customer is happy and satisfied, and we have the satisfaction of having delivered exactly what the customer wanted: an application that is heavily utilised and adds value for its users.

I hope that our Hadoop security and governance experience will be helpful to customers as well as individuals who are looking to implement Hadoop security on big data platforms. If you have any questions, please feel free to reach out to me.


References – https://www.cloudera.com/documentation/enterprise/latest/PDF/cloudera-security.pdf


Rohan Karanjawala
I work with Ellicium Solutions pvt ltd as an AVP looking after projects in the big data analytics area, helping clients to stay ahead of the competition and, more importantly, to serve their customers well.