Hadoop Security: Prime Areas To Focus On – Part 2

This is a continuation of my previous article, 'Hadoop Security: Prime Areas To Focus On'. If you have not read it yet, please go through it first for continuity.

Read it here: http://bit.ly/2uESmCM

In my previous article, I shared our experience implementing Hadoop security in areas such as user authentication and data encryption and masking. In this article, I cover other areas of Hadoop security, namely user authorization and data governance, as we implemented them for our customer.

Phase 3 of Hadoop Security Measures

The borders have been secured and authentication is in place to make sure the right people can enter the application. The data has been secured and masked wherever required, so that nothing is out in the open. Now it's time to manage authorization levels, to make sure users can view and access only what they are authorized to and nothing else.

User authorization is concerned with who or what has access or control over a given resource or service. Since Hadoop merges the capabilities of multiple, varied, and previously separate IT systems/components into an enterprise data hub that stores and works on all data within an organization, it requires multiple authorization controls with varying granularities.

We performed the following steps for user authorization:

  1. Tying all users to groups, which we had already created in the AD directories.
  2. Providing role-based access control for data access and ingestion, like batch and interactive SQL queries.


With respect to user authorization, we performed the following activities:

  1. The existing files and directories were assigned to the appropriate groups, and an owner was assigned to each. Each assignment has a basic set of permissions: file permissions are simply read, write, and execute, and for directories, the execute permission additionally determines access to child directories.
  2. Extended Access Control Lists (ACLs) were also set on HDFS to provide fine-grained control of permissions for HDFS files, allowing us to set different permissions for specific users or groups.
  3. We extensively use Apache HBase in our application, and hence there was a strong requirement to control who can query data using HBase. We set up authorizations for the various operations (READ, WRITE, CREATE, ADMIN) at the column-family level. These authorizations were set up at the group level.
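To make the evaluation order concrete, here is a minimal Python sketch of how an HDFS-style permission check combines owner bits, named-user and named-group ACL entries, group bits, and 'other' bits. All user, group, and file names are hypothetical, and this is a simplified illustration only; actual enforcement is done by the HDFS NameNode.

```python
from collections import namedtuple

# Simplified model of an HDFS file's permission metadata.
# Permission strings use the familiar "rwx" convention.
FileStatus = namedtuple(
    "FileStatus", "owner group owner_perms group_perms other_perms acl"
)

def is_allowed(status, user, user_groups, action):
    """Check one action ('r', 'w' or 'x') against owner bits,
    named-user/group ACL entries, group bits, then 'other' bits."""
    if user == status.owner:
        return action in status.owner_perms
    # Named-user ACL entries take precedence over group permissions.
    if user in status.acl.get("users", {}):
        return action in status.acl["users"][user]
    if status.group in user_groups:
        return action in status.group_perms
    for g in user_groups:
        if g in status.acl.get("groups", {}):
            return action in status.acl["groups"][g]
    return action in status.other_perms

# Hypothetical file: owned by 'etl', group 'technology',
# with ACL entries for user 'auditor1' and group 'batch'.
reports = FileStatus(
    owner="etl", group="technology",
    owner_perms="rw", group_perms="r", other_perms="",
    acl={"users": {"auditor1": "r"}, "groups": {"batch": "rw"}},
)

print(is_allowed(reports, "etl", ["technology"], "w"))       # owner bits
print(is_allowed(reports, "auditor1", ["management"], "r"))  # named-user ACL
print(is_allowed(reports, "guest", ["sales"], "r"))          # falls to 'other'
```

In the real HDFS implementation all matching group entries are unioned before the decision, so treat this strictly as a sketch of the idea, not of the exact algorithm.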


We decided to use Apache Sentry for configuring role-based access control, so that we would have a centralized system to manage the various roles and permissions.

Apache Sentry is a granular, role-based authorization module for Hadoop. It allows you to define authorization rules to validate a user or application’s access requests for Hadoop resources. Sentry is highly modular and can support authorization for a wide variety of components in Hadoop. Sentry relies on underlying authentication systems, such as Kerberos or LDAP, to identify the user. It also uses the group mapping mechanism configured in Hadoop to ensure that Sentry sees the same group mapping as other components of the Hadoop ecosystem.

We created multiple groups in the AD directory, e.g. Management, Technology, Batch, and a couple of admin groups. Then we created various roles in Sentry, such as 'Auditor', 'Read-Only', and 'Cluster Administrator', based on what kinds of roles were required.

Based on these groups and roles, suitable role policies were assigned to the groups: for example, the Auditor role was assigned to the management group, which consisted of project managers and architects; the Read-Only role was assigned to the technology team; and the other admin roles were assigned to the admin groups.

The biggest advantage of role-based access control (RBAC) is that it makes managing changes such as adding or deleting users and granting or revoking rights pretty easy. Another thing we experienced is that configuring RBAC through Sentry has a huge benefit: you do not have to repeat the configuration for the various other Hadoop components, since Sentry takes care of propagating it across them.
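The group-to-role-to-privilege mapping described above can be sketched as follows. The group, role, and privilege names mirror the hypothetical ones used in this article; in practice Sentry stores and enforces these mappings centrally across the Hadoop components.

```python
# Sketch of the group -> role -> privilege model behind RBAC.
# All names here are hypothetical, for illustration only.
GROUP_ROLES = {
    "management": ["auditor"],
    "technology": ["read_only"],
    "hadoop_admins": ["cluster_administrator"],
}

ROLE_PRIVILEGES = {
    "auditor": {"audit.view", "table.select"},
    "read_only": {"table.select"},
    "cluster_administrator": {
        "audit.view", "table.select", "table.insert", "cluster.manage",
    },
}

def privileges_for(groups):
    """Union of privileges over every role mapped to the user's groups."""
    privs = set()
    for g in groups:
        for role in GROUP_ROLES.get(g, []):
            privs |= ROLE_PRIVILEGES.get(role, set())
    return privs

print(privileges_for(["technology"]))                       # read-only access
print("cluster.manage" in privileges_for(["management"]))   # auditors cannot administer
```

The point of the indirection is exactly what the article describes: when a user joins or leaves, you only change their group membership in AD, and the role and privilege assignments stay untouched.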


Phase 4 of Hadoop Security Measures

Suddenly one day the customer noticed some data discrepancies and started asking questions like:

What happened to my customer details data? Why is it showing ambiguous values?

Which tables store customer data?

What sources feed into the customer data?

Which users accessed the files that feed customer data?

Which users queried/modified the tables containing customer data?

What did the user ‘cust_rep1’ do on Saturday the 29th of July?

What operations did the user perform? And many other questions.


We started wondering why the customer was asking us these questions; we are not employed in that organization, nor are we supposed to know the answers.

Well, it is correct that we are not the right persons to answer these questions, and the customer should know all this themselves. But the most important thing to ponder is this: the customer uses the application built by us for data storage, processing, and access. Had we empowered the application to answer all these questions for the customer?

The simple answer was 'NO', and we found that a major area, data governance, was missing from the application. But this was discovered well in time, before the application was announced for use by all the organization's users.

So, we decided to use Cloudera Navigator and, based on the customer's requirements, introduced the following data governance features in our application:

a) Data Auditing

Due to the sensitive data, the customer wanted to know exactly who was accessing their data (Tom? John? Lucy?), what data they were accessing, and how they were using it. The idea is to ensure that the correct governance measures are in place to protect the sensitive data, to enable proactive/reactive measures based on data usage patterns, and to track down any malicious data modification or access to its root.

We didn't have to do much configuration, as Cloudera Navigator already captures audit data for various Hadoop components like HDFS, Hive, Impala, HBase, etc. Of course, we had to create new roles specifically for data auditing purposes, and these were provided to users at the highest level.
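As an illustration, answering a question like 'What did the user cust_rep1 do on Saturday the 29th of July?' amounts to filtering the audit trail by user and date. The records below are made up for the sketch; in our case Cloudera Navigator collects and queries the real audit events.

```python
from datetime import date, datetime

# Hypothetical audit records in the general shape an audit trail exposes:
# who, when, which service, what operation, on which resource.
AUDIT_LOG = [
    {"user": "cust_rep1", "time": datetime(2017, 7, 29, 10, 5),
     "service": "hive", "operation": "QUERY", "resource": "sales.customers"},
    {"user": "cust_rep1", "time": datetime(2017, 7, 29, 10, 7),
     "service": "hdfs", "operation": "OPEN", "resource": "/data/customers/part-0"},
    {"user": "analyst2", "time": datetime(2017, 7, 30, 9, 0),
     "service": "impala", "operation": "QUERY", "resource": "sales.orders"},
]

def activity(user, day):
    """Answer 'what did this user do on this day?' from the audit trail."""
    return [e for e in AUDIT_LOG
            if e["user"] == user and e["time"].date() == day]

for event in activity("cust_rep1", date(2017, 7, 29)):
    print(event["time"], event["service"], event["operation"], event["resource"])
```

The same filter, run over resource instead of user, answers 'which users queried or modified the tables containing customer data?'.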

b) Data Lineage

Finding out data lineage would help answer the question 'Which sources feed the customers table?'. We decided to use Cloudera Navigator to enable data lineage, as it provides inbuilt lineage functionality for various Hadoop components like HDFS, Sqoop, Spark, Impala, etc. Moreover, both forward and backward lineage is available, right up to the column level, which is amazing. We were able to view lineage for various tables as well as queries without any coding or configuration. This was a huge value add for the customer, as no technical expertise or knowledge is needed.
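Conceptually, backward lineage is a walk over a graph of source-to-target edges. The sketch below uses hypothetical table and file names; Navigator builds and visualizes the real graph automatically from the cluster's activity.

```python
# Toy backward-lineage walk: target -> list of direct upstream sources.
# The edges below are hypothetical, for illustration only.
FEEDS = {
    "dw.customers": ["staging.crm_extract", "staging.web_signups"],
    "staging.crm_extract": ["/landing/crm/daily.csv"],
    "staging.web_signups": ["/landing/web/signups.json"],
}

def upstream_sources(entity):
    """All transitive upstream sources feeding `entity` (backward lineage)."""
    seen = set()
    stack = [entity]
    while stack:
        node = stack.pop()
        for parent in FEEDS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream_sources("dw.customers")))
```

Forward lineage is the same walk over the reversed edges, which is how 'what downstream reports break if this file is corrupted?' gets answered.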

c) Searchable Metadata

To answer a question like 'Which tables store customer data?', you would either need to open the design document or go through the individual tables in Hive to check. The simplest option, if it were possible, would be to just ask Google.

You might be wondering where Google comes into the picture here. It is not exactly Google, but Navigator provides search features based on tags, which help us search entities by keyword. We scheduled multiple meetings with the business users to understand the business language as well as some of the functionality, and we took help from the technical teams who managed the various source systems. Based on all this information, we created a simple tagging document and used it to tag the various entities on the big data platform with meaningful business acronyms. This enabled a simple yet effective metadata search functionality in the application using Navigator.
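The tag-based search itself is conceptually simple: entities are labeled with business terms from the tagging document and then looked up by keyword. The entity and tag names below are hypothetical; Navigator provides this search out of the box once the tags are applied.

```python
# Sketch of tag-based metadata search. Entities on the platform are
# tagged with business terms; a search returns every matching entity.
# Entity and tag names are hypothetical.
ENTITY_TAGS = {
    "dw.customers": {"customer", "pii", "crm"},
    "dw.orders": {"customer", "sales"},
    "dw.inventory": {"warehouse", "stock"},
}

def search(tag):
    """Return every entity tagged with the given business term."""
    return sorted(name for name, tags in ENTITY_TAGS.items() if tag in tags)

print(search("customer"))  # tables that store customer data
```

This is why the business-language meetings mattered: the search is only as good as the tags, so the terms had to match the vocabulary the business users would actually type.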

It was a wonderful experience: there were lots of learnings implementing this project and, most importantly, a happy and satisfied customer.

I hope our Hadoop security and governance experience proves helpful to customers, as well as to individuals looking forward to implementing Hadoop security on big data platforms. In case of any questions/queries, please feel free to reach out to me.


References – https://www.cloudera.com/documentation/enterprise/latest/PDF/cloudera-security.pdf


I work with Ellicium Solutions Pvt. Ltd. as an AVP, looking after projects in the big data analytics area, helping clients stay ahead of the competition and, more importantly, serve their customers well.