Hadoop Security: Prime Areas To Focus On
“The precondition to freedom is security”
These are the words of Rand Beers, security advisor to President Barack Obama.
Similarly, the key to using big data applications effectively and to the fullest is how secure those applications actually are.
After successfully implementing our data ingestion and analysis platform, Gazelle™, for one of our clients in the IoT (IIoT) space in January 2017, our next project was to secure that application. The goal was to let the customer make maximum use of it without exposure to external threats, while staying comfortably placed to comply with any regulations.
Based on the customer’s preferences, we have been using the Cloudera distribution of Hadoop to implement big data projects for this client.
Conceptually, there are four prime areas we focused on to beef up Hadoop security for this application:
- Entry-level authentication
- Data access controls
- Data security
- Data governance
While planning this project, we decided to go in phases rather than eat the entire pie in one go. I have always advocated a phased approach to big, complex projects. It not only helps us understand the risks and complexities involved, it also lets us validate our approach: either it is the right one and will get us to the target smoothly, or we change course and come up with a more suitable one.
We planned the following phases for implementing the Hadoop security measures:
- Phase 1: We start with the basics, i.e. we set up authentication checks to prove that the users/services accessing the cluster are who they claim to be. This involves setting up users/groups in AD as well as configuring the access controls.
- Phase 2: With authentication configured for the users/groups, in this phase we protect data at rest as well as data in motion by introducing data encryption. We also need to make sure that sensitive data is not visible to end users, which we address with data masking.
- Phase 3: In this phase, we introduce measures to control who can view what, based on the authorization level set up for the users/groups. This needs to be done at the level of each individual Hadoop component.
- Phase 4: For more robust security, data governance also needs to be taken into account. Governance includes auditing access to data residing in metastores, reviewing and updating metadata, and discovering the lineage of data objects.
Let’s start with Phase 1, where we work on the basics: securing the gates of our castle.
Phase 1 of Hadoop Security Measure – User Authentication
The purpose of authentication is to make sure that the people and systems accessing the application are the ones who are supposed to use it. Typically, authentication in enterprises is managed through a single distributed system, such as an LDAP directory, which relies on username/password mechanisms.
We decided to use Kerberos, a common and secure enterprise-grade authentication system. Kerberos provides strong security benefits, including capabilities that render intercepted authentication packets unusable by an attacker. It virtually eliminates the threat of impersonation by never sending a user’s credentials in cleartext over the network.
The user credentials were stored in Active Directory (AD), which uses Kerberos for authentication.
We used Cloudera Manager’s Kerberos wizard to automate Kerberos configuration on the cluster. Cloudera Manager was configured to authenticate users against its internal database as well as the external AD. A couple of groups already existed in AD, and we granted login access according to which groups were supposed to use this application. An IT group was given full administrator access to Cloudera Manager.
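One quick way to sanity-check such a setup is to inspect the client configuration. The sketch below is illustrative only and is not taken from our project: it assumes the standard Hadoop properties hadoop.security.authentication and hadoop.security.authorization in core-site.xml, and both the function name and the sample XML are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what a secured cluster's core-site.xml might contain.
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
"""


def kerberos_enabled(core_site_xml: str) -> bool:
    """Return True if the XML enables Kerberos authentication and authorization."""
    root = ET.fromstring(core_site_xml)
    # Collect every <property> as a name -> value pair, ignoring stray whitespace.
    props = {
        (p.findtext("name") or "").strip(): (p.findtext("value") or "").strip()
        for p in root.iter("property")
    }
    return (
        props.get("hadoop.security.authentication") == "kerberos"
        and props.get("hadoop.security.authorization") == "true"
    )


print(kerberos_enabled(SAMPLE_CORE_SITE))
```

On a real cluster you would read the deployed core-site.xml instead of the sample string; where that file lives depends on the installation.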
Phase 2 of Hadoop Security Measure – Data Encryption
In this phase, all data on the cluster, at rest and in motion, must be encrypted, and sensitive data needs to be masked. A completely secure enterprise data hub is one that can stand up to the audits required for compliance with PCI, HIPAA and other common industry standards.
Our strategy for data encryption was to encrypt the data on HDFS and use the Cloudera Navigator Key Trustee Server to store the keys. The reason for going with HDFS encryption is that, unlike OS-level and network-level encryption, HDFS transparent encryption is end-to-end. That is, it protects data both at rest and in motion, which makes it more efficient than implementing a combination of OS-level and network-level encryption.
HDFS encryption implements transparent, end-to-end encryption of data read from and written to HDFS, without requiring changes to application code. Because the encryption is end-to-end, data can be encrypted and decrypted only by the client acting for an authorised user. HDFS itself does not store or have access to unencrypted data or encryption keys. This supports both at-rest encryption (data on persistent media, such as a disk) and in-motion encryption (data traveling over a network).
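To make the end-to-end idea concrete, here is a toy sketch of envelope encryption, the general scheme behind HDFS transparent encryption: each file's data key is stored only in wrapped (encrypted) form, and only the key server can unwrap it for an authorised user. Everything in it is a simplifying assumption. The class and function names are hypothetical, and the cipher is a deliberately simple SHA-256-based XOR stream standing in for AES, so it must not be used for real data.

```python
import hashlib
import os
from typing import Set, Tuple


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (NOT AES, not secure). The same call encrypts and decrypts."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKeyServer:
    """Stands in for the key server: the zone key never leaves this object."""

    def __init__(self, allowed_users: Set[str]) -> None:
        self._zone_key = os.urandom(32)
        self._allowed_users = allowed_users

    def new_wrapped_data_key(self) -> bytes:
        # A fresh per-file data key, returned only in encrypted (wrapped) form.
        return toy_cipher(self._zone_key, os.urandom(32))

    def unwrap(self, wrapped_key: bytes, user: str) -> bytes:
        if user not in self._allowed_users:
            raise PermissionError(f"{user} may not use this zone key")
        return toy_cipher(self._zone_key, wrapped_key)


def client_write(server: ToyKeyServer, user: str, plaintext: bytes) -> Tuple[bytes, bytes]:
    """Encrypt on the client; storage receives only the wrapped key and ciphertext."""
    wrapped_key = server.new_wrapped_data_key()
    data_key = server.unwrap(wrapped_key, user)
    return wrapped_key, toy_cipher(data_key, plaintext)


def client_read(server: ToyKeyServer, user: str, stored: Tuple[bytes, bytes]) -> bytes:
    """Decrypt on the client after the key server unwraps the data key."""
    wrapped_key, ciphertext = stored
    return toy_cipher(server.unwrap(wrapped_key, user), ciphertext)
```

The point of the sketch is the division of labour: the storage layer only ever holds the wrapped key and the ciphertext, so reading a file requires both access to the stored bytes and permission from the key server.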
Again, enabling HDFS encryption was easily achieved using the Cloudera Manager wizard. Here we also configured the Cloudera Navigator Key Trustee Server to store the keys, which you should do for production systems. Enabling HDFS encryption involves many steps, but it is well documented by Cloudera and easy to follow. We also performed a couple of steps to secure data transport for HDFS and HBase. In HDFS, data is transported between DataNodes and clients as well as among DataNodes; in HBase, it is transported between HBase Masters and RegionServers.
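Once the key management service is in place, creating an encryption zone typically comes down to a handful of standard Hadoop commands. The sketch below only assembles and prints them by default (a dry run); the key name and path are hypothetical, and on a real cluster the key is usually created by a key administrator and the zone by an HDFS superuser, on an empty directory.

```python
import shlex
import subprocess
from typing import List


def encryption_zone_commands(key_name: str, zone_path: str) -> List[List[str]]:
    """Build the usual command sequence for creating an HDFS encryption zone."""
    return [
        ["hadoop", "key", "create", key_name],           # create the zone key
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],      # the zone directory must exist and be empty
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
        ["hdfs", "crypto", "-listZones"],                # verify the zone was created
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for cmd in commands:
        line = " ".join(shlex.quote(part) for part in cmd)
        print(line)
        rendered.append(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return rendered


# Hypothetical key name and path, printed but not executed.
run(encryption_zone_commands("demo-zone-key", "/data/secure"))
```

Keeping the dry run as the default makes it easy to review the exact commands before running anything against a live cluster.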
Hadoop Security Measure – Data Masking
Securing our data at rest and in motion was a very pleasant feeling, as this was a very important task for us. But the work on data was not over yet. The data was encrypted, but a user with administrator rights could still decrypt it with the keys and view it. So, the next step was to mask the sensitive data stored on Hadoop. Our customer had sensitive domain knowledge, such as formulae and expressions for various processes, which was critical to them and which they could not afford to have leaked out of the organization at any cost. This was the client’s main intellectual property, driving their business and ultimately their revenue. Apart from this, there was also critical end-customer information that was sensitive and had to be masked. This masking is called data redaction, and it can be enabled or disabled for the whole cluster with a simple HDFS service-wide configuration change.
Using Cloudera Manager, we set up quite a few redaction rules, because the data to be masked was not standard, like credit card or SSN information, but formulae and expressions as well as customer credentials. Each redaction rule is built from the following components:
a) Search –
We had to search for formulae/expressions as well as customer emails and phone numbers, to name a few, and built a regular expression for each. If the regular expression matches any part of the data, the match is replaced by the contents of the replace string.
b) Replace –
The string used to replace the sensitive data.
c) Trigger –
This component proved to be very important for us. It specifies a simple string to look for in the data; the redactor searches for matches to the Search regular expression only if this string is found. The component is optional: if no trigger is specified, redaction occurs whenever the Search regular expression matches. The Trigger field also improves performance, since simple string matching is faster than regular expression matching. Our regular expressions for data masking were turning out to be complex, so having this trigger component helped us a lot.
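The way the three components interact can be sketched in a few lines. This is a simplified model of the behaviour described above, not Cloudera's implementation, and the rules shown are hypothetical stand-ins: the real expressions we used for the customer's formulae were more involved.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RedactionRule:
    search: str                    # regular expression to look for
    replace: str                   # text substituted for every match
    trigger: Optional[str] = None  # optional plain string; the regex runs only if it is present


def redact(text: str, rules: List[RedactionRule]) -> str:
    """Apply each rule in order, skipping rules whose trigger is absent from the text."""
    for rule in rules:
        # Cheap substring test first: avoids running the regex on most input.
        if rule.trigger is not None and rule.trigger not in text:
            continue
        text = re.sub(rule.search, rule.replace, text)
    return text


# Hypothetical example rules for emails, phone numbers and a "formula = ..." field.
RULES = [
    RedactionRule(search=r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", replace="EMAIL-REDACTED", trigger="@"),
    RedactionRule(search=r"\b\d{3}-\d{3}-\d{4}\b", replace="PHONE-REDACTED"),
    RedactionRule(search=r"(formula\s*=\s*)\S+", replace=r"\g<1>FORMULA-REDACTED", trigger="formula"),
]

print(redact("contact jane.doe@example.com or 555-123-4567", RULES))
print(redact("formula = k1*exp(-Ea/RT)", RULES))
```

Note that the phone rule has no trigger, so its regular expression is evaluated on every input, while the email and formula rules are skipped entirely unless "@" or "formula" appears in the text.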
Once the rules were identified, a simple configuration change in Cloudera Manager enabled log and query redaction, and we then added the rules we had identified earlier. I really appreciate Cloudera’s documentation, which is crystal clear, as well as the ease with which the various configuration tasks can be performed.
Architecting and implementing Hadoop security for our customer was a very interesting exercise. If you think so too, it’s not over yet. In my next article, I’ll cover the remaining two phases, including data governance, as well as some other aspects of Hadoop security that we took care of in this program.