How To Ensure A Successful Big Data Proof of Concept?
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” This is how Abraham Lincoln described his habit of meticulous planning. At Ellicium, we have taken this quote seriously when it comes to planning a Big Data implementation.
We have helped a number of clients move from conceptualization to Big Data Proof of Concept (POC) to production implementation of various Big Data use cases, mainly for streaming and Internet of Things (IoT) data, using Ellicium’s Gazelle. A common factor in all successful journeys from Big Data Proof of Concept to production has been ‘planning well even before the POC’.
This is now a routine journey for us, given that the solution has matured with each implementation. Lessons from each Big Data Proof of Concept have helped us make the next one more effective in terms of value delivered and readiness for production implementation. Some key lessons we have learnt are:
- Define governance – Big Data projects can turn out to be among the most multi-dimensional endeavours in an organization. In one of our implementations, the client’s legal team was required to comment on the privacy issues involved in extracting customer data from the web. Business users will constantly need to comment on whether additional investment in Big Data is resulting in a multi-fold increase in decision-making capability. The list can go on and on. All these decisions need to be orchestrated from an organizational perspective by a committee of senior executives. Based on our experience, we recommend establishing a ‘Big Data Governance Council’, and several of our clients have set one up, so that the Big Data Proof of Concept adds value to the organization’s balance sheet and does not remain a one-off technology initiative.
- Do you really need a Big Data platform? – Many times we have come across clients who do not actually have a Big Data scenario but are ‘lured’ by the utopian promises of Big Data. We had a customer who was managing a highly complex business with Excel spreadmarts. They certainly needed a data warehouse, but not a Hadoop-based Big Data system. Evaluate your data volumes, their potential growth, and the ability of your existing technology stack to meet the demand. Developing and managing a Big Data system is a big ask that may consume a lot of your IT bandwidth. This is common knowledge, but we have come across client situations where this basic fact needed to be reiterated.
- Involve the Big Data distribution vendor at the POC conceptualization stage – Cloudera, DataStax, Hortonworks, IBM, MapR, Google: whichever distribution of Hadoop / Big Data you choose for the POC, it is a must to involve the vendor from the conceptualization stage. Vendors are a great help in reviewing architecture, sizing hardware, and sharing experience from past implementations. Although our velocity data solution has a recommended stack of technologies, we always make sure we consult our Big Data distribution partner on critical decisions. This has averted situations such as underpowered hardware or the wrong choice of visualization tool (due to client insistence).
- Consider the skills available in your organization – In most cases, maintaining a Big Data application needs strong Linux skills (HDP from Hortonworks is an exception). Does your organization have the skills to maintain an environment of Linux machines? Do you have system administrators and programmers? One of our clients was a Microsoft shop, and we were implementing a mid-sized Hadoop cluster for their IoT data. We asked the IT managers to start hiring Linux system administrators and programmers so that, by the time the POC was completed, the client had a team to own the production implementation.
- Agile, agile and agile – Big Data projects are risky and uncertain: the technologies involved are constantly evolving, users’ data needs may not be crystal clear, and new findings may change the direction. Following a ‘waterfall style’ project plan is the biggest mistake a PM for a Big Data Proof of Concept can commit. Agile helps to shift the focus as the need of the hour changes. In one implementation, we knew after the first two weeks of the project that performance was going to be a bottleneck. We therefore stopped all further work on data visualization and focused all our energy on tuning hardware and software for higher performance. This averted a ‘functionally correct but nonperforming’ product. There will also be times when the data being used is of bad quality and alternate sources have to be identified. All this requires an agile methodology to survive and succeed.
- Commodity hardware is not so cheap! – What everybody has heard about Big Data is ‘you throw in some commodity hardware and reap the benefits of distributed processing’. At least, that is the perception among some non-IT business executives. But remember that commodity hardware does not mean cheap, low-capacity hardware. In fact, for cases like processing streaming data (which is what we do for a living), we recommend a strong cluster of nodes, each with at least 16-32 GB of RAM and a 1 Gbps network connection. Be aware of these requirements before you commit to a Big Data Proof of Concept. One of our clients almost killed their Big Data project by expecting a scrawny cluster of 4 GB RAM machines to process a billion records per hour.
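
The hardware point lends itself to a quick back-of-the-envelope check before any machines are ordered. The sketch below is illustrative only, not our sizing method. The billion-records-per-hour target comes from the example above; the average record size (1,000 bytes), the usable share of each network link (50%), and the function names are hypothetical assumptions, and the connection mentioned above is read as a 1 gigabit-per-second link.

```python
import math

# Back-of-the-envelope ingest sizing. Only the records-per-hour target comes
# from the article; record size and usable link fraction are assumptions.


def required_throughput(records_per_hour: int, avg_record_bytes: int) -> tuple[float, float]:
    """Return (records per second, megabits per second) for an ingest target."""
    records_per_sec = records_per_hour / 3600
    megabits_per_sec = records_per_sec * avg_record_bytes * 8 / 1_000_000
    return records_per_sec, megabits_per_sec


def min_nodes_for_network(megabits_per_sec: float, link_mbps: float, usable_fraction: float) -> int:
    """Smallest node count whose combined usable link capacity covers the ingest rate."""
    usable_per_node = link_mbps * usable_fraction
    return max(1, math.ceil(megabits_per_sec / usable_per_node))


if __name__ == "__main__":
    rps, mbps = required_throughput(records_per_hour=1_000_000_000, avg_record_bytes=1000)
    nodes = min_nodes_for_network(mbps, link_mbps=1000, usable_fraction=0.5)
    print(f"{rps:,.0f} records/s, {mbps:,.0f} Mbps, at least {nodes} nodes for network alone")
```

Under these assumptions, a billion records per hour works out to roughly 278,000 records per second and about 2.2 Gbps of raw ingest, so network capacity alone calls for several nodes before memory or CPU is even considered. Swap in your own record size and link figures; the point is to do this arithmetic before committing to a POC.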
Once the value of Big Data has been proven during the Proof of Concept, your business users will want a fast transition to production implementation. The points discussed above help ensure that this bridge is crossed with minimal hassle.