5 Data Warehouse implementation mistakes to avoid in Big Data Projects
Data warehouse implementations are tricky. An enterprise data warehouse takes months to build, and, most importantly, the failure rate of data warehousing projects is very high. Various studies have reported failure rates of 50 to 60 percent for data warehouse implementations.
Over the last 15 years, I have worked with dozens of clients, ranging from the world’s largest banks to start-up companies. One thing I have consistently experienced is that data warehouse projects do fail, and they fail for many reasons. During one of my initial big data implementations, the client sponsor for a streaming big data project asked me: ‘Can you build upon your past experience and tell me the probable reasons we may not succeed in this initiative?’ Oh yes! I have seen plenty of reasons that caused data warehouse projects to fail.
By ‘Big Data’, I am referring to analytical systems built on Hadoop, the use of big data technologies to augment existing data warehouses, systems built to analyse streaming data for predictive analytics, and other similar systems. By a successful data warehouse project, I am referring to one that is delivered on time and within budget but, more importantly, one that business users actually use for decision making.
Here is a list of common aspects that I have observed in failed Data warehouse implementations.
Long development cycles

Many unsuccessful data warehouse implementations are characterized by long development cycles. I have seen failed EDW projects that took two years to complete. During such a project, the team composition changed, end users were replaced, and the budget skyrocketed. The CIO or sponsor ended up preoccupied with justifying the cost of the implementation from the data warehouse. Taking a cue from this, big data implementations should avoid long development cycles. An agile methodology is well suited to big data implementations, considering the exploratory nature of these projects. Short sprints of 2-3 weeks, continuous testing and deployment, and regular reprioritization of requirements are a must for big data projects.
Lack of focus on Data Quality
Poor data quality has been a major reason for business users to stop using a data warehouse. Many a time I have seen scenarios where the data warehouse was delivered within budget, but the data quality was poor. A lack of focus on quality testing and a lack of understanding of the source data are some of the reasons for this. The same situation is very likely in any big data project. Consider a scenario where a manufacturing company has implemented a Hadoop-based predictive maintenance system. It has been tested and is now ready to predict failures. Then some incoming data from the machine sensors fails to load due to data format errors, and the system does not alert the user about an impending failure. This will cause huge disbelief in the system.
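One way to keep such format errors from silently eroding trust is to validate incoming sensor records before loading them, quarantining bad records for inspection instead of dropping them. The sketch below is illustrative only: the field names (`sensor_id`, `timestamp`, `temperature`) and the JSON record format are my assumptions, not taken from any specific system.

```python
import json

# Hypothetical schema for incoming sensor readings; field names
# are illustrative, not from any real predictive maintenance system.
REQUIRED_FIELDS = {"sensor_id": str, "timestamp": str, "temperature": float}

def validate_record(raw: str):
    """Return (record, None) if valid, else (None, error message)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"unparseable JSON: {exc}"
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            return None, f"missing field: {name}"
        if not isinstance(record[name], expected_type):
            return None, f"bad type for {name}: {type(record[name]).__name__}"
    return record, None

def split_batch(raw_records):
    """Partition a batch into loadable records and quarantined errors.

    Quarantined records are kept with their error message so they can
    trigger an alert and be inspected, rather than vanishing silently."""
    good, quarantined = [], []
    for raw in raw_records:
        record, error = validate_record(raw)
        if error:
            quarantined.append((raw, error))
        else:
            good.append(record)
    return good, quarantined
```

The key design choice is that a failed load is surfaced (alert, quarantine queue) rather than swallowed, so users hear about data gaps before they discover them the hard way.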
Treating data projects as pure IT projects rather than business endeavours
This has been a common factor in the failed data warehouse implementations I have observed in very large organizations. A head of Business Intelligence claiming that his team understands the business requirements better than the business users do has been a common scenario. The same thing is already happening in big data implementations: IT departments are implementing Hadoop clusters and pumping in data without any meaningful involvement of the business community. The business community will then question the spend and ROI of the implementation.
Conducting weekly user demos, releasing partial data to business users early, and holding workshops that show how other companies are using similar data go a long way in keeping the business user community “involved” in big data initiatives.
Data silos and data proliferation
During the early 2000s, a number of data marts, data warehouses, and personal data marts were developed in mid-sized to large organizations. Data was extracted from operational systems multiple times into these analytical systems, and over time, data silos and data proliferation developed. With the ease of access to external and social media data, I foresee a similar silo-and-proliferation scenario with Big Data / Hadoop systems. Nothing can stop an analyst from dumping Facebook comments, census data, and government data into her personal Hadoop datasets and running R algorithms on them.
Now imagine multiple analysts doing this in the same organization! A strong metadata management system and data governance help avoid data silos and data proliferation, especially in large organizations. During the later part of the last decade, a number of metadata and data governance initiatives were launched in organizations that already had data warehouses. Rather than being treated as an afterthought, data governance should be part of the big data implementation from the start.
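As a minimal illustration of what even lightweight governance tooling can catch, the sketch below registers datasets in a shared catalog and flags when a new dataset pulls from a source system that is already catalogued. Everything here (the class names, the fields tracked) is a hypothetical toy; a real metadata management tool would track far more, such as lineage, quality scores, and access policies.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Toy metadata record for one analytical dataset."""
    name: str
    owner: str
    source_system: str
    tags: list = field(default_factory=list)

class Catalog:
    """Shared registry that surfaces potential duplicate extracts."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry):
        # Before adding, list existing datasets drawn from the same
        # source system, so the analyst can reuse one instead of
        # creating yet another private copy (a new silo).
        duplicates = [e.name for e in self._entries.values()
                      if e.source_system == entry.source_system]
        self._entries[entry.name] = entry
        return duplicates
```

Even this crude check changes the default behaviour: the second analyst to extract the same source sees that a copy already exists, which is the seed of governance.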
Lack of focus on non-functional requirements

As a fallout of treating data warehouse initiatives as IT projects, there is often a lack of focus on non-functional requirements. Long-running queries, slow report response times, badly designed UIs, and long data warehouse load cycles have all contributed to killing many data warehouse initiatives. Beyond the technology and data angles, big data projects will have to focus on these non-functional, or ‘usability’, aspects to get buy-in from business users.
In short, strong data governance, close involvement of business stakeholders, an agile approach, and a focus on user experience are mandatory for the success of a big data program.