5 Data Warehouse implementation mistakes to avoid in Big Data Projects
Data warehouse implementations are tricky. An enterprise data warehouse takes months to build, and the costs involved are very high. Most importantly, the failure rates of data warehousing projects are high: various studies have reported a failure rate of 50 to 60 percent for data warehouse implementations.
Over the last 15 years, I have worked with dozens of clients ranging from the world’s largest banks to start-up companies. Despite the variety of clients, domains and technologies, one thing I’ve consistently experienced is that data warehouse projects do fail, and they fail for many reasons. During my initial implementations of Big Data solutions, a client sponsor for a streaming big data project asked me, ‘Can you build upon your past experience and tell me the probable reasons we may not succeed in this initiative?’ Oh yes! I have seen plenty of reasons that caused data warehouse projects to fail, and the same reasons are pushing big data projects down that path.
By ‘Big Data’, I am referring to analytical systems built on Hadoop, the use of big data technologies to augment existing data warehouses, systems built to analyse streaming data for predictive analytics purposes, and other similar systems. By a successful data warehouse project, I am referring to a project that is delivered on time and within budget but, more importantly, one whose data warehouse business users actually use for decision making.
Here is a list of common aspects that I have observed in failed data warehouse implementations and that are likely to cause failure of big data implementations.
Many failed data warehouse implementations are characterized by long development cycles. I have seen failed EDW projects that took two years to complete. During the project, the team composition changed, end users were replaced, and budgets skyrocketed. The CIO or sponsor became more concerned with justifying the cost of the implementation than with achieving business results from the data warehouse. Taking a cue from this, big data implementations should avoid long development cycles. Agile methodology is especially well suited for big data implementations, considering the exploratory nature of these projects. Short sprints of 2-3 weeks, continuous testing and deployment, and regular reprioritization of requirements are a must for big data projects.
Lack of focus on data quality has been a major reason business users stop using a data warehouse. Many a time I have seen scenarios where the data warehouse was developed within the allotted budget and timelines were adhered to, but data quality was poor. Lack of focus on quality testing and a poor understanding of source data are some reasons for this. The same situation is very likely in any big data project. Consider a scenario where a manufacturing company has implemented a predictive maintenance system based on Hadoop. It has been tested and is now ready to predict failures. Then some incoming data from machine sensors does not get loaded due to data format errors, and the system fails to alert the user about an impending failure. This will cause a huge loss of trust in the system.
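The sensor scenario above can be guarded against with a simple validation step at ingest, so that malformed records are surfaced and flagged rather than silently dropped. Below is a minimal Python sketch of this idea; the field names (`machine_id`, `timestamp`, `vibration_mm_s`) are hypothetical, not from any specific system.

```python
# Hedged sketch: validate incoming sensor readings before loading them,
# and route rejects to a visible queue instead of dropping them silently,
# so a data-format problem never quietly masks a machine failure.

EXPECTED_FIELDS = {"machine_id", "timestamp", "vibration_mm_s"}  # hypothetical schema


def validate_reading(record):
    """Return a list of data-quality errors; an empty list means loadable."""
    errors = []
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    try:
        value = float(record.get("vibration_mm_s", "nan"))
        if value != value:  # NaN check
            errors.append("vibration_mm_s is not a number")
    except (TypeError, ValueError):
        errors.append("vibration_mm_s is not numeric")
    return errors


def split_batch(batch):
    """Partition a batch into loadable records and flagged rejects."""
    loadable, rejects = [], []
    for record in batch:
        errs = validate_reading(record)
        if errs:
            rejects.append((record, errs))  # keep the errors for reporting
        else:
            loadable.append(record)
    return loadable, rejects


if __name__ == "__main__":
    batch = [
        {"machine_id": "M1", "timestamp": "2024-01-01T00:00:00", "vibration_mm_s": "4.2"},
        {"machine_id": "M2", "timestamp": "2024-01-01T00:00:00", "vibration_mm_s": "high"},
    ]
    good, bad = split_batch(batch)
    print(len(good), len(bad))  # 1 1
```

The key design point is that `split_batch` never discards a record: rejects come back alongside their error descriptions, so operations teams can see that data is failing to load instead of discovering it only when an alert never fires.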
Treating data projects as pure IT projects rather than business endeavors has been a common factor in failed data warehouse implementations that I have observed in very large organizations. A head of Business Intelligence claiming that his team understands business requirements better than the business users do has been a common scenario. This is already happening in big data implementations: IT departments are implementing Hadoop clusters and pumping in data without meaningful involvement of the business community. These implementations are going to be questioned on spend and ROI by that same community. We have experienced that conducting weekly user demos, releasing partial data to business users early, and running workshops to keep business users updated on how other companies are using similar data go a long way in keeping the business user community involved in big data initiatives.
During the late 1990s and early 2000s, a number of data marts, data warehouses, and personal data marts (in SAS, Excel, MS Access) were developed in most mid-size to large organizations. Data was extracted from operational systems multiple times into these analytical systems. Over time, data silos and data proliferation developed. With the ease of access to external and social media data, I foresee a similar silo and proliferation scenario with Big Data / Hadoop systems. Nothing can stop an analyst from dumping Facebook comments, census data, and government data into her personal Hadoop datasets and running R algorithms on them. Imagine this being done by multiple analysts in the same organization! A strong metadata management system and data governance help avoid data silos and data proliferation, especially in large organizations. During the later part of the last decade, a number of metadata and data governance initiatives were launched in organizations that already had data warehouses. Rather than treating data governance as an afterthought, it should be part of big data implementations from the start.
As a fallout of treating data warehouse initiatives as IT projects, there is often a lack of focus on non-functional requirements. Long-running queries, slow report response, badly designed UIs, and long data warehouse load cycles have all contributed to killing many data warehouse initiatives. Beyond the technology and data angles, big data projects will have to focus on these non-functional, or ‘usability’, aspects to get buy-in from business users.
In short, strong data governance, close involvement of business stakeholders, an agile approach, and a focus on user experience are mandatory for the success of a big data program.