The Data Warehouse Of The Future

by Anup Purohit, CIO, Yes Bank

In today’s digital world, data is the single most valuable asset for organizations. The large number of ecosystems assimilating and analyzing all this data to uncover hidden nuggets is a testament to its overarching importance. Any medium to large organization counts a data warehouse implementation among its most strategic initiatives. The insights from a data warehouse serve a myriad of purposes, including but not limited to analytical campaigns, business and portfolio monitoring, performance analysis and risk management. Having said that, most data warehouses across organizations fail or do not deliver value in line with the costs incurred.

Let us first understand what a data warehouse is and what it is not. A DATA WAREHOUSE IS A CONCEPT. It is NOT A TECHNOLOGY. Essentially, a data warehouse is a representation of all the business processes that constitute an organization. It is a repository of data extracted, cleansed, transformed and integrated from multiple disparate systems to provide an easy and comprehensible bed for reporting and analysis. Depending upon the business function, data warehouses are required to store both current and historical data.
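To make the extract-cleanse-transform-integrate idea concrete, here is a minimal sketch in Python. The two source systems (a CRM and a core banking system), their field names, and the in-memory SQLite database standing in for the warehouse are all illustrative assumptions, not any real bank's schema:

```python
import sqlite3

# Hypothetical records extracted from two disparate source systems.
crm_rows = [{"cust_id": "C1", "name": " Asha Rao ", "city": "mumbai"}]
core_banking_rows = [{"customer": "C1", "balance": "125000.50"}]

def cleanse(name, city):
    # Cleansing: trim stray whitespace, normalize the city name's case.
    return name.strip(), city.strip().title()

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    cust_id TEXT PRIMARY KEY, name TEXT, city TEXT, balance REAL)""")

# Transform and integrate the two sources on the shared customer key.
balances = {r["customer"]: float(r["balance"]) for r in core_banking_rows}
for r in crm_rows:
    name, city = cleanse(r["name"], r["city"])
    conn.execute("INSERT INTO dim_customer VALUES (?, ?, ?, ?)",
                 (r["cust_id"], name, city, balances.get(r["cust_id"])))
conn.commit()

print(conn.execute("SELECT * FROM dim_customer").fetchall())
```

The point is the pattern, not the tooling: disparate, messy inputs become one clean, query-ready table keyed for analysis.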

The confusion around the data warehouse being mistaken for a piece of technology stems from the fact that traditionally all data warehouses have been implemented on relational databases (RDBMS). So it is not the concept which is a failure; it is the technological ecosystem hosting the concept which poses challenges to organizations. Here are the top limitations of traditional data warehouse technologies:

  • Traditional RDBMSs are well suited for lower volumes of data. For larger volumes, the engineered systems on offer are extremely complex to administer
  • They become cost prohibitive as the data sets start growing
  • Many data warehouses are very rigid and not flexible enough to adapt to dynamic changes in the organization and its requirements

As we live in a ‘DATA Age’, I believe the ‘Data Warehouse of the Future’ needs to provide organizations with the following capabilities:

1. It needs to handle large data volumes. We are digitizing every facet of our organizations and hence generating data like never before. The Data Warehouse of the Future should be able to ingest and work upon large data sets

2. New forms of data are emerging. Semi-structured and unstructured data sets are providing insights which traditional structured data could not provide earlier. The Data Warehouse of the Future should be able to handle these varied data sets and merge them seamlessly with structured data
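As an illustration of merging semi-structured data with structured data, here is a small Python sketch that flattens a hypothetical JSON event from a digital channel and joins it to a structured transaction row on a shared key. Every field name here is an assumption for illustration only:

```python
import json

# Structured row, e.g. from the core RDBMS (hypothetical schema).
structured = {"txn_id": "T42", "amount": 2500.0}

# Semi-structured payload, e.g. a JSON event from a mobile channel.
event = json.loads(
    '{"txn_id": "T42", "device": {"os": "Android", "app_ver": "9.1"}}')

def flatten(d, prefix=""):
    # Flatten nested JSON into dotted column names so it can sit
    # alongside structured columns in the warehouse.
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

# Merge the two on the shared transaction key into one wide record.
merged = {**structured, **flatten(event)}
print(merged)
```

The flattened record can now be loaded into the same table as purely structured rows, which is the "seamless merge" the capability calls for.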

3. The Data Warehouse of the Future has to be analytical. No longer is it sufficient to integrate as-is/cleansed data from organizational systems. It is of utmost importance to elicit hidden information using analytical models and integrate it into the mainstream of the data warehouse for consumption by business users. So the Data Warehouse of the Future should be able to manage analytical capabilities as part of routine workloads

4. The next generation of data warehouses needs to be as near real time as possible. The Data Warehouse of the Future should be able to ingest large amounts of streaming data in a scalable manner
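The micro-batch style of ingestion used by streaming engines such as Spark Streaming can be sketched in plain Python. The batch size and the (branch, amount) events below are purely illustrative assumptions:

```python
from collections import defaultdict

def micro_batches(stream, batch_size):
    # Group an unbounded event stream into small batches, the way
    # micro-batch engines ingest near-real-time data.
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Hypothetical stream of (branch, txn_amount) events.
stream = [("BLR", 100), ("MUM", 250), ("BLR", 50), ("MUM", 75), ("BLR", 25)]

# Fold each small batch into running totals as it arrives, so the
# aggregate view stays near real time instead of waiting for day-end.
totals = defaultdict(int)
for batch in micro_batches(iter(stream), batch_size=2):
    for branch, amount in batch:
        totals[branch] += amount

print(dict(totals))
```

Because each batch is small and processed incrementally, the same loop keeps working as volumes grow; scaling it out is then the streaming engine's job.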

5. The Data Warehouse of the Future needs to be flexible and agile enough to ingest new data very quickly and easily. Traditional data warehouses have been extremely rigid, and the incorporation of new data elements becomes a project in itself

Traditional technologies have been able to address some, but not all, of the above-stated limitations. So in order to make the data warehouse concept a successful proposition in an organization, it is imperative to scout for newer technologies and look for alternatives to the traditional ones. One such technology which stands out is Hadoop.

Hadoop is essentially an ecosystem for data analysis with many technologies under its umbrella. You have Sqoop and Flume to ingest structured and unstructured data respectively, and Pig and Spark to process it in batch jobs. You also have Kafka and Spark Streaming to handle real-time data, and then there is a foray into interactive query capabilities, such as Impala, for querying large data sets. Powering all of this is HDFS, a distributed and persistent file system, which has given rise to the term ‘Data Lake’: a place where organizations can store data in a very cost-effective manner and use the aforementioned technologies for various tasks and purposes.

Having stated the above, is Hadoop the answer for every requirement? My answer to that is a clear no. Traditional database technologies have existed and matured over a span of 30+ years and hence provide very mature, stable and functionally rich SQL and programming interfaces to work with the data. Further, their data governance and security capabilities are highly advanced as compared to Hadoop. Additionally, the pool of skilled resources around these technologies is immensely large. Hadoop as a technology is still maturing in terms of its SQL prowess, and the required skill sets are yet to become mainstream.

So how does an organization address the data warehouse problem in the most optimal manner? I believe the answer lies in following a ‘horses for courses’ approach. An organization needs to maximize the pros of each technology to the best possible extent: use traditional RDBMS technologies for analysis on recent/current data, and leverage the power of the Big Data platform to run large jobs on historical data while also providing a low-cost alternative for large-scale storage. Therefore I see a hybrid structure as the most optimized solution to this problem.
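The ‘horses for courses’ routing can be sketched as a simple tiering rule. The 90-day hot window, the function name, and the fixed reference date below are my assumptions for illustration, not a prescription:

```python
from datetime import date, timedelta

# Assumption: "recent" means the last 90 days; older data lives on Hadoop.
HOT_WINDOW_DAYS = 90

def route_query(query_date, today=date(2024, 1, 1)):
    # Route to the RDBMS tier for recent data, where mature SQL and
    # governance shine; route to the Hadoop tier for cheap, large-scale
    # historical processing.
    if today - query_date <= timedelta(days=HOT_WINDOW_DAYS):
        return "rdbms"
    return "hadoop"

print(route_query(date(2023, 12, 15)))  # recent -> rdbms
print(route_query(date(2021, 6, 1)))    # historical -> hadoop
```

In practice the cutoff would be driven by workload and cost analysis rather than a fixed constant, but the hybrid principle is the same: each tier serves the queries it is best at.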

This is indeed a space which has the eye of most technology pundits across the globe. While the active Hadoop community focuses its research and development on quickly providing and maturing SQL capabilities, thereby giving the Big Data platform an RDBMS kind of flavour, traditional database technologies are leveraging Big Data concepts to provide parallel and distributed processing. Whether there will be a clear winner or an alliance between the two, only time will tell.