Getting Fast ROI With Your Big Data Projects

by Vineet Tyagi, CTO, Impetus

The world today is heavily networked and social; as a society, we are living out our digital lives and generating data at volumes never seen before. Enterprise CIOs and CTOs face several unique challenges. First, they must deal with growing amounts of data generated at an ever-increasing rate; much of this data is unstructured and cannot be stored or analyzed by RDBMS engines. Second, generating deeper and more valuable insights requires integrating multiple internal silos of information with social data, something current data warehouses are not geared up for. Third, existing warehouses are being asked to support Data Science algorithms for predictive analytics, again something they are not designed to do. Last, and most importantly, with an exponential rise in the data assets being managed, the cost of storing and analyzing data in the current data warehouse is spiraling out of control. Enterprise CIOs and CTOs are looking for ways to reduce costs while shortening time to insight through faster processing and analytics.

A Hadoop Warehouse architecture is in essence a massive, easily accessible, flexible and scalable data repository built on inexpensive commodity hardware. Hadoop can store uncategorized pools of data "as is", including data immediately of interest, data potentially of interest and data for which the intended usage is not yet known. The Hadoop Warehouse brings newer capabilities and insights to business users, which include, but are not limited to, the following:

Active Archive: One place to store all your data, in any format, at any volume, for as long as you like, allowing you to address compliance requirements and deliver data on demand to satisfy internal and external regulatory requests.
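The "store as is" idea behind an active archive is schema-on-read: records are kept unchanged at ingest, and a schema is applied only when a question is asked. A minimal sketch, with a local file object standing in for HDFS and hypothetical event records:

```python
import io
import json

# Hypothetical sketch: raw events are archived "as is" (schema-on-read),
# with an in-memory file standing in for HDFS storage.
archive = io.StringIO()

raw_events = [
    '{"user": "a17", "action": "login", "ts": "2014-06-01T10:00:00"}',
    '{"user": "b42", "action": "export", "ts": "2014-06-02T11:30:00"}',
    '{"user": "a17", "action": "export", "ts": "2014-06-03T09:15:00"}',
]

# Ingest: store every record unchanged -- no upfront schema decision.
for line in raw_events:
    archive.write(line + "\n")

# Much later, a regulatory request arrives: apply a schema at read time
# and pull only the records of interest.
archive.seek(0)
exports = [json.loads(line) for line in archive
           if '"export"' in line]

print([e["user"] for e in exports])  # users who exported data
```

Because nothing is discarded or reshaped at ingest, data whose "intended usage is not yet known" can still answer questions formulated years later.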

Transformation and Processing: ETL workloads that previously had to run on expensive systems can migrate to the Hadoop Warehouse, where they run at very low cost, in parallel, much faster than before. Optimizing the placement of these workloads and the data on which they operate frees capacity on high-end analytic and data warehouse systems, making them more valuable by allowing them to concentrate on the business-critical OLAP and other applications that they run.
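The parallelism that makes these migrated ETL jobs cheap comes from the MapReduce model: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group, with every phase spread across cluster nodes. A single-process sketch of that flow, using illustrative sales rows:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sketch of the MapReduce model behind a Hadoop ETL job.
# On a real cluster, map and reduce tasks run in parallel across nodes;
# here the three phases run sequentially to show the data flow.

raw_rows = ["east,120", "west,75", "east,30", "north,50", "west,25"]

def map_phase(row):
    """Emit a (key, value) pair per input record."""
    region, amount = row.split(",")
    yield region, int(amount)

# Map: apply to every input record.
mapped = [pair for row in raw_rows for pair in map_phase(row)]

# Shuffle: bring all values for the same key together.
mapped.sort(key=itemgetter(0))
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# Reduce: aggregate each key's values.
totals = {region: sum(values) for region, values in grouped.items()}

print(totals)  # {'east': 150, 'north': 50, 'west': 100}
```

Because each map task touches only its own input split and each reduce task only its own keys, adding nodes shortens the job almost linearly, which is why bulk transformations are a natural first workload to offload from the warehouse.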

Self-Service Exploratory BI: Users frequently want access to enterprise data for reporting, exploration, and analysis. Production enterprise data warehouse systems must often be protected from casual use so they can run the mission-critical financial and operational workloads they support. A Hadoop Warehouse, on the contrary, allows users to explore data with full security, and most traditional interactive business intelligence tools support Hadoop connectivity today.

Advanced Analytics: Multiple computing frameworks enable analytics, search, machine learning and more, unlocking value in both new and old data sources.

Several Fortune 1000 companies are adopting a Hadoop-powered Big Data warehouse as a replacement for long-successful SQL-based data warehouse solutions, to lower the cost of data retention and produce better and faster BI. These companies follow a four-stage strategy for building the Hadoop Warehouse.

Stage 1: Handle and Ingest Data at Scale. As the first stage, the organization determines the existing and new data sources it can leverage. These data sources are integrated, and the variety of voluminous data is ingested at high velocity into Hadoop storage.

Stage 2: Build the Analytical Muscle. Leveraging the data stored in the Hadoop Warehouse, the organization builds batch, mini-batch and real-time applications for enterprise usage, exploratory analytics and predictive use cases. Various tools and frameworks are utilized in this stage.

Stage 3: Enterprise Data Warehouse and Hadoop Warehouse Work in Unison. In a real-world scenario, the enterprise data warehouse (EDW) and the Hadoop-based warehouse co-exist. The organization creates a strategy to leverage data and specific capabilities from each to its advantage.

Stage 4: Enterprise Capability in the Hadoop Warehouse. Broad adoption of, and confidence in, Hadoop leads to the retirement of existing warehouses, with Hadoop becoming the platform of choice, complete with full information governance, metadata management and information lifecycle management capabilities.

Lastly, from my experience, the top three tactics for immediate ROI are to use a Hadoop Warehouse for:

1. Pre-Processing: Use Hadoop as a "landing zone" before determining what data should be moved to the data warehouse, and persist the data to provide a queryable archive of cold data.

2. Offloading: Move infrequently accessed data from data warehouses into enterprise-grade Hadoop and also move associated workloads to be serviced from Hadoop. Additionally, leverage Hadoop’s large-scale batch processing efficiencies to preprocess and transform data for the warehouse.

3. Exploration: Use Hadoop to explore and discover new high-value data from massive amounts of raw data, freeing up the data warehouse for more structured, deep analytics. Hadoop can also be used as an environment for ad hoc data discovery.
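The offloading tactic above ultimately comes down to a partitioning decision: which data stays hot in the EDW and which moves to Hadoop. A minimal sketch, assuming last-access date as the criterion; the table names and the 90-day threshold are purely illustrative:

```python
from datetime import date, timedelta

# Hypothetical sketch of the offloading decision: partition warehouse
# tables into "hot" data that stays in the EDW and "cold" data that
# moves to Hadoop, using last-access date as the cutoff.

today = date(2014, 7, 1)           # evaluation date (illustrative)
cutoff = today - timedelta(days=90)  # 90-day retention in the EDW

tables = [
    {"name": "orders_2014_q2",  "last_accessed": date(2014, 6, 28)},
    {"name": "orders_2013",     "last_accessed": date(2014, 1, 15)},
    {"name": "clickstream_raw", "last_accessed": date(2013, 11, 2)},
]

keep_in_edw = [t["name"] for t in tables if t["last_accessed"] >= cutoff]
offload_to_hadoop = [t["name"] for t in tables if t["last_accessed"] < cutoff]

print(keep_in_edw)        # ['orders_2014_q2']
print(offload_to_hadoop)  # ['orders_2013', 'clickstream_raw']
```

In practice the cutoff is a business decision balancing EDW storage cost against query latency on Hadoop; because the offloaded data remains queryable, the threshold can be tightened over time as confidence in the Hadoop Warehouse grows.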