Monday, July 20, 2015

Data Lakes, UDLs vs. Analytics Platforms


Large organizations that have been around for 10, 20 years or more naturally have massive amounts of data spread across their sub-organizations, departments and business units: sales data, customer ticket/case data, finance data, product data, employee data, project data, customer satisfaction survey data and more. The larger and older the company, the more massive, distributed, segmented and siloed the data tends to be, primarily because data was captured and grew as a by-product of the main business and was treated as an afterthought. Often, quick delivery cycles and the need to react resulted in each department or business unit quickly putting together its own data strategy and, as a result, its own IT strategy.

The growth of Data to Big Data

With the growth of business and data, each department has in recent years started feeling the need for Analytics and, as a result, for technology that can analyze the data in a reasonable amount of time. The result is a complex eco-system of siloed data sets, with each department putting together its own siloed Big Data & Analytics strategy. Soon the departments realize their inter-dependencies and request data feeds from each other into their own platforms.

The dawn of UDLs & Data Lakes

[Image: A Data Lake]
With each inter-related department getting similar requests for data feeds from the others, some have started realizing that there is value in storing data on a common platform, or at least in having a common reconciled view of the data sets. For example, the departments that maintain customer contracts data, customer support data and sales data (to name a few) have inter-dependent and even overlapping data that needs to be reconciled & normalized. Out of this was born the concept of UDLs, aka Unified Data Layers, and the more recent concept of Data Lakes. Yes, the use of the plural is deliberate, as many such so-called UDLs, or instances thereof, tend to proliferate within companies. A Data Lake is a more amorphous version of a UDL, as it pretty much serves as a place to dump data and pull it back out. Some have added structure to it and called it a Data Reservoir, but in essence it still remains unmanaged and unthrottled. A UDL is a more structured, managed version of the shared data.
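To make "reconciled & normalized" a little more concrete, here is a minimal Python sketch of merging overlapping customer records from the contracts, support and sales feeds into one view. Everything in it is an assumption made for illustration: the field names, the sample values, the name-based matching rule and the `normalize_name` and `reconcile` helpers are hypothetical, and a real UDL would need far more robust matching.

```python
def normalize_name(name):
    """Lowercase, trim and collapse whitespace so spelling variants match."""
    return " ".join(name.lower().split())


def reconcile(*sources):
    """Merge departmental feeds into one record per customer.

    Each source is a (source_name, records) pair and each record is a dict
    with a 'customer' field (a hypothetical layout). The first value seen
    for a field wins; disagreements are collected as conflicts for review.
    """
    unified = {}
    conflicts = []
    for source_name, records in sources:
        for record in records:
            key = normalize_name(record["customer"])
            merged = unified.setdefault(key, {"customer": key, "sources": []})
            if source_name not in merged["sources"]:
                merged["sources"].append(source_name)
            for field, value in record.items():
                if field == "customer":
                    continue
                if field in merged and merged[field] != value:
                    conflicts.append((key, field, merged[field], value))
                else:
                    merged[field] = value
    return unified, conflicts


# Made-up sample feeds from three departments
contracts = [{"customer": "Acme Corp", "contract_end": "2016-01-31", "region": "EMEA"}]
support = [{"customer": " acme  corp ", "open_cases": 3, "region": "Europe"}]
sales = [{"customer": "ACME CORP", "pipeline_usd": 120000}]

unified, conflicts = reconcile(
    ("contracts", contracts), ("support", support), ("sales", sales)
)
print(unified["acme corp"])
print(conflicts)
```

The three spellings of the customer collapse into one record, and the disagreement over `region` is surfaced as a conflict rather than silently overwritten, which is the kind of work a managed UDL takes on and an unmanaged Data Lake does not.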

Sharing data from the UDLs


When other departments that depend on some of the fields in the data sets in the so-called UDLs, or the applications that use those fields, find out about the UDLs, their instant reaction is to ask for Data APIs on top of them. The UDLs are supposedly housed on platforms with massive data storage capability, aka Big Data capability. In older days there was a rush to Data Warehouses, but data users & API requesters often realized that getting anything out of an EDW (Enterprise Data Warehouse) is like standing in line on the opening day of the Freedom Tower or the Statue of Liberty. Hence the newer choice for the UDLs became Hadoop/MapReduce and, more recently, Apache Spark, due to their massively parallel processing ability and hence their ability to serve many Data API reads at the same time.

A pure UDL is like a library of books

The thinking stems from the idea that traditional organizations can bring themselves to have a common master data repository, or single source of truth, and treat it like a library of books from which data can be checked out and checked in at will, almost instantly. The library takes on the task of organizing, labeling and de-duplicating the books and fixing any labeling or categorization errors. But it will not help you find the trends or the precise nuggets of information hidden within the books.

Cost of a pure UDL

The localized trend of having UDLs with Data APIs, so that different departments, applications or users can draw raw data from the UDLs and then process and analyze it on their own platforms, is like 10 different students checking out 90% of the same reference books from the library, taking notes and writing synopses that probably overlap by 90%, and then an 11th student doing it all over again because there is no shared forum for finding and sharing insights.

In the case of technology it is worse. It involves replicating, 10 times over:

1.     The cost of infrastructure to replicate, store & process the data
2.     The data science time to analyze and build models
3.     The engineering time to build software that automates the analysis
4.     Separate licenses for any tools that aid in the visualization/analysis
5.     The time to replicate, process and analyze the data
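The student analogy can be put into rough numbers. The sketch below assumes that every consumer pulls a data set of the same size and that the stated 90% is common to all of them, which is a simplification; the `duplication_factor` helper is hypothetical and only illustrates the arithmetic.

```python
def duplication_factor(consumers, shared_fraction):
    """Ratio of total data copied to distinct data that actually exists.

    Assumes each consumer holds one unit of data, of which `shared_fraction`
    is identical across all consumers and the remainder is unique to it.
    """
    total_copied = consumers * 1.0
    distinct = shared_fraction + consumers * (1.0 - shared_fraction)
    return total_copied / distinct


# 10 consumers whose data sets overlap by 90%
print(round(duplication_factor(10, 0.9), 2))
```

Under these assumptions, 10 units are copied and processed while only about 1.9 units of distinct data exist, so a little over five times as much data is stored and analyzed as a shared platform would need.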

When each department has its own Big Data and/or Analytics platform but dumps data into a UDL that is maintained as a single source of truth, it saves on the costs of data cleansing, reconciliation, normalization and enrichment. This is definitely a non-trivial cost.

But the platform behind a UDL can do so much more than just store and offer up data, the way a library offers up books. The departments are comfortable sharing data in the UDL (sometimes forcibly so, or out of necessity), but they are not comfortable sharing a platform that can deduce insights from the books in the library. It is very surprising to see the resulting underutilization of powerful technologies such as HDFS, Spark & Elasticsearch in the platforms that power these UDLs.

The pure UDLs miss out on the savings these technologies can provide, in the form of a shared processing and Analytics platform that can:

  1. Share the parallel processing technology
  2. Make use of off-peak times of some applications
  3. Save processing time and cost by sharing insights
  4. Save engineering & data science time by not having to duplicate analysis and development
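The saving from sharing insights can be sketched in a few lines of Python: an analysis is computed once on the shared platform and every later requester reuses the stored result. The `SharedInsightStore` class, the dataset and analysis names and the sample figures are all hypothetical; a real platform would also need invalidation when the underlying data changes.

```python
class SharedInsightStore:
    """Keeps one result per (dataset, analysis) pair for all departments."""

    def __init__(self):
        self._results = {}
        self.computations = 0  # how many times an analysis actually ran

    def get_or_compute(self, dataset, analysis, compute):
        """Return the stored insight, running `compute` only on first request."""
        key = (dataset, analysis)
        if key not in self._results:
            self._results[key] = compute()
            self.computations += 1
        return self._results[key]


# Made-up monthly sales figures held once on the shared platform
monthly_sales = [120, 80, 100]
store = SharedInsightStore()

# Eleven departments ask for the same insight
for department in range(11):
    average = store.get_or_compute(
        "sales", "monthly_average",
        lambda: sum(monthly_sales) / len(monthly_sales),
    )

print(average, store.computations)
```

Eleven requests result in a single computation, which is the opposite of the 11th student starting over from the raw books.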

A shared Big Data (UDL) & Analytics Platform

What’s needed is for the company to have a centralized yet federated strategy for Big Data & Analytics, wherein a UDL is not a stand-alone system but rather a sub-component of the shared Big Data & Analytics platform. The characteristics of such a platform should be:

  • A massive, elastic shared platform that can grow & scale as needed
  • A single instance of each data set
  • Central cleansing, reconciliation and normalization of all data sets
  • Data access privileges enforced through a federated Data Governance policy (more on this in an upcoming blog post)
  • Shared infrastructure for analyzing & processing the data in a federated manner
  • Analyzed information and Analytics insights filed away like reference material in a library and shared under the federated Data Governance policy
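This post does not spell out the federated Data Governance policy, so the following Python sketch is only one possible reading of it: each data set has an owning department that decides which fields other departments may read, while the shared platform enforces those decisions centrally. The `FederatedGovernance` class, its methods and the sample record are all hypothetical.

```python
class FederatedGovernance:
    """Central enforcement of access rules that each owning department sets."""

    def __init__(self):
        self._owners = {}  # dataset -> owning department
        self._grants = {}  # (dataset, department) -> set of readable fields

    def register(self, dataset, owner):
        """Record which department owns a data set."""
        self._owners[dataset] = owner

    def grant(self, dataset, granted_by, department, fields):
        """Only the owning department may grant access to its data set."""
        if self._owners.get(dataset) != granted_by:
            raise PermissionError(f"{granted_by} does not own {dataset}")
        self._grants.setdefault((dataset, department), set()).update(fields)

    def read(self, dataset, department, record):
        """Return only the fields this department is allowed to see."""
        if self._owners.get(dataset) == department:
            return dict(record)
        allowed = self._grants.get((dataset, department), set())
        return {k: v for k, v in record.items() if k in allowed}


governance = FederatedGovernance()
governance.register("sales", owner="sales")
governance.grant("sales", granted_by="sales", department="support",
                 fields={"customer", "region"})

record = {"customer": "Acme", "region": "EMEA", "pipeline_usd": 120000}
print(governance.read("sales", "support", record))
print(governance.read("sales", "finance", record))
```

In this reading, the data stays in a single shared instance while ownership of the access decisions stays with the departments, which is what makes the strategy centralized yet federated.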

This could save companies millions, potentially even billions, of dollars and yield faster, more accurate insights, projections and business outcomes.
