Large organizations that have been around for 10, 20 years or more tend to have massive amounts of data spread across their many sub-organizations, departments and business units: sales data, customer ticket/case data, finance data, product data, employee data, project data, customer satisfaction survey data and more. The larger and older the company, the more massive, distributed, segmented and siloed the data tends to be. This is primarily because the data was captured and grew as a by-product of the main business, and so became an afterthought. Often, quick delivery cycles and the need to react have resulted in each department or business unit hastily putting together a data strategy, and consequently an IT strategy, for its own department.
The growth of Data to Big Data
With the growth of business and data, each department has in recent years begun to feel the need for analytics, and hence for technology that supports analyzing its data in a reasonable amount of time. What has resulted is a complex ecosystem of siloed data sets, with each department putting together its own siloed Big Data & Analytics strategy. Soon each realizes its inter-dependencies with other departments and requests data feeds from them into its own platform.
The dawn of UDLs & Data Lakes
[Image: A Data Lake]
With each inter-related department getting similar requests for data feeds from the others, some of them have started realizing that there is value in storing data on a common platform, or at least in having a common reconciled view of the data sets. For example, those that maintain customer contract data, those that maintain customer support data and those that maintain sales data (to name a few) have inter-dependent and even overlapping data that needs to be reconciled and normalized. In the process, what was born is the concept of UDLs, aka Unified Data Layers, and the more recent concept of Data Lakes. Yes, the use of the plural is deliberate, as many such so-called UDLs, or instances thereof, tend to proliferate within companies. A Data Lake is a more amorphous version of a UDL, as it pretty much serves as a data dump and pull-out point. Some have tended to add structure to it, calling it a Data Reservoir, but in essence it remains unmanaged and unthrottled. A UDL, in contrast, is a more structured, managed version of the shared data.
Sharing data from the UDLs
When other departments that depend on some of the fields in the data sets in these so-called UDLs, or the applications that use them, find out about the UDLs, the instant reaction is to ask for Data APIs on top of them. The UDLs are typically housed on platforms with massive data storage capability, aka Big Data capability. The older days saw a rush to Data Warehouses, but data users and API requesters often realized that getting anything out of an EDW is like standing in line on the first day of the opening of the Freedom Tower or the Statue of Liberty. Hence the newer choice for UDLs became Hadoop/MapReduce, and more recently Apache Spark, due to their massively parallel processing ability and hence their ability to serve many Data API reads almost instantaneously.
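To make this concrete, here is a minimal sketch of a Data API read served off a Spark-backed UDL. The HDFS path, data set and column names are hypothetical illustrations, not any particular company's schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("udl-data-api").getOrCreate()

# Load the shared, reconciled data set once from the UDL
# (the path and columns below are hypothetical).
customers = spark.read.parquet("hdfs:///udl/customers")

def get_customer_contracts(region):
    """Serve one Data API read: contract records for a region.
    Spark parallelizes the scan and filter across the cluster."""
    return (customers
            .filter(F.col("region") == region)
            .select("customer_id", "contract_id", "contract_value")
            .collect())
```

Because the scan and filter run in parallel across the cluster, many such reads can be served concurrently, which is what makes Hadoop/Spark a better fit for this pattern than a queue-bound EDW.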
A pure UDL is like a library of books
The thinking stems from the fact that traditional organizations can bring themselves to maintain a common master data repository, a single source of truth, and treat it like a library of books from which you can check data in and out for use at will, almost instantly. The library takes on the task of appropriately organizing, labeling and de-duplicating the books, and fixing any labeling or categorization errors. But it will not help you find the trends or the precise nuggets of information that are hidden within the books.
Cost of a pure UDL
The localized trend of having UDLs, with Data APIs through which different departments, applications or users draw raw data from the UDLs and then analyze and process it on their own respective platforms, is like 10 different students checking out 90% of the same reference books from the library and taking notes and writing synopses that are probably 90% overlapping, and an 11th student then doing it all over again, since there are no forums for finding and sharing insights.
In the case of technology, it is worse. It involves replicating, 10 times over:
1. The cost of infrastructure to replicate, store & process the data
2. The data science time to analyze and build models
3. The engineering time to build software that automates the analysis
4. Separate licenses for any tools that aid in the visualization/analysis
5. The time to replicate, process and analyze the data
When each department has its own Big Data and/or Analytics platform, but dumps data into a UDL that it maintains as a single source of truth, it does save on the costs of data cleansing, reconciliation, normalization and enrichment, which are definitely non-trivial.
But a shared platform can do so much more than just store and offer up data the way a library offers up books. The departments are comfortable sharing data in the UDL (sometimes forcibly so, or more out of necessity), but they are not comfortable sharing a platform that can deduce the insights hidden in the library's books. The resulting under-utilization of powerful technologies such as HDFS, Spark and Elasticsearch in the platforms that power these UDLs is quite surprising.
The pure UDLs miss out on the savings these technologies can provide, in the form of a shared processing and analytics platform that can (a brief sketch of the insight-sharing point follows the list):
- Share the parallel processing technology
- Make use of off-peak times of some applications
- Save processing time and cost by sharing insights
- Save engineering and data science time by not having to duplicate analysis and development
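As a minimal, hypothetical sketch of the insight-sharing point: one department files a computed aggregate back into a shared area of the platform, and another reuses it instead of recomputing it from raw data. All paths and column names below are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shared-insights").getOrCreate()

# Department A computes churn rate by region once and files the insight
# away in a shared area of the platform (paths/columns are hypothetical).
tickets = spark.read.parquet("hdfs:///udl/support_tickets")
churn_by_region = (tickets
                   .groupBy("region")
                   .agg(F.avg("churn_flag").alias("churn_rate")))
churn_by_region.write.mode("overwrite").parquet("hdfs:///udl/insights/churn_by_region")

# Department B reuses the filed insight instead of re-reading and
# re-aggregating the raw ticket data on its own copy of the platform.
churn = spark.read.parquet("hdfs:///udl/insights/churn_by_region")
```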
A shared Big Data (UDL) & Analytics Platform
What's needed is for the company to have a centralized yet federated strategy for Big Data & Analytics, wherein a UDL is not standalone but rather a sub-component of a shared Big Data & Analytics platform. The characteristics of such a platform should be (a minimal sketch of the setup follows the list):
- A massive, elastic shared platform that can grow and scale as needed
- A single instance of all the different data sets
- All data sets cleansed, reconciled and normalized centrally
- Data access privileges enforced through a federated Data Governance policy (more on this in an upcoming blog post)
- Shared infrastructure for analyzing and processing the data in a federated manner
- Analyzed information and analytics insights filed away like reference material in a library and shared under the federated Data Governance policy
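As a minimal sketch of how this could look in practice (the catalog layout, view name and column choices are hypothetical assumptions, not a prescribed design):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-udl-platform").getOrCreate()

# A single cleansed, reconciled instance of each data set lives on the
# shared platform (the path below is hypothetical).
spark.read.parquet("hdfs:///udl/customers").createOrReplaceTempView("customers")

# Federated-governance sketch: each department gets a view exposing only
# the columns its policy allows, instead of its own copy of the data.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW customers_for_sales AS
    SELECT customer_id, region, contract_value
    FROM customers
""")

# The sales department analyzes in place on the shared infrastructure,
# with no replication onto a department-owned stack.
spark.sql("""
    SELECT region, SUM(contract_value) AS total_value
    FROM customers_for_sales
    GROUP BY region
""").show()
```

In a real deployment the view definitions and access grants would be driven by the governance policy itself, rather than created ad hoc in a session.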
This would save companies millions, potentially even billions, of dollars and yield faster, more accurate insights, projections and business outcomes.