As more and more organizations move towards adopting Big Data Platforms, the need for effectively managing and maintaining the quality of the data becomes more evident. These organizations are typically either:

· migrating from legacy systems to the newer systems, or

· retaining their large legacy systems and adding the newer Platforms as an overlay to be able to perform large-scale operations

In either case, data gets moved from legacy systems into the newer Platforms. Most commonly, data from multiple sources lands in these newer systems.

Data Quality Dimensions

This raises a need for maintaining the quality of the data on the Platform, which in turn creates a need for metrics to measure Data Quality. The first step is to define criteria and dimensions for Data Quality. Here are a few generally known key dimensions that go into defining Data Quality:

· Data Consistency – how consistent the data is across the different systems that hold copies of it
· Data Accuracy – how correct and accurate the data is
· Data Completeness – whether any parts of the data got lost in transmission
· Data Conformity or syntactic accuracy – whether the data conforms to data syntax & format rules
· Data Integrity – whether all the data parts add up correctly to a whole and have the right linkages & relationships intact
· Data Uniqueness – whether there is any spurious or duplicate data in the data set, such as duplicate rows, duplicated entities and so on
· Data Timeliness – whether the data is stored in a format and manner that allows it to be retrieved in a timely manner, and whether all of the above qualities hold at that given instant of time

The last is not really an inherent dimension of Data Quality but rather an overarching requirement that demands that the other six dimensions hold at any given time.

Measuring Data Quality

The dimensions above are generally well understood across the industry. No rocket science there. The important question raised by the scenario described at the beginning of this article is how one measures the quality of the data as it moves from legacy systems to the Big Data Platforms. Take Data Completeness, for example: was the data complete in the legacy system itself, did it get truncated or mangled on its way into the Big Data Platform, or did it get truncated after landing on the Big Data Platform?

This raises a need for measuring Data Quality:

· Of the data flowing into a system
· During storage over time, to ensure that the quality does not deteriorate
· Before & after, if there is a process for cleaning/enriching the data

Measuring the quality of the data before & after it passes through a Data Cleansing process or system indicates the effectiveness of that Data Cleansing system, or lack thereof.
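
As a rough illustration of the before & after idea, here is a minimal Python sketch that computes two simple metrics on a data set before and after cleansing and reports the difference. The metric definitions, the field names and the sample rows are all assumptions made up for this example; a real system would define its metrics per data set.

```python
def completeness(rows, required_fields):
    """Fraction of required field slots that are actually filled in."""
    total = len(rows) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for row in rows
        for field in required_fields
        if row.get(field) not in (None, "")
    )
    return filled / total


def uniqueness(rows, key_fields):
    """Fraction of rows that are distinct on the given key fields."""
    if not rows:
        return 1.0
    keys = {tuple(row.get(f) for f in key_fields) for row in rows}
    return len(keys) / len(rows)


def quality_report(rows, required_fields, key_fields):
    """Bundle the metrics so that two snapshots can be compared."""
    return {
        "completeness": completeness(rows, required_fields),
        "uniqueness": uniqueness(rows, key_fields),
    }


def cleansing_effect(before, after):
    """Positive numbers mean the cleansing step improved that metric."""
    return {name: after[name] - before[name] for name in before}


# Hypothetical sample data: the same records before and after cleansing.
raw = [
    {"id": 1, "name": "Acme", "country": "US"},
    {"id": 1, "name": "Acme", "country": "US"},     # duplicate row
    {"id": 2, "name": "", "country": "DE"},         # missing name
    {"id": 3, "name": "Initech", "country": None},  # missing country
]
cleansed = [
    {"id": 1, "name": "Acme", "country": "US"},
    {"id": 2, "name": "Globex", "country": "DE"},
    {"id": 3, "name": "Initech", "country": None},  # still missing
]

before = quality_report(raw, ["name", "country"], ["id"])
after = quality_report(cleansed, ["name", "country"], ["id"])
effect = cleansing_effect(before, after)
print(before, after, effect)
```

In this made-up sample both metrics start at 0.75; the cleansing step removes the duplicate and fills in one missing name, so uniqueness rises to 1.0 while completeness improves only partially. A metric that does not move, or moves the wrong way, points at a cleansing step that is not doing its job.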

Writing a Data Quality measurement mechanism for something like Data Conformity is relatively easy, since it relies on syntactic rules. One could count the number of rule violations and derive an index from that count. Similar rules could be written for Data Completeness, Data Integrity and Data Uniqueness.
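
A minimal sketch of such a rule-based index in Python might look like the following. The three rules, the field names and the sample rows are hypothetical, and the index used here (the share of rule checks that pass) is just one of many ways the violations could be turned into a number.

```python
import re
from datetime import datetime


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def matches(pattern):
    """Build a rule that checks a string value against a regular expression."""
    compiled = re.compile(pattern)
    return lambda v: isinstance(v, str) and compiled.fullmatch(v) is not None


# Hypothetical syntax & format rules, one per field.
RULES = {
    "email": matches(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "signup_date": is_iso_date,
    "zip_code": matches(r"\d{5}"),
}


def conformity_index(rows, rules):
    """Return (index, violations per field); an index of 1.0 means no violations."""
    checks = 0
    violations = {field: 0 for field in rules}
    for row in rows:
        for field, rule in rules.items():
            checks += 1
            if not rule(row.get(field)):
                violations[field] += 1
    if checks == 0:
        return 1.0, violations
    return 1 - sum(violations.values()) / checks, violations


rows = [
    {"email": "a@example.com", "signup_date": "2015-07-24", "zip_code": "94105"},
    {"email": "not-an-email", "signup_date": "24/07/2015", "zip_code": "94105"},
    {"email": "b@example.com", "signup_date": "2015-02-30", "zip_code": "9410"},
]

index, violations = conformity_index(rows, RULES)
print(index, violations)
```

Here 4 of the 9 checks fail (one email, two dates, one zip code), which gives an index of 5/9. Keeping the per-field violation counts alongside the index shows where the problems are concentrated, not just how many there are.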

Data Consistency measurement is relatively harder, especially when one looks at it from the Data Timeliness angle. It would need a system that can take a simultaneous, atomic peek at both the source & destination systems to compare them. One could have a light-weight hashing system that gives a yes or no answer. But if only one bit is amiss, you wouldn't want to throw away the whole data instance you just received because the hash was wrong. This requires a more intelligent system that trades off the overhead of hashing smaller blocks of data against the benefit of smaller retransmissions. One could make it adaptive, so that the hashed block size is larger when the quality is better and smaller when it is worse.
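
To make the block-hashing trade-off concrete, here is a small Python sketch. It hashes the data in fixed-size blocks so that only the mismatching blocks need to be retransmitted, and it adjusts the block size based on how many blocks mismatched. The thresholds, the size limits and the halving/doubling policy are assumptions picked for illustration, and the sketch does not attempt to solve the harder problem of getting an atomic view of both systems.

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int):
    """SHA-256 digest of each consecutive block of chunk_size bytes."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def mismatched_chunks(source: bytes, destination: bytes, chunk_size: int):
    """Indexes of blocks whose digests differ, or that exist on one side only."""
    src = chunk_digests(source, chunk_size)
    dst = chunk_digests(destination, chunk_size)
    longest = max(len(src), len(dst))
    return [
        i for i in range(longest)
        if i >= len(src) or i >= len(dst) or src[i] != dst[i]
    ]


def next_chunk_size(chunk_size, mismatch_ratio,
                    low=0.01, high=0.10, min_size=64, max_size=1 << 20):
    """Assumed policy: halve the block on poor quality, double it on good quality."""
    if mismatch_ratio > high:
        return max(min_size, chunk_size // 2)
    if mismatch_ratio < low:
        return min(max_size, chunk_size * 2)
    return chunk_size


# Made-up example: 4096 bytes with a single bit flipped at byte 1000.
source = bytes(range(256)) * 16
corrupted = bytearray(source)
corrupted[1000] ^= 0x01
destination = bytes(corrupted)

chunk_size = 256
bad = mismatched_chunks(source, destination, chunk_size)
ratio = bad and len(bad) / len(chunk_digests(source, chunk_size)) or 0.0
print(bad, ratio, next_chunk_size(chunk_size, ratio))
```

With 256-byte blocks, the flipped bit shows up as a single bad block (index 3) out of 16, so only 256 bytes would need to be resent rather than all 4096. In a real deployment each side would compute its digests locally and only the digests would travel between the systems.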

Data Accuracy, Data Integrity and Data Uniqueness have some additional complexity to them. Some heuristics will have to be applied on top of the rules. Some could be as simple as a spell-check for text fields. But then again, a correctly spelled word may still be inaccurate in its context, and a misspelled one might be a valid entry. For numeric data, some heuristic rules will have to be set up front and some will have to be deduced from the data itself. This is a potential application of Data Science/Machine Learning: applying it to the data to find the most common ranges. It would need some intervention to make sure the system is not learning from a highly corrupt data-set.
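
As one very simple way of deducing a plausible range from the data itself, the Python sketch below uses the interquartile range (IQR) to set lower and upper fences and flags values outside them. The IQR fence is a choice made for this example rather than a prescribed method, and the sample readings are made up. Quartiles are used because a handful of corrupt values barely moves them, but this does not help if a large share of the data is corrupt, which is where the manual intervention mentioned above comes in.

```python
import statistics


def learn_range(values, k=1.5):
    """Deduce a plausible (low, high) range using interquartile-range fences."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_out_of_range(values, low, high):
    """Values outside the learned range, to be reviewed rather than auto-deleted."""
    return [v for v in values if not (low <= v <= high)]


# Hypothetical numeric field with one corrupt entry.
readings = [21, 22, 22, 23, 23, 24, 24, 25, 25, 26, 250]

low, high = learn_range(readings)
suspicious = flag_out_of_range(readings, low, high)
print(low, high, suspicious)
```

For these readings the learned range comes out as 17.5 to 29.5, so only the value 250 is flagged. Flagged values are best treated as candidates for review, since an unusual value can still be a valid entry.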

All in all, it is possible to design and put into effect a Data Quality measurement system to get a handle on the quality of large data that moves around. An organization would have to make a strategic decision to invest in it. The benefits and savings of having accurate data far outweigh the disasters, costs & losses that could result from basing your business on inaccurate data – even worse, when there are

Large organizations that have been around for 10, 20 years or more naturally have massive amounts of data spread across their many sub-organizations, departments and/or business units: for example, sales data, customer ticket/case data, finance data, product data, employee data, project data, customer satisfaction survey data and more. The larger and older the company, the more massive, distributed, segmented and siloed the data tends to be, primarily because the data was captured and grew as a by-product of the main business and was treated as an after-thought. Often, quick delivery cycles and the need to react resulted in each department/business unit quickly putting together its own data strategy and, as a result, its own IT strategy.

The growth of Data to Big Data

With the growth of business and data, each department has in recent years started feeling the need for Analytics and, as a result, for technology that supports analyzing the data in a reasonable amount of time. What has resulted is a complex eco-system of siloed data sets, with each department putting together a siloed Big Data & Analytics strategy. Soon the departments realize their inter-dependencies with other departments and request data feeds from those departments into their own platforms.

The dawn of UDLs & Data Lakes

A Data Lake

With each inter-related department getting similar requests for data feeds from the others, some of them have started realizing that there is value to be had in storing data on a common platform, or at least in having a common reconciled view of the data-sets. For example, those that maintain customer contracts data, those that maintain customer support data and those that maintain sales data (to name a few) have inter-dependent and even overlapping data that needs to be reconciled & normalized. Out of this process was born the concept of UDLs, aka Unified Data Layers, and the more recent concept of Data Lakes. Yes, the use of the plural is deliberate, as many such so-called UDLs or instances thereof tend to proliferate within companies. A Data Lake is a more amorphous version of a UDL, as it pretty much serves as a place to dump data into and pull data out of. Some have tended to add structure to it, calling it a Data Reservoir. But in essence it still remains unmanaged and unthrottled. A UDL is a more structured, managed version of the shared data.

Sharing data from the UDLs

When other departments that depend on some of the fields in the data sets in the so-called UDLs, or the applications that use those fields, find out about the UDLs, their instant reaction is to ask for Data APIs from the UDLs. The UDLs are supposedly housed on Platforms with massive data storage capability, aka Big Data capability. In the older days there used to be a rush to Data Warehouses, but data users & API requesters often realized that getting anything out of an EDW is like standing in line on the 1st day of opening of the Freedom Tower or the Statue of Liberty. Hence the newer choice for the UDLs became Hadoop/MapReduce and, in even newer days, Apache Spark, due to their massively parallel processing ability and hence the ability to serve many Data API reads instantaneously.

A pure UDL is like a library of books

The thinking stems from the fact that traditional organizations can bring themselves to have a common master data repository or a single source of truth and treat it like a library of books, from which you can check data out and back in at will, almost instantly. The library takes on the task of appropriately organizing, labeling and de-duplicating the books and fixing any labeling or categorizing errors. But it will not help you find the trends or the precise nuggets of information hidden within the books.

Cost of a pure UDL

The localized trend of having UDLs with Data APIs, from which different departments, applications or users draw raw data and then analyze & process it on their own respective platforms, is like 10 different students checking out 90% of the same reference books from the library, taking notes and writing synopses that probably overlap by 90%, and then an 11th student doing it all over again, since there are no forums for finding and sharing insights.

In the case of technology, it is worse. It involves replicating, 10 times over:

1. The cost of infrastructure to replicate, store & process the data
2. The data science time to analyze and build models
3. The engineering time to build software that automates the analysis
4. Separate licenses for any tools that aid in the visualization/analysis
5. The time to replicate, process and analyze the data

When each department has its own Big Data &/or Analytics Platform but also has a UDL that it dumps data into and maintains as a single source of truth, it saves on the costs of data cleansing, reconciliation, normalization and enrichment. These are definitely non-trivial costs.

But the platform behind a UDL can do so much more than just store and offer up data, like a library that offers up books. The departments are comfortable sharing data in the UDL (sometimes forcibly so, or more out of necessity), but they are not comfortable with sharing a platform that can deduce the insights from the books in the library. It is very surprising to see the resultant under-utilization of powerful technologies such as HDFS, Spark & Elasticsearch in the platforms that power these UDLs.

The pure UDLs miss out on the savings these technologies can provide, in the form of a shared processing and Analytics platform that can:

· Share the parallel processing technology
· Make use of off-peak times of some applications
· Save processing time and cost by sharing insights
· Save engineering & data science time by not having to duplicate analysis and development

A shared Big Data (UDL) & Analytics Platform

What’s needed is for the company to have a centralized yet federated strategy for Big Data & Analytics, wherein a UDL is not stand-alone but rather a sub-component of the shared Big Data & Analytics platform. The characteristics of such a platform should be:

· A massive, elastic shared platform that can grow & scale as needed

· A single instance of each of the different data-sets

· All data sets cleansed, reconciled and normalized centrally

· Data access privileges enforced through a federated Data Governance policy (more on this in an upcoming blog post)

· Shared infrastructure for analyzing & processing the data in a federated manner

· Analyzed information and Analytics insights filed away like reference material in a library and shared under the federated Data Governance policy

This could save companies millions, potentially even billions, of dollars and yield faster, more accurate insights, projections and business outcomes.

Big Data & Analytics

Friday, July 24, 2015

Data Quality Measurement – a Data Science application