As more and more organizations move towards adopting Big Data Platforms, the need for effectively managing and maintaining the quality of the Data becomes more evident. There are organizations that are either:
- migrating from legacy systems to the newer systems, or
- retaining their large legacy systems and adding the newer Platforms as an overlay to be able to perform large-scale operations
In any case, data gets moved around from legacy systems into the newer Platforms. Most commonly, data from multiple sources lands in these newer systems.
Data Quality Dimensions
This raises a need for maintaining the quality of the Data on the Platform, which in turn creates a need for metrics to measure Data Quality. The first step is defining criteria and dimensions for Data Quality. Here are a few generally known key dimensions that go into defining Data Quality:
- Data Consistency – how consistent is the data across the different systems that hold copies of it
- Data Accuracy – how correct and accurate is the data
- Data Completeness – did any parts of the data get lost in transmission?
- Data Conformity or syntactic accuracy – whether it conforms to Data syntax & format rules
- Data Integrity – whether all the data parts add up correctly to a whole and have the right linkages & relationships intact.
- Data Uniqueness – whether there is any spurious or duplicate data in the data set such as duplicate rows, duplicated entities and so on
- Data Timeliness – whether the data is stored in a form that allows it to be retrieved in a timely manner, with all of the above qualities validated at that given instant of time
The last is not really an inherent dimension of Data Quality but rather a superimposing framework that demands all six of the other dimensions at any given time.
Measuring Data Quality
Now, the above dimensions are generally well understood across the industry; no rocket science there. The important question raised by the scenario described at the beginning of this article is how one measures the quality of the data when it moves from legacy systems to the Big Data Platforms. Take Data Completeness, for example: was the Data complete in the legacy system itself, or did it get truncated or mangled while coming into the Big Data Platform, or after landing on it?
This raises a need for measuring Data Quality:
- of the Data flowing into a system
- during storage over time, to ensure that the quality does not deteriorate
- before & after, if there is a process for cleaning/enriching the data
Measuring the before & after quality of the Data when there is a Data Cleansing process or system will indicate the effectiveness of the Data Cleansing system, or lack thereof.
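As a rough illustration (a minimal sketch with hypothetical column names, using a simple completeness score as the stand-in quality metric), the before/after comparison is just the same measurement run on the raw and the cleansed data:

```python
import pandas as pd

def completeness_index(df: pd.DataFrame) -> float:
    """Fraction of cells that are populated -- one simple quality score."""
    return float(df.notna().values.mean())

def cleansing_effectiveness(raw: pd.DataFrame, cleansed: pd.DataFrame) -> float:
    """Positive delta means the cleansing step improved this quality dimension."""
    return completeness_index(cleansed) - completeness_index(raw)

# Hypothetical example: an enrichment step fills in missing cities.
raw = pd.DataFrame({"name": ["Ann", None, "Bo"], "city": [None, "Pune", None]})
cleansed = raw.fillna({"city": "UNKNOWN"})
print(f"Quality improvement: {cleansing_effectiveness(raw, cleansed):+.2f}")  # +0.33
```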
Now, writing a Data Quality measurement mechanism for something like Data Conformity is relatively easy, since it relies on syntactic rules. One could measure the number of rule violations and come up with an index. Similar rules could be written for Data Completeness, Data Integrity and Data Uniqueness.
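As an illustration of such an index, here is a minimal sketch in Python with pandas. The columns and syntax rules are hypothetical; a real system would load its rules from a rules catalogue rather than hard-coding them:

```python
import re
import pandas as pd

# Hypothetical conformity rules: column -> predicate that is True when a value conforms.
RULES = {
    "email": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(v))),
    "zip_code": lambda v: bool(re.fullmatch(r"\d{5}", str(v))),
    "amount": lambda v: bool(re.fullmatch(r"-?\d+(\.\d{1,2})?", str(v))),
}

def conformity_index(df: pd.DataFrame) -> float:
    """Return the fraction of checked values that satisfy their syntax rule."""
    checked = violations = 0
    for col, rule in RULES.items():
        if col not in df.columns:
            continue
        for value in df[col]:
            checked += 1
            if not rule(value):
                violations += 1
    return 1.0 if checked == 0 else 1 - violations / checked

# Example usage with one violation per column.
df = pd.DataFrame({
    "email": ["a@example.com", "not-an-email"],
    "zip_code": ["12345", "ABCDE"],
    "amount": ["10.50", "10.505"],
})
print(f"Conformity index: {conformity_index(df):.2f}")  # 0.50
```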
Data Consistency measurement is relatively harder, especially when one looks at it from the Data Timeliness angle. It would need a system that can take a simultaneous, atomic peek at both the source & destination systems to compare them. One could use a light-weight hashing system that gives a yes or no answer. But if only one bit was amiss, you wouldn't want to throw away the whole data instance you just received because the hash was wrong. This requires a more intelligent system that trades off smaller hashed data sizes against smaller retransmissions. One could make it adaptive, hashing larger chunks of data when the quality is better and smaller chunks when it is worse.
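A sketch of that trade-off might look like the following, assuming both sides can expose the same ordered records and exchange per-chunk hashes; the chunking and record format here are purely illustrative:

```python
import hashlib
from typing import Iterable, List

def chunk_hashes(records: Iterable[str], chunk_size: int) -> List[str]:
    """Hash fixed-size chunks of records; smaller chunks mean smaller retransmissions."""
    hashes, buffer = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) == chunk_size:
            hashes.append(hashlib.sha256("\n".join(buffer).encode()).hexdigest())
            buffer = []
    if buffer:
        hashes.append(hashlib.sha256("\n".join(buffer).encode()).hexdigest())
    return hashes

def mismatched_chunks(source_hashes: List[str], dest_hashes: List[str]) -> List[int]:
    """Compare per-chunk hashes from source and destination; return chunk indices to re-send."""
    return [i for i, (s, d) in enumerate(zip(source_hashes, dest_hashes)) if s != d]

# An adaptive variant could grow chunk_size while mismatched_chunks() stays empty
# and shrink it again once mismatches start appearing.
src = chunk_hashes((f"row-{i}" for i in range(10)), chunk_size=4)
dst = chunk_hashes((f"row-{i}" if i != 5 else "corrupted" for i in range(10)), chunk_size=4)
print(mismatched_chunks(src, dst))  # [1] -- only the second chunk needs retransmission
```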
Data Accuracy, Data Integrity and Data Uniqueness have some additional complexity to them. Some heuristic will have to be applied on top of the rules. It could be as simple as a spell-check for text fields. But a spelling that is syntactically correct may still be inaccurate in that context, and an inaccurate spelling might be a valid entry. For numeric data, some heuristic rules will have to be set and some will have to be deduced from the data itself. This is a potential application of Data Science/Machine Learning: apply it to the data to find the most common ranges. It would need some intervention to make sure the system is not learning from a highly corrupt data set.
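One minimal sketch of deducing the "most common ranges" from the data itself, using a simple interquartile-range heuristic rather than a full Machine Learning model (the column name is hypothetical), could be:

```python
import pandas as pd

def learn_range(values: pd.Series, k: float = 1.5) -> tuple:
    """Deduce a plausible value range from the data using the IQR heuristic."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_out_of_range(df: pd.DataFrame, column: str) -> pd.Series:
    """Return a boolean mask of rows whose value falls outside the learned range."""
    low, high = learn_range(df[column])
    return (df[column] < low) | (df[column] > high)

# Example: one obviously corrupt order amount among otherwise typical values.
df = pd.DataFrame({"order_amount": [20.0, 25.0, 22.5, 24.0, 21.0, 9_999_999.0]})
print(df[flag_out_of_range(df, "order_amount")])
# As noted above, the learned range should be reviewed (or trained on a trusted
# sample) so the system does not learn from a highly corrupt data set.
```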