As more organizations adopt Big Data Platforms, the need to manage and maintain the quality of the Data effectively becomes more evident. Organizations typically either:
· migrate from legacy systems to the newer systems, or
· retain their large legacy systems and add the newer Platforms as an overlay to be able to perform large-scale operations
In either case, data gets moved from legacy systems into the newer Platforms. Most commonly, data from multiple sources lands in these newer systems.
Data Quality Dimensions
This raises a need to maintain the quality of the Data on the Platform, which in turn creates a need for metrics that measure Data Quality. The first step is to define the criteria and dimensions of Data Quality. Here are a few generally known key dimensions that go into defining Data Quality:
- Data Consistency – how consistent is the data across the different systems that hold copies of it
- Data Accuracy – how correct and accurate is the data
- Data Completeness – did any parts of the data get lost in transmission?
- Data Conformity or syntactic accuracy – whether it conforms to Data syntax & format rules
- Data Integrity – whether all the data parts add up correctly to a whole and have the right linkages & relationships intact.
- Data Uniqueness – whether there is any spurious or duplicate data in the data set such as duplicate rows, duplicated entities and so on
- Data Timeliness – whether the data is stored in a format and manner that lets it be retrieved in a timely way, with all of the above qualities validated at that given instant of time
The last is not really an inherent dimension of Data Quality but rather a superimposing framework that demands that the other 6 dimensions of Data Quality hold at any given time.
Measuring Data Quality
The dimensions above are generally well understood throughout the industry. No rocket science there. The important question raised by the scenario described at the beginning of this article is how to measure the quality of the data as it moves from legacy systems to the Big Data Platforms. Take Data Completeness, for example: was the Data complete in the legacy system itself, did it get truncated or mangled on its way into the Big Data Platform, or did it get truncated after landing on the Big Data Platform?
This raises a need for measuring Data Quality:
- Of the Data flowing into a system
- During storage over time to ensure that the quality does not deteriorate
- Before & after, if there is a process for cleaning/enriching the data
Measuring the quality of the Data before & after a Data Cleansing process or system runs will indicate the effectiveness of that Data Cleansing system, or lack thereof.
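As a rough illustration of a before & after measurement, the sketch below scores the same records on one dimension (Data Completeness) on either side of a cleansing step and reports the difference. The function names, the field names and the choice to treat `None` and empty strings as missing are all assumptions made for the example, not part of any particular tool.

```python
def completeness(records, required_fields):
    """Share of required field slots that hold a non-empty value."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0  # nothing to check, so nothing is missing
    filled = sum(
        1
        for record in records
        for field in required_fields
        # Assumption: None and "" both count as missing.
        if record.get(field) not in (None, "")
    )
    return filled / total


def cleansing_effect(before, after, required_fields):
    """Positive result: cleansing improved completeness; negative: it hurt."""
    return completeness(after, required_fields) - completeness(before, required_fields)


# Hypothetical records before and after a cleansing/enrichment step.
before = [{"name": "Ann", "city": ""}, {"name": None, "city": "Oslo"}]
after = [{"name": "Ann", "city": "Bergen"}, {"name": None, "city": "Oslo"}]

effect = cleansing_effect(before, after, ["name", "city"])
print(f"Completeness changed by {effect:+.2f}")
```

The same pattern would apply to any other dimension: swap in a different scoring function and compare the two scores.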
Now writing a Data quality measurement mechanism for something like Data Conformity is relatively easy, since it relies on syntactic rules. One could measure the number of rule violations and come up with an index. Similar rules could be written for Data Completeness, Data Integrity and Data Uniqueness.
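A minimal sketch of such a rule-based index might look like the following. The rules, field names and sample records are hypothetical, and the index is simply the share of checked values that pass their rule; a real system would likely weight rules differently.

```python
import re

# Hypothetical format rules: field name -> pattern the whole value must match.
RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "zip_code": re.compile(r"\d{5}"),
    "signup_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def conformity_index(records, rules=RULES):
    """Return (index, violations); index is the share of checked values that pass."""
    checked = 0
    violations = []
    for row_number, record in enumerate(records):
        for field, pattern in rules.items():
            checked += 1
            value = record.get(field)
            # Assumption: a missing value is counted as a violation here,
            # although it could equally be reported under Data Completeness.
            if value is None or not pattern.fullmatch(str(value)):
                violations.append((row_number, field, value))
    index = 1.0 if checked == 0 else 1 - len(violations) / checked
    return index, violations


records = [
    {"email": "a@example.com", "zip_code": "12345", "signup_date": "2020-01-31"},
    {"email": "not-an-email", "zip_code": "1234", "signup_date": "2020-02-01"},
]
index, violations = conformity_index(records)
print(f"Conformity index: {index:.2f}, violations: {len(violations)}")
```

Keeping the list of violations alongside the index makes it possible to see which rules fail most often, not just how many fail.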
Data Consistency measurement is relatively harder, especially when one looks at it from the Data Timeliness angle. It would need a system that can take a simultaneous, atomic peek at both the source & destination systems to compare them. One could have a light-weight hashing system that gives a yes or no answer. But if only one bit was amiss, you wouldn't want to throw away the whole data instance you just received because the hash was wrong. This requires a more intelligent system that trades off the size of the data block covered by each hash against the size of the retransmission: smaller blocks mean more hashes to compute and exchange, but less data to resend on a mismatch. One could make it more adaptive, so that the block size is larger when the quality is better and smaller when it is worse.
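One way to picture the block-hashing idea is the sketch below. It hashes fixed-size blocks so that a single flipped bit only invalidates one block, and it adjusts the block size with a simple halve-or-double policy. The function names, the SHA-256 choice and the size limits are assumptions for illustration, and the sketch does not address the harder problem of getting an atomic, simultaneous view of both systems.

```python
import hashlib


def block_hashes(data: bytes, block_size: int):
    """Hash each fixed-size block of the data separately."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def mismatched_blocks(source_hashes, received: bytes, block_size: int):
    """Return the indices of blocks whose hash differs from the source's."""
    received_hashes = block_hashes(received, block_size)
    longest = max(len(source_hashes), len(received_hashes))
    bad = []
    for i in range(longest):
        src = source_hashes[i] if i < len(source_hashes) else None
        dst = received_hashes[i] if i < len(received_hashes) else None
        if src != dst:  # also catches blocks missing on either side
            bad.append(i)
    return bad


def next_block_size(current, mismatch_ratio, minimum=1024, maximum=1024 * 1024):
    """Assumed policy: halve the block size after any mismatch, double it when clean."""
    if mismatch_ratio > 0:
        return max(minimum, current // 2)
    return min(maximum, current * 2)


source = bytes(range(256)) * 16           # 4096 bytes of sample data
received = bytearray(source)
received[1500] ^= 1                       # flip a single bit in transit

bad = mismatched_blocks(block_hashes(source, 1024), bytes(received), 1024)
print("Blocks to retransmit:", bad)
```

Only the blocks listed in `bad` would need to be resent, which is the saving that a single whole-payload hash cannot offer.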
Data Accuracy, Data Integrity and Data Uniqueness have some additional complexity to them. Some heuristics will have to be applied on top of the rules. Some could be as simple as a spell-check for text fields. But then again, a correctly spelled word may still be inaccurate in its context, and a misspelled one might be a valid entry. For numeric data, some heuristic rules will have to be set by hand and others deduced from the data itself. This is a potential application of Data Science/Machine Learning: applying it to the data to find the most common ranges. It would need some intervention to make sure the system is not learning from a highly corrupt data-set.
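As a very simple stand-in for deducing a plausible range from the data, the sketch below uses quartiles and the interquartile range rather than a trained model. Quartiles are less sensitive to a few corrupt values than the mean, though they would still be misled by a heavily corrupt data-set. The function names, the sample values and the multiplier `k=1.5` are assumptions for the example.

```python
import statistics


def learned_range(values, k=1.5):
    """Deduce a plausible (low, high) range from the data using quartiles."""
    # statistics.quantiles needs Python 3.8+ and at least two data points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_out_of_range(values, low, high):
    """Return the values that fall outside the deduced range."""
    return [v for v in values if not (low <= v <= high)]


# Hypothetical numeric field with one obviously corrupt entry.
ages_in_months = [10, 11, 12, 12, 13, 13, 14, 15, 9999]

low, high = learned_range(ages_in_months)
suspects = flag_out_of_range(ages_in_months, low, high)
print(f"Plausible range: {low} to {high}; suspects: {suspects}")
```

Flagged values are only suspects: a person would still need to review them, and to confirm that the data used to deduce the range was reasonably clean in the first place.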