They probably are. That’s right, I said it. Most Data Quality metrics are useless. Of course, that assumes that you have any to begin with. And by the time you finish this article you’ll agree with me. Maybe it won’t even take that long. Let’s start with a line from a Data Quality report sent weekly to leadership:
The quality of the customer_zip_code field in the transaction table is holding steady at 99.4%.
That sounds pretty good. But what do you really know about the customer_zip_code field? Is leadership being bamboozled? I have an analysis that needs customer zip code. Should I use this field?
One of the first articles I wrote when I started this blog more than two years ago was called, “Data Quality ≠ Fit for Purpose.” In it, I used the example of a junker car purchased cheaply that only had to get the new owner a couple miles to and from work each day. The requirements were low price and short distance. The car satisfied the requirements. It was “fit for purpose.” It was therefore of high quality.
Someone else might need a car to travel long distances without breaking down. This car couldn’t do that. It did not satisfy the requirements. It was not “fit for purpose.” It was therefore of low quality. I found these conclusions perplexing:
The same car was both high quality and low quality at the same time.
When Data Quality is considered from the “fit for purpose” perspective, the same data can be both high quality and low quality simultaneously, depending upon the intended purpose. Each set of requirements becomes a separate definition of quality for the same object.
I addressed this situation by differentiating Data Quality from Data Fitness:
Data Quality is the degree to which the data conforms to the requirements for which it was created.
Data Fitness is the degree to which the data conforms to the requirements for which it is being considered for use.
Viewed in this way, the question changes from, “Is this data of high quality?” to “Does this data satisfy my needs?”
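To make the distinction concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the phone numbers, the `conformance` helper, and the two rule functions are assumptions, not anything from a real system. It scores the same values twice, once against the requirement the data was created for and once against the requirement of a use being considered.

```python
from typing import Callable, Iterable


def conformance(values: Iterable[str], rule: Callable[[str], bool]) -> float:
    """Return the fraction of values that satisfy the given rule."""
    values = list(values)
    if not values:
        raise ValueError("cannot score an empty collection")
    return sum(1 for v in values if rule(v)) / len(values)


def digit_count(value: str) -> int:
    """Count the numeric characters in a string."""
    return sum(ch.isdigit() for ch in value)


# Hypothetical phone numbers captured by an order-entry screen.
phones = ["412-555-0101", "4125550102", "555-0103", "412-555-0104"]


def creation_rule(value: str) -> bool:
    # Assumed requirement the data was created for:
    # enough digits for the local store to call the customer back.
    return digit_count(value) >= 7


def analysis_rule(value: str) -> bool:
    # Assumed requirement of the use being considered:
    # a full ten digits so that an area code can be derived.
    return digit_count(value) == 10


data_quality = conformance(phones, creation_rule)  # relative to creation requirements
data_fitness = conformance(phones, analysis_rule)  # relative to the considered use

print(f"Data Quality: {data_quality:.0%}")
print(f"Data Fitness: {data_fitness:.0%}")
```

The values never change between the two calls. Only the rule does, which is why a single number without its rule tells you so little.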
But doesn’t that just deposit us right back in the land of “fit for purpose” where we started? No, because not all requirements are created equal. Data doesn’t just happen. It has to be created by a computer program written for some purpose according to some set of requirements. The applications that program was written to serve are the direct consumers.
Over the past four decades, we’ve come to take for granted that operational data will also be consumed by analytical systems. These are the incidental consumers. This explains why the development team ignores data-related problems reported by the analytics team, but drops everything to correct what are essentially data-related problems reported by the downstream operational consumers. Analytical requirements are not as important. I’m not saying I agree, but that’s the way it is (Data Products, properly implemented, mitigate this challenge).
Separating Data Quality from Data Fitness and rejecting the “fit for purpose” definition generated some discussion within the data community. After all, “Data Quality is fit for purpose” is a bedrock belief for many practitioners.
Having said all that, I do think that I can sharpen my argument a little more, and in doing so perhaps find common ground with the “fit for purpose” camp. We are ultimately saying the same thing, approaching from different directions where one is confusing and the other is clear (guess which is which).
Consider the atomic weight of oxygen. We can measure it to seven significant digits, but the actual value for a particular sample starts to fluctuate after five. It is neither practical nor accurate to go beyond that. Five digits is sufficient, making 15.999 “fit for purpose.”
The same applies in manufacturing. The threads of a machine screw must have certain standard dimensions so that the nut will fit properly. The ISO 724 standard provides those dimensions, and a separate standard, ISO 965, defines the manufacturing tolerances. Viewed another way, the former gives the exact dimensional standard while the latter articulates “fit for purpose.”
I will concede that:
Data Quality must always be evaluated relative to a standard or set of requirements.
That standard or set of requirements is the purpose for which it is fit. Otherwise, we have no context for understanding or evaluating the metric. Therefore, Data Quality cannot be a stand-alone property of a data set.
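One way to picture a metric that is never a stand-alone number is a record that cannot exist without the rule it was measured against. The sketch below is only an assumption about how that might look in Python; the `QualityMetric` class, its field names, and the `order_date` example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityMetric:
    """A conformance score bundled with the context needed to interpret it."""

    element: str           # the data element that was measured
    score: float           # fraction of values that conform, 0.0 to 1.0
    measured_against: str  # the rule the values were checked against
    basis: str             # "standard" or "application requirement"

    def report_line(self) -> str:
        """Render the metric so that the rule always travels with the score."""
        return (
            f"{self.element}: {self.score:.1%} conforming to "
            f"{self.basis} '{self.measured_against}'"
        )


# Hypothetical example: a date field checked against an external standard.
metric = QualityMetric(
    element="order_date",
    score=0.982,
    measured_against="ISO 8601 calendar date (YYYY-MM-DD)",
    basis="standard",
)
line = metric.report_line()
print(line)
```

Because every field is required, a report built from records like this cannot state a percentage without also stating what the percentage means.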
“Q.E.D.!” you shout (mainly because you’re a mathematician, watched public television in Pittsburgh, or are familiar with semi-obscure Latin abbreviations). “The debate is over. You said so yourself.” Not quite. “Standard” and “set of requirements” are mostly synonymous, but the difference is subtle and important.
Some business concepts have standards set by external organizations like ISO, IEEE, IATA, USPS, etc. If such a standard exists, it should be adopted. The remaining corporate and organizational business concepts are standardized within a Business Glossary. They are proposed, agreed upon, and memorialized. Requirements, on the other hand, can differ for each project.
Here is where we approach the core issue, and where Data Quality and Data Fitness begin to converge.
Let’s return to the customer_zip_code example. Our company doesn’t have a Business Glossary. The field name is the only metadata available for most data elements. We need customer zip code. This field looks like a good candidate, plus its Data Quality is high. So, we use it. And get unexpected results. How can that be? We look more closely at the requirements used to create the data and discover that only the first three digits are important to the consuming application. Quality is only evaluated over the first three digits. When we profile over all five digits, we find that Data Quality drops to 64.9%. More often than not, the last two digits are ‘00’ because the clerks are evaluated on how often they get the first three digits right, not all five. This will definitely not work for our application.
We incorrectly assumed that the quality of the zip code data element was measured relative to a standard (either USPS or Business Glossary) when it was actually measured relative to an application requirement.
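The gap between the two measurements is easy to reproduce. The Python sketch below is illustrative only: the sample values and the tiny `VALID_ZIPS` reference set are made up, and a real check would compare against an authoritative list of valid codes. It profiles one column against the application requirement (first three digits) and against the five-digit standard.

```python
# Hypothetical customer_zip_code values, invented for illustration.
zip_codes = [
    "15213", "15200", "15200", "15106",
    "15222", "15100", "15200", "99900",
]

# Assumed stand-in for a real reference list of valid five-digit codes.
VALID_ZIPS = {"15213", "15106", "15222"}
VALID_PREFIXES = {code[:3] for code in VALID_ZIPS}


def prefix_ok(value: str) -> bool:
    """Application requirement: only the first three digits must be valid."""
    return len(value) >= 3 and value[:3] in VALID_PREFIXES


def full_ok(value: str) -> bool:
    """Standard: the complete five-digit code must be valid."""
    return value in VALID_ZIPS


def score(values, rule) -> float:
    """Fraction of values that satisfy the rule."""
    return sum(1 for v in values if rule(v)) / len(values)


three_digit_quality = score(zip_codes, prefix_ok)
five_digit_quality = score(zip_codes, full_ok)

print(f"customer_zip_code vs. first-three-digits rule: {three_digit_quality:.1%}")
print(f"customer_zip_code vs. five-digit standard:     {five_digit_quality:.1%}")
```

Both numbers are honest answers. They are answers to different questions, and a report that prints only the first one invites exactly the assumption we made.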
What this all means for us practically is that:
Data Quality evaluation begins in the Business Glossary.
This allows us to be clear about Data Quality standards and specific application requirements.
Discovering that your application requirement differs from the standard contained in the Business Glossary should trigger a whole bunch of questions: What is different about it and why does it have to differ? Are your requirements more strict or less strict? Is it really the same business object? Are you using an existing object in a non-standard way? Do you need to reconsider?
The first step is determining whether the standard is fit for purpose, not the data.
Ultimately, Data Quality is in the eye of the beholder (which, again, makes the use of the word “quality” problematic). To increase clarity, we need to leverage existing standards whenever possible, clearly define data content standards in addition to the definition in the Business Glossary, and focus on Data Quality context before the metric.