Everybody likes a pithy definition. Marketers describe them as “sticky,” meaning that they’re easy to remember. Of course, that doesn’t always mean they’re useful or completely accurate. Information Management has a couple. Metadata is almost universally described as “Data About Data,” but I’d be willing to bet that you rolled your eyes just now. How many times have we seen Metadata introduced that way, with the presenter or author immediately apologizing and moving on to a more useful description?
Similarly, the Data Quality bumper sticker reads “Fit for Purpose.” You can probably already guess that I’m not a fan. Let’s pull out our DMBOK and see what it says:
The term data quality refers both to the characteristics associated with high quality data and the processes used to measure or improve the quality of data. [DMBOK-2, 644]
OK so far. It then goes on to explain what is meant by “high quality data”:
Data is of high quality to the degree that it meets the expectations and needs of data consumers. That is, if the data is fit for the purposes to which they want to apply it. It is of low quality if it is not fit for those purposes. Data quality is thus dependent on context and on the needs of the data consumer. [DMBOK-2, 644; emphasis added]
Far be it from me to challenge the accumulated knowledge of our field, but I very strongly disagree with that statement. Quality, generally speaking, is the degree of conformity to a defined standard. Many fields have entire organizations that develop and publish standards. They can be international like ISO, country-specific like LEED, or local like construction codes.
Data is a little different because it is created and consumed by computer programs supporting specific business functions and capabilities, or representing real-world objects and events. There are infinite variations, and only rarely does an external standard exist against which the accuracy of the data can be measured (for example, a list of standard IATA airport codes).
So, doesn’t that take us back to “fit for purpose”?
Perhaps, but we need to be clear about whose purpose and for what requirements, established when and by whom.
Imagine you’re looking to purchase a used car to get you back and forth to work. You don’t have a lot of money, but you only have to drive a couple miles each way. You find a car that’s extremely inexpensive, but the engine overheats after running for about a half hour. You buy it despite the engine problem because it satisfies your requirements: really low price and five-mile commute. It is fit for your purpose, and therefore by the DMBOK definition it is “high quality.”
One day, you want to visit family a couple hundred miles away. You set out in your “high quality” car and haven’t even completed 10% of the trip when you have to stop and let the engine cool down. At this rate it’s going to take three days to travel the two hundred miles. You curse this piece of junk. The car is now “low quality” because it does not satisfy the new purpose to which you wanted to apply it.
The car was evaluated as both high quality and low quality even though nothing about the car changed.
It was your perception of the car’s quality relative to a new purpose that changed.
Within the context of the DMBOK definition, every consumer evaluates the quality of a data set differently and independently. Data is considered to be of high quality when it is fit for my purpose, satisfying my requirements, defined by me when I need the data.
Furthermore, defining and assessing data quality in this way makes it difficult for these quality analyses to be leveraged by new consumers having different contexts and different needs. Data Quality, defined in this way, truly is in the eye of the beholder.
This is not Data Quality. It is Data Fitness.
The DMBOK doesn’t recognize Data Fitness as a specific knowledge area but mentions it as part of Data Profiling.
Assessing the fitness of the data for a particular use requires documenting business rules and measuring how well the data meets those business rules. [DMBOK-2, 418; emphasis added]
But this sounds an awful lot like “data is of high quality to the degree that it meets the expectations and needs of data consumers.” It seems like quality and fitness are being conflated.
And confused.
I’m confused.
Let’s go back to the data headwaters: the customer for whom the data was created in the first place. The needs and utilization context for that customer were:
- expressed in their requirements, epics, features, and/or user stories,
- captured in the data definitions, expected content, and other quality dimensions, and
- implemented in the application.
The needs of additional downstream consumers known a priori could also be considered, but most of these uses and users emerge after an application is deployed.
This original set of requirements is the only standard against which Data Quality should be measured. This allows us to definitively answer the questions of whose purpose, what requirements, established when, and by whom.
Data Quality is the degree to which data conforms to the requirements for which it was created (definition, expected content, and other dimensions).
We know how to do that. The DMBOK lists several Data Quality dimensions, each with objective measures. Other sources identify dozens more. But the standard is clear.
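As a minimal sketch of what measuring against that standard can look like (the rule set, field names, and sample records here are all invented for illustration, and the airport list is a tiny stand-in for the real IATA codes), dimensions such as completeness and validity reduce to objective conformance checks:

```python
# Hypothetical sketch: measure Data Quality as conformance to the
# requirements defined when the data was created.
# Records, fields, and the valid-value domain are invented examples.

VALID_IATA = {"JFK", "LAX", "ORD", "ATL"}  # tiny stand-in for the IATA list

records = [
    {"flight": "AA100", "origin": "JFK", "dest": "LAX"},
    {"flight": "AA101", "origin": "XXQ", "dest": "ORD"},  # invalid origin
    {"flight": "AA102", "origin": "ATL", "dest": None},   # missing dest
]

def completeness(recs, field):
    """Share of records where the field is populated."""
    return sum(r[field] is not None for r in recs) / len(recs)

def validity(recs, field, domain):
    """Share of populated values that fall within the defined domain."""
    vals = [r[field] for r in recs if r[field] is not None]
    return sum(v in domain for v in vals) / len(vals)

print(f"dest completeness: {completeness(records, 'dest'):.2f}")            # 0.67
print(f"origin validity:   {validity(records, 'origin', VALID_IATA):.2f}")  # 0.67
```

Each score is measured against the same fixed standard regardless of who later consumes the data, which is the point: the yardstick was set when the data was created.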
Now, the definition of Data Fitness also becomes clear.
Data Fitness is the degree to which data conforms to the requirements for which it is being considered for use.
Obviously this includes the original customer, and maybe that original set of requirements was what the authors of the DMBOK had in mind when they were writing about Data Quality.
But Data Fitness is also evaluated by each new potential consumer. The question being asked is, “Does this data set satisfy my needs?” Not, “Is this data of high quality?”
We know how to do that, too.
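The mechanics are the same; only the ownership of the rules changes. In this hedged sketch (again with invented records and rules), the same data set scores differently depending on whose requirements it is measured against:

```python
# Hypothetical sketch: Data Fitness evaluates the same data set against
# a *new* consumer's rules, which may go beyond the original requirements.
# All records and rules here are invented for illustration.

records = [
    {"airport": "JFK", "country": "US"},
    {"airport": "LHR", "country": None},
]

# The producing application only required `airport`; a downstream
# analytics team also needs `country` populated.
original_rules = [lambda r: r["airport"] is not None]
consumer_rules = original_rules + [lambda r: r["country"] is not None]

def conformance(recs, rules):
    """Fraction of records satisfying every rule in the rule set."""
    return sum(all(rule(r) for rule in rules) for r in recs) / len(recs)

print(f"Quality (original requirements): {conformance(records, original_rules):.2f}")  # 1.00
print(f"Fitness (consumer requirements): {conformance(records, consumer_rules):.2f}")  # 0.50
```

Quality is perfect because the data conforms fully to the requirements for which it was created; fitness for the new consumer is only partial, and that gap is a change request, not a quality defect.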
Finally, downstream consumers can request upstream application changes to accommodate their requirements. Not to improve quality. This might at least partially explain why development teams are less than excited to hear from us when we approach them with a “data quality” issue related to our expectations, not their requirements.
I hate to introduce (or reintroduce) vocabulary into a field that drops new terms like a hay baler, but I believe it is worthwhile to differentiate more clearly between Data Fitness and Data Quality. Each has a different meaning and purpose. Each is a separate knowledge area.