This is the third part of a multi-part series discussing Data Quality and the ISO 25000 series standard.
OK. You’re with the program. You recognize that having quality data is important for accurate AI models. You would even concede that it’s important for other stuff, too, but whatever because AI is getting all the attention right now. You want to start publishing Data Quality metrics. All good.
You’ve done some research and read some articles and you’ve found a Data Quality framework. Maybe you’ve happened upon the ISO 25000 standard explored here the past couple weeks (Part 1, Part 2). You have all the quality metrics and their definitions, and you’re ready to start calculating them and publishing them and …
You hit a brick wall almost immediately, discovering that you don’t have the information required to calculate the quality metrics.
And it all falls apart.
The ISO 25024 Measurement of Data Quality standard defines eleven types of Quality Measure Elements that are used to calculate the Quality Measures:
- Number of accesses
- Number of attributes
- Number of data items
- Number of data values
- Number of elements
- Number of information items
- Number of metadata
- Number of records
- Number of times
- Size
- Time
Data Quality measurement assumes the existence of Quality Measure Elements as well as a supporting infrastructure for their collection and calculation that too often does not exist.
This is a nontrivial prerequisite activity. It’s great to have a standard with the measures and formulas and everything, but there’s a whole bunch of stuff that has to be done beforehand to even get to the point where these measures and formulas and everything can be used.
The absence of this information and infrastructure is the barrier to doing Data Quality Analysis and is the core of our Data Quality problem.
Or, perhaps more accurately, the core of our lack of Data Quality problem. Otherwise, all you have is great conversation about Data Quality and how important it is and how you need to measure it and all that. There’s a gap between where you are right now and where you need to be to even start doing Data Quality analysis.
We need a Data Quality Readiness metric that highlights the need for this preliminary work and measures progress toward the ability to calculate a desired set of Data Quality Measures.
Data Quality Readiness is defined as the ratio of the number of fully described Data Quality Measure Elements that are being calculated and/or collected to the number of Data Quality Measure Elements in the desired set of Data Quality Measures.
By fully described, I mean both halves of an element like "number of data values that are outliers": the base count (the "number of data values" part) and the qualifying condition (the "that are outliers" part).
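Once you know which elements your chosen measures require and which you can actually produce, the readiness ratio itself is trivial. A minimal sketch, where the element names are illustrative placeholders rather than identifiers from the standard:

```python
# Sketch: Data Quality Readiness as defined above.
# Element names below are illustrative, not taken from ISO 25024.

# Quality Measure Elements required by the desired set of Quality Measures
required_elements = {
    "number_of_records",
    "number_of_data_values",
    "number_of_data_values_that_are_outliers",
    "number_of_attributes",
}

# Elements that are fully described and being collected/calculated today
available_elements = {
    "number_of_records",
    "number_of_attributes",
}

def data_quality_readiness(required: set, available: set) -> float:
    """Ratio of fully described, collected elements to required elements."""
    if not required:
        return 1.0  # nothing required means trivially ready
    return len(required & available) / len(required)

print(data_quality_readiness(required_elements, available_elements))  # 0.5
```

A score below 1.0 is the gap: the preliminary work still standing between you and your first real Data Quality Measure.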
The first prerequisite activity is determining which Quality Measures you want to implement.
The ISO standard defines 15 Data Quality Characteristics, which I covered in the earlier articles. These comprise 63 Quality Measures, categorized as Highly Recommendable (19), Recommendable (36), and For Reference (8). That categorization provides a starting point for prioritization.
Begin with a few measures that are most applicable to your organization and that will have the greatest potential to improve the quality of your data. The reusability of the Quality Measures can factor into the decision, but it shouldn’t be the primary driver. The objective is not merely to collect information for its own sake, but to use that information to generate value for the enterprise.
The result will be a set of Data Quality Measure Elements to collect and calculate. You do the ones that are best for you, but I would recommend looking at two in particular.
1) Number of Data Items.
It’s useful to have a framework or skeleton or scaffolding on which you can hang the information that you collect. The most effective approach is to collect entity and attribute information from files and structured repositories, including JSON, XML, and the like. The existence or descriptions of so-called unstructured data can also be captured.
The collection process can then be automated, much like a web crawler looking for objects to inventory, but it doesn’t have to be fancy. You can build your own repository, or use one of several great metadata hub products available. Most are either already integrated, or can be integrated into other applications and processes.
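To show it really doesn’t have to be fancy, here’s a minimal sketch of that kind of crawl against a SQLite database (purely for illustration; your crawler would target whatever repositories you actually have):

```python
import sqlite3

def inventory_sqlite(path: str) -> dict[str, list[str]]:
    """Crawl a SQLite database and return {table: [column, ...]}."""
    conn = sqlite3.connect(path)
    inventory = {}
    # sqlite_master lists every object in the database; keep only tables
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        inventory[table] = [col[1] for col in cols]
    conn.close()
    return inventory
```

Point the same idea at your databases, file shares, and APIs, and the resulting inventory becomes the scaffolding described above.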
As a result, you’ll have a bunch of the denominators. Entire tables or databases or systems can be flagged as being in-scope for Data Quality measurement. You shouldn’t have to go field-by-field. From there, define and populate the attributes that determine the numerators. That brings me to the second item:
2) The Metadata Completeness Quality Measure Element.
When you’re starting out, the Completeness characteristic, and the Metadata Completeness Quality Measure Element in particular, is the most important. It measures the ratio of attributes with complete metadata within the data dictionary to the number of attributes for which metadata is expected. Of course, you’ll have to determine the “for which metadata is expected” part.
Keep in mind that completeness is a bit of a misnomer. Metadata can never truly be “complete” because you can always capture something more. At a bare minimum, I consider “complete” to mean having Expected Content. Of course, it’s hard to do Expected Content without a Description, so include that one, too.
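As a sketch of that bare-minimum definition, here’s the ratio computed over a toy data dictionary (the attribute names and metadata keys are hypothetical, and “complete” is my minimal Description-plus-Expected-Content definition, not the standard’s):

```python
# Sketch: Metadata Completeness with "complete" defined minimally as
# having both a Description and an Expected Content entry.

data_dictionary = {
    # attribute -> metadata captured so far (illustrative)
    "customer_id": {"description": "Surrogate key",
                    "expected_content": "positive integer"},
    "email":       {"description": "Contact email"},  # no expected content yet
    "signup_date": {},                                # no metadata at all
}

REQUIRED_KEYS = {"description", "expected_content"}

def metadata_completeness(dictionary: dict) -> float:
    """Ratio of attributes with complete metadata to attributes expected to have it."""
    expected = len(dictionary)  # here: every attribute in scope
    complete = sum(
        1 for meta in dictionary.values() if REQUIRED_KEYS <= meta.keys()
    )
    return complete / expected if expected else 1.0

print(metadata_completeness(data_dictionary))  # 1 of 3 attributes complete
```

The same numerator-over-denominator shape recurs throughout the standard’s measures, which is why the inventory from item 1 pays for itself quickly.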
This gets you a long way. You can see now how important it was to collect metadata incrementally as part of your standard development process. If you’ve been doing that, great! If not, you’ve got some catching up to do. Time to get started!
The most practical approach may be to pursue two Data Quality tracks in parallel.
One track is the quick and dirty one that you do because you’ve got some attention right now and you want to show results right away. You don’t need to buy or install or implement anything. Just use your existing query and reporting tool to sample the data in each table and summarize each data element. Yes, without a metadata repository, bookkeeping can be a pain. But if you plan ahead a little, you can use database tables to store the results instead of a bunch of spreadsheets. This makes a great summer intern project.
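The quick-and-dirty track can be as small as this sketch: profile each column and store the results in a database table rather than spreadsheets. SQLite stands in for whatever database you actually use, and the `dq_profile` table name is my invention:

```python
import sqlite3

def profile_table(conn, table: str) -> list[tuple]:
    """Summarize each column: row count, non-null count, distinct count."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    results = []
    for col in cols:
        total, nonnull, distinct = conn.execute(
            f'SELECT COUNT(*), COUNT("{col}"), COUNT(DISTINCT "{col}") '
            f'FROM {table}'
        ).fetchone()
        results.append((table, col, total, nonnull, distinct))
    return results

def store_profile(conn, rows) -> None:
    """Keep the results in a table instead of a pile of spreadsheets."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dq_profile ("
        " table_name TEXT, column_name TEXT, row_count INT,"
        " non_null_count INT, distinct_count INT)"
    )
    conn.executemany("INSERT INTO dq_profile VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
```

Run it over the in-scope tables on a schedule and you already have raw material for Completeness-style numerators and denominators.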
The second track is a more structured Data Quality implementation initiative. Maybe using a metadata repository tool. Gather information about your information assets. Automate the processes and the analyses, and trigger notifications when anomalies are discovered. Use the ISO standard to define the dimensions to analyze. This will take longer to deliver, but it will be more scalable and more sustainable.
I’m writing this on Opening Day of the new baseball season. To employ an overused metaphor, many companies haven’t made it to first base. They’re not in the batter’s box. They’re not even on deck. They’re sitting on the bench in the dugout. Stand up, grab a bat, and get in the game.
Photo Credit: Minda Haas Kuhlmann, “It got late enough into the no-hitter that Teaford was alllll alone in the dugout.” Flickr.com. Some rights reserved.