This is the first part of a multi-part series exploring Data Quality and the ISO 25000 series standard.
In the 1964 dark comedy Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, General Jack D. Ripper orders a nuclear strike on the USSR. Despite efforts to recall the bombers, one plane successfully drops a bomb on a Soviet target, triggering the Doomsday Device. Once activated, it cannot be disarmed, and the result is global annihilation. The Doomsday Device was built as a deterrent, but it was activated before it was announced. Nobody knew about it. “The Premier was going to announce it on Monday. As you know, the Premier loves surprises.”
What if there was an international standard for Data Quality, but nobody knew about it?
There is: ISO 25012.
OK, it’s not even remotely the same thing. I get it. But when I first heard about the ISO 25012 standard, I was reminded of that scene.
The International Organization for Standardization (ISO) creates and publishes standards for quality, safety, and efficiency. For example, the ISO standards for screw threads (ISO 261-263) enable bolts made by one manufacturer to fit into nuts made by another. I suspect that many of you are familiar with the ISO 9001 standard for quality management.
Same folks.
In 2023, ISO published the 42001 standard that “specifies requirements for establishing, implementing, maintaining, and continually improving an Artificial Intelligence Management System (AIMS) within organizations. It is designed for entities providing or utilizing AI-based products or services, ensuring responsible development and use of AI systems.” You heard about that one? I hadn’t either until I went poking around the ISO website.
Individual countries then adapt the ISO standards to their specific needs. In the United States, that’s the responsibility of the American National Standards Institute (ANSI). In India, it’s the Bureau of Indian Standards (BIS). And so forth. Sometimes the ISO standard is adopted in toto. Sometimes not. A familiar example is signage in public spaces, like exit signs. The ISO standard uses pictograms that can be understood regardless of language. The ANSI standard requires the English word “EXIT.”
The full title for the ISO 25000 series is Systems and Software Engineering — Systems and Software Quality Requirements and Evaluation, abbreviated SQuaRE. Portions, some under different standard numbers and titles, have been under development since the 1980s and were consolidated into SQuaRE in 2005. Most have been updated since then, the most recent in 2024. For the record, in the United States, ANSI has fully adopted the ISO 25000 standard.
The standard consists of five quality-focused divisions—requirements, model, management, measurement, and evaluation—as well as an extension division that addresses specific application domains, for a total of 20 different standards.
One challenge for practitioners is that it costs about $3,000 to access all of the documents in the series.
This barrier to even seeing much of the standard also reminded me of the movie. If the objective is to get companies to conform to the standard, that doesn’t seem like a very good way to go about it. Fortunately, some of the details have been made public in conference presentations, in white papers, and by country-specific standards organizations.
The Guide to SQuaRE provides an overview and roadmap for the standard. Fortunately, it can be downloaded from ISO for free. Well, more accurately, it can be purchased for zero Swiss francs. I won’t reproduce all the details here, but it might be worthwhile for you to read it.
From the Introduction:
The general goal of creating the SQuaRE set of International Standards was to move to a logically organized, enriched and unified series covering two main processes: software quality requirements specification and systems and software quality evaluation, supported by a systems and software quality measurement process. The purpose of the SQuaRE set of International Standards is to assist those developing and acquiring systems and software products with the specification and evaluation of quality requirements. It establishes criteria for the specification of systems and software product quality requirements, their measurement, and evaluation.
The focus of the standard (as its name suggests) is software quality, but highlighted as an innovation over its predecessors is the introduction of a data quality model. Finally, we’ve found our way to data, and to a pair of standards in particular:
25012: Data Quality Model
25024: Measurement of Data Quality
The Data Quality Model is made up of fifteen Data Quality characteristics, categorized as inherent, system dependent, or a combination of the two.
Inherent characteristics are applied to the data itself: domain values, business rules, relationships, and metadata. They describe the potential of the data to “satisfy stated and implied needs when the data is used under specified conditions.” I’m going to defer the question of whose specified conditions until a future article. It’s an important question but not one that’s critical to address right now.
Accuracy: The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use, both syntactically and semantically. Also included is data model accuracy.
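To make the syntactic/semantic distinction a bit more concrete, here’s a rough sketch of my own (not something from the standard) using a hypothetical date-of-birth field. A value can be perfectly well formed and still be the wrong value for the thing it describes.

```python
from datetime import date, datetime

def syntactically_accurate(dob_text: str) -> bool:
    """Syntactic accuracy: is the value a well-formed ISO 8601 date?"""
    try:
        datetime.strptime(dob_text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def semantically_accurate(dob_text: str) -> bool:
    """Semantic accuracy: is the value plausible as a date of birth
    (well formed, not in the future, not impossibly old)?"""
    if not syntactically_accurate(dob_text):
        return False
    dob = date.fromisoformat(dob_text)
    return date(1900, 1, 1) <= dob <= date.today()

# "2150-02-30" is not even a real date, so it fails syntactically;
# "2150-02-28" is well formed but fails semantically because it lies in the future.
print(syntactically_accurate("2150-02-30"), semantically_accurate("2150-02-28"))
```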
Completeness: The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use. In other words, are all of the attributes populated and all of the records present? Also included is conceptual data model completeness. From an analytics perspective, completeness needs to be considered not just for a single data file, system, or table, but across the entire domain. For example, a company might have four different customer systems, each collecting a portion of the information about a customer. A data feed might be complete with respect to the particular source system, but we need to know how much of the entire domain is covered by the set of customer systems.
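As a rough sketch of those two views of completeness, here’s some illustrative Python. The systems, fields, and customer IDs are all made up; the point is simply that attribute-level completeness within one feed and coverage of the whole customer domain are different questions.

```python
# Hypothetical customer data used only for illustration.
crm_feed = [
    {"customer_id": 1, "name": "Acme", "email": "ops@acme.test", "phone": None},
    {"customer_id": 2, "name": "Globex", "email": None, "phone": "555-0100"},
]
expected_attributes = ["customer_id", "name", "email", "phone"]

# Attribute completeness: share of expected attributes populated per record.
for record in crm_feed:
    populated = sum(1 for attr in expected_attributes if record.get(attr) is not None)
    print(record["customer_id"], populated / len(expected_attributes))

# Domain completeness: how much of the whole customer domain each system covers.
customer_ids_by_system = {
    "crm": {1, 2},
    "billing": {2, 3},
    "support": {3, 4, 5},
}
domain = set().union(*customer_ids_by_system.values())
for system, ids in customer_ids_by_system.items():
    print(system, f"{len(ids) / len(domain):.0%} of known customers")
```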
Consistency: The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. It applies both among data describing a single entity and across similar data for comparable entities. This includes semantic consistency, referential integrity, data value consistency, and data format and database type consistency.
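Two of those flavors, referential integrity and data value consistency, are easy to show in a few lines. This is just a sketch with made-up customers and orders, not a prescription from the standard.

```python
# Hypothetical data: customer master, orders, and a second (billing) system.
customers = {101: {"country": "US"}, 102: {"country": "DE"}}
orders = [{"order_id": 1, "customer_id": 101}, {"order_id": 2, "customer_id": 999}]
billing_country = {101: "US", 102: "FR"}

# Referential integrity: every order should point at a known customer.
orphan_orders = [o["order_id"] for o in orders if o["customer_id"] not in customers]
print("orders with no matching customer:", orphan_orders)   # [2]

# Value consistency: the same customer's country should agree across systems.
mismatches = [cid for cid, c in customers.items() if billing_country.get(cid) != c["country"]]
print("customers with conflicting country values:", mismatches)  # [102]
```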
Credibility: The degree to which data has attributes that are regarded as true and believable by users in a specific context of use. Credibility includes the concept of authenticity (the truthfulness of origins, attributions, commitments). This is referred to as Validity in other Data Quality models. This characteristic also considers the credibility of the data dictionary, data model, and data sources (i.e. Authoritative Sources).
Currentness: The degree to which data has attributes that are of the right age in a specific context of use. This is referred to as Timeliness in other Data Quality models. Although a characteristic of the data itself, it is strongly influenced by the source system implementation. For instance, a car passing through a toll reader generates a toll event instantaneously, but the system that processes those events might accumulate them throughout the day before processing them. Although potentially available immediately, from the perspective of downstream systems, “new” data can be up to twenty-four hours old. That is not a function of the data itself, but of the systems that process it.
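One way to measure currentness, then, is simply data age against the freshness the downstream use requires. The sketch below uses the toll example with a hypothetical 24-hour requirement; the right threshold obviously depends on the context of use.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_REQUIREMENT = timedelta(hours=24)  # hypothetical downstream requirement

# Illustrative toll events with made-up capture timestamps.
now = datetime.now(timezone.utc)
toll_events = [
    {"event_id": "A1", "recorded_at": now - timedelta(hours=2)},
    {"event_id": "B7", "recorded_at": now - timedelta(hours=30)},
]

for event in toll_events:
    age = now - event["recorded_at"]
    status = "current" if age <= FRESHNESS_REQUIREMENT else "stale"
    print(event["event_id"], round(age.total_seconds() / 3600, 1), "hours old ->", status)
```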
The Inherent characteristics largely align with the DAMA-DMBOK dimensions of Data Quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity. The only one that’s missing from the ISO standard is uniqueness, which is addressed only obliquely by the Efficiency characteristic; I’ll talk more about that later.
System dependent Data Quality characteristics, as the name suggests, describe the impact of computer systems components (e.g. hardware, software, network, repository) on the data. Quality is a function of the application implementation and the operational environment. For that reason, I consider these to be more Application Quality characteristics than Data Quality characteristics. Nevertheless, development teams should be encouraged to consider them.
Availability: The degree to which data has attributes that enable it to be retrieved by authorized users and/or applications in a specific context of use. From the systems perspective, this is usually referred to as uptime.
Portability: The degree to which data has attributes that enable it to be installed, replaced or moved from one system to another preserving the existing quality in a specific context of use. This is a big one when looking ahead to the possibility of changing your repository, development framework, or cloud provider. How many database-specific, application development platform-specific, or cloud provider-specific capabilities are you using?
Recoverability: The degree to which data has attributes that enable it to maintain and preserve a specified level of operations and quality, even in the event of failure, in a specific context of use. This measures the success of backup and recovery processes. Today, data volumes make offline backups impractical, and companies are increasingly choosing to deploy additional, geographically dispersed instances of the data instead, perhaps in different cloud sites or data centers.
The remaining Data Quality characteristics share features that are both inherent and system dependent. Most are of the form “The data supports X” and “The technology supports X.” More or less. The overlapping Data Quality characteristics that follow this pattern are:
Accessibility: The degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability. I understand the reasons for wanting to measure this, but it is not a function of the quality of the data. For example, a screen reader cannot (yet) interpret an image. That doesn’t mean, though, that the image data is low quality. The limitation is inherent to the problem itself. Failing to support assistive technology, such as a screen reader that makes textual data accessible to the visually impaired, is an application design or requirements issue.
Compliance: The degree to which data has attributes that adhere to standards, conventions or regulations in force and similar rules relating to data quality in a specific context of use. Regulations may require that data have certain values or formats, perhaps for data interchange between entities. Business rules may also need to be incorporated into the applications that enforce compliance.
Confidentiality: The degree to which data has attributes that ensure that it is only accessible and interpretable by authorized users in a specific context of use. At this point, encrypting personal and confidential information and protecting computer systems from unauthorized access should be standard operating procedure.
Precision: The degree to which data has attributes that are exact or that provide discrimination in a specific context of use. Think significant figures from high school chemistry. Does the data have the right decimal format, and does the application use the right decimal format? I believe, though, that this should be extended to incorporate ontological granularity, which is just a fancy way of saying that needing to know the location of a specific rooftop is different from needing to know the location of a country. Rooftop is greater precision than country.
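Here’s a small, admittedly simplified sketch of that idea: checking whether stored latitude values carry enough decimal places for a rooftop-level use. The four-decimal-place requirement is a made-up threshold for illustration, not anything the standard prescribes.

```python
from decimal import Decimal

def decimal_places(value: str) -> int:
    """Count the decimal places actually stored in a numeric string."""
    exponent = Decimal(value).as_tuple().exponent
    return -exponent if exponent < 0 else 0

REQUIRED_PLACES = 4  # hypothetical requirement for rooftop-level geocoding

for lat in ["40.7484", "40.7", "41"]:
    ok = decimal_places(lat) >= REQUIRED_PLACES
    print(lat, "->", "sufficient precision" if ok else "too coarse for rooftop use")
```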
Traceability: The degree to which data has attributes that provide an audit trail of access to the data and of any changes made to the data in a specific context of use. This has two parts. The first is logging user access. The second is logging data item value changes. With the latter, it would be easy to make the leap directly to Lineage, but this actually refers to keeping a history of data element values. It’s like what we do for slowly changing dimensions. I suppose if the data changes as it moves between systems, then some knowledge of lineage is implied. But Lineage as it is most commonly understood in Data Governance is not included in the standard.
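To make the value-history part concrete, here’s a minimal sketch in the spirit of a Type 2 slowly changing dimension: every change to a data element is appended, never overwritten, along with who made it and when. The field names and values are hypothetical.

```python
from datetime import datetime, timezone

audit_log = []  # append-only history of data element changes

def record_change(entity_id, attribute, old_value, new_value, changed_by):
    """Append one change event; nothing is ever updated in place."""
    audit_log.append({
        "entity_id": entity_id,
        "attribute": attribute,
        "old_value": old_value,
        "new_value": new_value,
        "changed_by": changed_by,
        "changed_at": datetime.now(timezone.utc).isoformat(),
    })

record_change(101, "email", "old@acme.test", "new@acme.test", "jsmith")
record_change(101, "email", "new@acme.test", "ops@acme.test", "batch_load")

# The full history of the element's values can be reconstructed from the log.
for entry in audit_log:
    print(entry["changed_at"], entry["attribute"], entry["old_value"], "->", entry["new_value"])
```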
I would recommend that Lineage be added as a system-dependent characteristic of Traceability.
Call it Tra-D-2 for those of you following along with the standard at home. Data movement traceability could be defined as “The possibility to trace the history of the movement of a data item between applications and systems using system capabilities.”
The last two, Efficiency and Understandability, warrant a little more discussion, and I’ve gone on long enough this week. I’ll cover those in the next installment in this series.