This is the second part of a multi-part series discussing Data Quality and the ISO 25000 series standard.

The first installment in this series introduced the ISO 25000 series standard (System and Software Engineering — Systems and Software Quality Requirements and Evaluation) with a particular focus on ISO 25012 (Data Quality Model) and ISO 25024 (Measurement of Data Quality). The standard defines fifteen data quality characteristics, of which thirteen were covered. That leaves two: Efficiency and Understandability. Both have inherent and system-dependent aspects. Let’s dive right in.

Efficiency: The degree to which data has attributes that can be processed and provide the expected levels of performance by using the appropriate amounts and types of resources in a specific context of use.

This one is particularly interesting (to me at least) in that it seeks to measure the impact of design decisions on storage, processing, and user experience.

The inherent perspective focuses on syntactic and semantic ease of use. Numeric values stored as strings are harder to use than numeric values stored as integers or floating-point numbers. Distances stored in miles are harder to use in a country that uses the metric system than distances stored in kilometers.
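
To make the point concrete, here is a toy sketch in Python (the field names and values are invented): the value stored as text in imperial units has to be parsed and converted before it can be used, while the native metric value is ready as-is.

```python
# Hypothetical records illustrating the inherent Efficiency trade-off.
raw_record = {"distance": "3.2 miles"}   # numeric value stored as text, in imperial units
clean_record = {"distance_km": 5.15}     # numeric value stored as a number, in metric units

# The text value must be parsed and converted before any arithmetic is possible.
value, unit = raw_record["distance"].split()
raw_km = float(value) * 1.609344 if unit == "miles" else float(value)

# The native numeric value needs no extra work.
print(round(raw_km, 2), clean_record["distance_km"])
```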

Usability can be evaluated by comparing the time it takes experienced and novice users to complete the same task, but for the most part the definitions of “efficient” and “efficiency” are subjective. The potential for defining a set of objective guidelines is mentioned but not fleshed out.

The system-dependent perspective focuses on space, processing, and time. Text stored in fixed-length strings usually requires more space than text stored in variable-length strings. Numeric values stored as strings must be converted before mathematical operations can be applied.
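
A rough illustration of the space difference, again in plain Python (the 40-character width stands in for a fixed-length column and is an arbitrary assumption):

```python
# Simulate a fixed-length CHAR(40) column versus a variable-length column.
names = ["Ann", "Bartholomew", "Li"]

fixed_width = 40  # assumed column width for the fixed-length case
fixed_bytes = sum(len(name.ljust(fixed_width).encode("utf-8")) for name in names)
variable_bytes = sum(len(name.encode("utf-8")) for name in names)

print(f"fixed-length storage:    {fixed_bytes} bytes")
print(f"variable-length storage: {variable_bytes} bytes")
```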

Guidelines that inform most of these design decisions have been known to DBAs for decades, but they are not explicitly described in the standard. 

The latency of data movement between systems is also considered part of Efficiency. Unlike the Currentness characteristic, which measures latency relative to a business requirement, here latency is simply measured as a proxy that aggregates efficiency across all components: data architecture, network, application, repository, etc. The lower, the better.
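
As a minimal sketch of how such a proxy might be computed, assuming each record carries a source creation timestamp and a target load timestamp (both field names are hypothetical):

```python
from datetime import datetime

# Hypothetical records with the time they were created in the source system
# and the time they became available in the target repository.
records = [
    {"created_at": datetime(2024, 5, 1, 9, 0, 0), "loaded_at": datetime(2024, 5, 1, 9, 0, 42)},
    {"created_at": datetime(2024, 5, 1, 9, 5, 0), "loaded_at": datetime(2024, 5, 1, 9, 6, 10)},
]

# End-to-end latency aggregates every component in the pipeline:
# data architecture, network, application, repository, and so on.
latencies = [(r["loaded_at"] - r["created_at"]).total_seconds() for r in records]
print(f"average data-movement latency: {sum(latencies) / len(latencies):.1f} seconds")
```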

If you’ve been keeping track of your traditional Data Quality dimensions, you may have noticed that one is missing from the ISO 25012 standard: Uniqueness. While it is not explicitly addressed, quantifying duplicate records is an Efficiency quality measure. For example, it penalizes storing the same product multiple times, each with a different identifier or description.
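
One way to quantify this, sketched with invented product data (real duplicate detection usually needs fuzzier matching than an exact name comparison):

```python
from collections import Counter

# Hypothetical product records: the same product stored twice under different identifiers.
products = [
    {"id": "P-001", "name": "Wireless Mouse"},
    {"id": "P-002", "name": "USB Keyboard"},
    {"id": "P-417", "name": "wireless mouse"},   # duplicate of P-001 with a different identifier
]

# Count records whose normalized name appears more than once.
name_counts = Counter(p["name"].strip().lower() for p in products)
duplicates = sum(count - 1 for count in name_counts.values() if count > 1)

print(f"duplicate records: {duplicates} of {len(products)}")
```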

While the actual values for the quality measures can be calculated or observed, the benchmarks against which they are compared are mostly subjective. Someone has to decide whether the space or processing or time consumed was efficient or unnecessary.

When considering efficiency, it is critical to recognize that independently optimizing individual measures can reduce the efficiency of the system as a whole. 

Minimizing the space consumed may increase both the processing required and the overall complexity. The individual metrics all look good, but the system performs poorly and is hard to use.

These trade-offs are similar to those considered in data modeling. A pedantic data modeler may require strict third normal form in the logical database model. In practice, though, implementing that logical schema physically may require more repository space, processing, and network consumption. Plus, as an added bonus, you end up with increased complexity and difficulty for data consumers. 

Finally, I would add one more perspective to consider: the use of summarization, aggregation, pre-calculation, and denormalization, especially for commonly used metrics. The space required may increase, but that would be offset by reduced runtime resource consumption and greater ease of use.
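
A small sketch of the idea with hypothetical sales data: the daily total is calculated once and stored, so consumers read the summary instead of re-aggregating the detail rows on every query.

```python
from collections import defaultdict

# Hypothetical detail rows: one record per sale.
sales = [
    {"day": "2024-05-01", "amount": 19.99},
    {"day": "2024-05-01", "amount": 5.00},
    {"day": "2024-05-02", "amount": 42.50},
]

# Pre-calculated summary: extra space, but consumers no longer pay the
# aggregation cost (or need to know the aggregation logic) at query time.
daily_totals = defaultdict(float)
for sale in sales:
    daily_totals[sale["day"]] += sale["amount"]

print(dict(daily_totals))
```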

Efficiency optimization is both science and art.

That brings us to our final quality characteristic:

Understandability: The degree to which data has attributes that enable it to be read and interpreted by users, and are expressed in appropriate languages, symbols and units in a specific context of use. Some information about data understandability is provided by metadata.

If you’ve read any of my blog articles, or really pretty much anything I’ve written or presented about data and analytics, you know that I believe that Understandability is fundamental.

The standard first addresses the understandability of the symbols, character set, and alphabet used to represent data. This is critical for multi-language support. 

Next comes the understandability of data elements. Ensuring that metadata is defined for all data elements is a measure within the Completeness characteristic. Interestingly, the understandability of that metadata is not measured for all data elements, but only for master data, which is defined as “data held by an organization that describes the entities that are both independent and fundamental for an enterprise that it needs to reference in order to perform its transaction.”

This raises the question: which data elements do we not need to understand?

If the data is being stored, then it is being consumed. And if the data is being consumed, then the consumer needs to understand what it means. (And if the data is not being consumed, then it does not need to be stored.)

What is measured across the board, though, is semantic understandability: the ratio of the number of data values defined in the data dictionary using a common vocabulary (read: business glossary) to the total number of data values defined in the data dictionary. This essentially evaluates the completeness of the business glossary and how well it is being used.
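
As a sketch, assuming a data dictionary export in which each entry records whether its definition uses a term from the business glossary (the field names are invented):

```python
# Hypothetical data dictionary entries; glossary_term is None when the
# definition does not use the common vocabulary.
data_dictionary = [
    {"element": "cust_id", "glossary_term": "Customer"},
    {"element": "ord_dt", "glossary_term": "Order Date"},
    {"element": "flg_x", "glossary_term": None},
]

defined = len(data_dictionary)
defined_with_common_vocabulary = sum(1 for e in data_dictionary if e["glossary_term"])

print(f"semantic understandability: {defined_with_common_vocabulary / defined:.2f}")
```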

Finally, the standard includes several subjective measures of the understandability of data values, data models, and data representations. All are quantified through interviews and questionnaires, or by “counting the number of users’ complaints.”

At its core, Understandability is a measure of metadata quality.

That covers all of the data quality characteristics and the quality measures associated with them. In the next installment of this series, I’ll drill into evaluating compliance. 

