It all seems so easy.

Companies are recognizing that the accuracy of artificial intelligence and machine learning applications is directly related to the quality of the data used to train the models. Obviously, you want to improve the quality of your company’s decisions, so it makes sense to improve the quality of the data that informs those decisions. Cue Data Quality.

Great! 

Even better, everybody knows how to do that: select a data set, examine its contents, and identify any errors and inconsistencies. Myriad tools can run the profiles and report the results. It’s even a great summer intern project.

And companies that have not yet crossed the Data Chasm have already set themselves up for failure.

And nobody has even been asked to fix anything yet.

This small, simple process makes a very large, nontrivial assumption: that we know the expected content of the data.

A simple query can tell you that a data element is populated most of the time and contains the letters ‘A’ through ‘J’ distributed roughly evenly. But that simple query cannot tell you whether those are the values that the data element is supposed to contain. It cannot tell you whether that’s the expected distribution. And it cannot tell you whether the data element must always be populated. Without those details, your Data Quality efforts will be fruitless.
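
To make that concrete, here is a minimal profiling sketch in Python with pandas. The DataFrame and the column name region_code are hypothetical stand-ins for whatever data element you would actually examine; the point is only that the query can report what the data contains, not what it is supposed to contain.

```python
import pandas as pd

# Hypothetical data element; substitute your own data set and column.
df = pd.DataFrame({"region_code": ["A", "B", "C", None, "J", "A", "D"]})

# What a profile can tell you: observed fill rate and value distribution.
fill_rate = 1 - df["region_code"].isna().mean()
distribution = df["region_code"].value_counts(normalize=True, dropna=True)

print(f"Populated: {fill_rate:.0%}")
print(distribution)

# What it cannot tell you: whether 'A' through 'J' is the allowed domain,
# whether a roughly even distribution is expected, or whether the element
# must always be populated. Those facts have to come from somewhere else.
```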

Knowing the expected content of a data element is at least as important as knowing its definition.

This statement might be a little unexpected. Many would assert that the expected content is part of the definition, but I consider it sufficiently critical to deserve special attention. Specifying the expected content requires a level of precision that is too easily glossed over in a descriptive definition, and it is that precision which is required to evaluate Data Quality. (As an added bonus, that same precision better informs application development and test suites, and ultimately results in fewer errors.)
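
As a sketch of what that precision might look like, the expected content can be written down as explicit, checkable rules. The rule values below are purely illustrative assumptions (an allowed domain of 'A' through 'J', no nulls, a crude evenness threshold), not anyone's real specification; the check function and names are hypothetical.

```python
import pandas as pd

# Illustrative expectations for the hypothetical region_code element.
EXPECTED = {
    "allowed_values": set("ABCDEFGHIJ"),  # the domain the element should hold
    "nullable": False,                    # must it always be populated?
    "max_share_per_value": 0.25,          # crude stand-in for "roughly even"
}

def check_expected_content(series: pd.Series, rules: dict) -> list:
    """Compare observed content against the stated expectations."""
    findings = []
    if not rules["nullable"] and series.isna().any():
        findings.append("nulls found, but the element must always be populated")
    unexpected = set(series.dropna()) - rules["allowed_values"]
    if unexpected:
        findings.append(f"values outside the expected domain: {sorted(unexpected)}")
    shares = series.value_counts(normalize=True, dropna=True)
    skewed = shares[shares > rules["max_share_per_value"]]
    if not skewed.empty:
        findings.append(f"distribution more skewed than expected: {skewed.to_dict()}")
    return findings

df = pd.DataFrame({"region_code": ["A", "A", "A", "B", None, "Z"]})
print(check_expected_content(df["region_code"], EXPECTED) or "no findings")
```

Stated this way, the expectations are precise enough to evaluate a profile against, and they are exactly the kind of artifact a development or test team could reuse.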

Some of you might recall Extreme Programming. One of its core tenets was that the two authoritative pieces of documentation for an application are its source code and the test cases used to validate it. In this analogy, the source code corresponds to the data profile and the test cases correspond to the expected content. It’s not that other documentation isn’t useful; it provides context and background that facilitate understanding and use. But at the core there’s what the program does and what the program is supposed to do. What the data is and what the data is supposed to be.

OK. We recognize that the success of our new cutting-edge applications depends on our understanding of the data content, and that a simple data profiling process can provide that understanding. Awesome. You’d think, then, that companies would be profiling their data all over the place. History suggests, though, that this is not the case. Remember the big assumption? Typically, only a very, very small fraction of corporate data is understood well enough to be profiled.

More often than not, data profiling efforts fail because nobody can authoritatively say what the data is supposed to contain. 

Furthermore, if errors and discrepancies are found, they are rarely corrected appropriately.

Let’s take the second point first. Most experts agree that the team responsible for the analytical environment, whether a data warehouse, data lake, or whatever, should not be responsible for cleansing the incoming data. Once you start down that road, you’re going to spend all your time chasing source system errors. You’re going to have to reproduce (or at least validate) the same business rules that were supposed to have been implemented in the source application.

We know that correcting a data problem at its source minimizes the effort required to fix it. All downstream consumers can then benefit from that effort. But when the analytics people show up with a data issue, the response from that development team is often something like, “Thank you very much for your feedback. We will take it under advisement. Put your request into our backlog and we’ll get to it never.” 

After all, development teams produce applications that implement business processes and capabilities. The demand backlog is always growing, and managers are pressured to deliver more capabilities in less time with fewer people. Requirements and features are collected. Code is written. Test cases are evaluated. And when everything checks out, the application is released and the team moves on to the next one.

It may truly be an application defect, but if it was already known, it wasn’t severe enough to correct before release; and if it’s newly discovered, it’s apparently not severe enough to impact operations. After all, product is moving out and revenue is flowing in. This data thing is just a distraction. And given the choice between delivering a new business capability or customer feature and remediating a data problem, we’re going to go with the new features.

But could the development teams or their business partners at least tell us what the data is supposed to contain? The answer is usually the same. And why would they take the time? At best, we’re going to tell the application teams what they already know, or at least believe: that the data is correct. At worst, we’re going to give them more work to do.

The incentives and interests of the application development teams, and often their business partners as well, are completely misaligned with those of information management.

Not only is there no incentive for an application development team to correct data problems, there’s no incentive to even participate in the discovery process. Both only generate more work and distract from business capability delivery.

So, do we throw up our hands and declare defeat? Of course not. Next time we’ll look at what we can do to start to bridge the Data Chasm.

This is the second in a series of articles that explores the question of why we continue to see overwhelming numbers of analytics, artificial intelligence, machine learning, information management, and data warehouse project failures despite the equally overwhelming availability of resources, references, processes, SMEs, and tools…and what can be done about it.