Maybe we’ve been going about this all wrong.

It seems that everybody in the information management field continually emphasizes the importance of Data Quality, and everybody not in the information management field couldn’t care less. Maybe the latter group has it right.

After all, product is flowing out and revenue is flowing in. If the data is good enough to run the business, shouldn’t it be good enough for analytics? And weren’t you the one who said that the competition is moving quickly and we have to keep pace? 

The business needs to have access to data as quickly as possible.

This applies both to the availability of new data subject areas and to the visibility of new data as it is created or received. There’s no time to spend writing data element definitions, let alone figuring out what they’re supposed to contain. That would take minutes, and our project backlog extends into the next county. And as a friend once wrote, “business moves too fast for warm fuzzies.”

OK. So, now the analysts have the data. At this point we’re not so much worried about absolute Data Quality as about whether the data is fit for our purpose. But how are you going to know that if you don’t have any data quality metrics and don’t know what the data is supposed to contain?

Maybe it doesn’t matter. A Vice President once told me that it didn’t matter to him whether the data used to generate his reports was accurate or not because he adjusted the numbers in his head anyway. He just needed them to be “directionally correct.” After all, what are the consequences of generating reports or training models with inaccurate data? Well, it depends on the required accuracy of the report or model. In some cases, “directionally correct” may be sufficient. But predictive models that more often than not give customers incorrect projections can be embarrassing, and generating incorrect financial reports is unacceptable (and illegal). It’s all a standard risk mitigation analysis.

Weigh the potential consequences of error in the analyses and models against the cost of knowing the quality of the data.

A side benefit of not worrying about data quality is that you don’t have to fix any errors you don’t know about.

A second factor to consider is whether you will be the one held accountable for errors in the data, reports, or models. You run your analysis. Get your result. Do whatever it is you’re going to do with it. Next. The world has moved on. We have to keep looking forward and moving forward. There’s no time to stop and quibble over this data or that value or whatever. Yes, the potential consequences to the company could have been catastrophic, but everything turned out OK.

Our risk mitigation analysis must incorporate additional factors: the likelihood of having to deal with any adverse consequences and their potential magnitude.
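To make that weighing concrete, here is a minimal back-of-the-envelope sketch of the expected-loss comparison the analysis implies. All of the numbers, probabilities, and names are hypothetical placeholders, not figures from any real assessment:

```python
# Back-of-the-envelope risk comparison (all numbers are illustrative assumptions).
def expected_loss(p_adverse_outcome: float, magnitude: float) -> float:
    """Expected cost of acting on unverified data."""
    return p_adverse_outcome * magnitude

# Hypothetical scenario: a model or report built on unchecked data.
cost_of_quality_assessment = 25_000   # profiling, metrics, remediation
loss_if_wrong = 2_000_000             # fines, rework, lost customers
p_wrong = 0.05                        # chance the bad data actually bites you

if expected_loss(p_wrong, loss_if_wrong) > cost_of_quality_assessment:
    print("Knowing your data quality is the cheaper option.")
else:
    print("Maybe 'directionally correct' really is good enough.")
```

The point of the sketch is only that both sides of the comparison can be estimated; in practice the hard part is being honest about the probability and the magnitude.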

Are you going to be promoted or rotated out of your position before anything bad happens? And if something bad does happen, how often have you been advised to “ask for forgiveness”? After all, when the problem becomes evident, you will mobilize your entire team to work nonstop nights and weekends and holidays to fix it. You’ll probably get promoted because of these heroic efforts.

Besides, how bad is our data, really? I haven’t heard about any problems with it. That’s probably because broad-based data quality statistics are very hard to come by. But a 2017 study published in Harvard Business Review found that “on average 47% of newly-created data records have at least one critical (e.g., work-impacting) error.” And these were just eyeball checks for obvious errors in critical data elements.

The legal profession provides an apt analogy: spoliation of evidence. This occurs when someone with an obligation to preserve evidence fails to do so. When that happens, the jury may be instructed to assume that the destroyed evidence would have been unfavorable to the party that had the obligation to preserve it. For us, this means that:

In the absence of evidence to the contrary, all data should be assumed to be incorrect.

I think it says a lot about the abysmal state of corporate data quality that analytics workbenches now incorporate rudimentary data profiling into their standard workflows. Get some value ranges. Check some joins. Filter out obviously anomalous values. Close enough. 
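A rudimentary pass of that kind might look like the sketch below, using pandas. The file names and column names are hypothetical stand-ins for whatever your workbench or extract actually contains; this is the “close enough” check, not a substitute for real data quality metrics:

```python
import pandas as pd

# Hypothetical extracts; file and column names are illustrative only.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")

# Get some value ranges.
print(orders[["order_amount", "quantity"]].describe())

# Check some joins: how many orders reference a customer we don't know about?
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(f"{len(orphans)} orders with no matching customer")

# Filter out obviously anomalous values.
plausible = orders[(orders["order_amount"] > 0) & (orders["order_amount"] < 1_000_000)]
print(f"Dropped {len(orders) - len(plausible)} implausible rows. Close enough?")
```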

But for some situations, acceptable is OK.

But for some analyses or models the data only needs to be mostly right. 

But how do you know how right, or not right, your data really is? Or do you not want to know?

Cover image, “See no evil speak no evil hear no evil” by Japanexperterna at flickr.com. Copyright © 2014, some rights reserved.