We are at the threshold of the most significant changes in information management, data governance, and analytics since the inventions of the relational database and SQL.

At their core, these areas have changed very little over the past thirty years. Most advances have been beneficiaries of Moore’s Law: faster processing, denser storage, and greater bandwidth. The basic analytics architecture, though, remains the same. Source systems move data into a centralized repository (or set of repositories), which provides data to downstream data marts and consumers. It doesn’t matter whether it’s a single enterprise data warehouse in the data center or a multi-technology ecosystem in the cloud, batch or streaming. It looks the same.

This is about to change. It’s already starting.

Recent advances in Artificial Intelligence are driving real information management change.

Generative AI for Data Management entered the Gartner Hype Cycle for Data Management in 2023. The following year it had moved up slightly but was still the first item on the Innovation Trigger. The anticipated time to Plateau was given as 5-10 years, but I expect it to be sooner than that.

In this article and the next, I’ll touch briefly on a couple of areas where the impact of AI on information management is already being seen, or where I expect to see it shortly.

Data Quality

This one is everywhere. Companies are discovering that poor Data Quality, and the poor Data Governance that allowed it, result in underperforming AI models. I illustrated the effect of Data Quality on AI model accuracy in an earlier blog article.

The recognition that high-quality data is needed to train AI models is largely driving the resurgence of interest in Data Quality and Data Governance.

Perhaps leadership didn’t know to ask the question, or simply assumed that their company’s data was clean; or at least clean enough to use for this shiny new AI stuff. After all, the company runs on that data. Product is moving and money is flowing. Perhaps leadership suspected that the data had problems but didn’t want to know about it. Plausible deniability. Again, the company is running fine. Don’t rock the boat. The development teams are busy enough already. But whether the ignorance was accidental or intentional, the spotlight is on the data. Expectations of data correctness are greater today than ever before, and will continue to increase.

Data Quality analysis requires understanding expected data content and observing actual data content. It’s only a matter of time before AI is applied to both ends of the Data Quality equation, but I’m not sure it’s necessary, at least not directly. It’s ironic: AI is driving the overwhelming majority of the present interest in Data Quality, but Data Quality scoring, pattern identification, and anomaly detection don’t require it. Just look at what’s there. Sum and Group By. Basic statistics. You can assign the task to a summer intern. Start now if you haven’t already.
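The “Sum and Group By” level of profiling really is this simple. A minimal sketch, using a made-up state-code column as the data (the values and the placeholder conventions are my assumptions, not from any real system):

```python
from collections import Counter

# Hypothetical "state" column pulled from an orders table.
values = ["NY", "CA", "ny", "CA", None, "", "TX", "CA", "N/A", "NY"]

total = len(values)
# Count missing values and common placeholder strings together.
nulls = sum(1 for v in values if v in (None, "", "N/A"))
# Group By with a count: the frequency of each real value.
freq = Counter(v for v in values if v not in (None, "", "N/A"))

print(f"rows: {total}")
print(f"null/placeholder rate: {nulls / total:.0%}")
for value, count in freq.most_common():
    print(f"  {value!r}: {count}")
```

Even this much surfaces the casing inconsistency (“ny” vs. “NY”) and the placeholder values that a downstream model would otherwise swallow silently.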

AI could be applied to cleansing, or at least recommending changes, but the Data Owners will want to review the recommended changes before they’re made, at which point they become deterministic.

One aspect of Data Quality where AI can be useful is metadata collection: definitions and expected content. I’ll talk about that next.

Metadata Collection

Everybody knows they need to do it. Nobody likes doing it. So, nobody does it. Or at least comparatively few do. As a result, we have an epidemic of business decisions resting on data whose meaning and expected content nobody knows. It’s the primary barrier to making your company’s data and analytics practice a competitive differentiator. It’s the primary difference between the 80% of AI projects that underperform and the 20% that succeed.

The Holy Grail of metadata collection is extracting meaning from program code: data structures and entities, data elements, functionality, and lineage. 

For me, this is one of the most potentially interesting and impactful applications of AI to information management. I’ve tried it, and it works. I loaded an old C program, uncommented but with reasonably descriptive variable names, into ChatGPT, and it figured out what the program was doing, the purpose of each function, and a description of each variable.
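The mechanics of this experiment are just “source code in, prompt out.” A minimal sketch of the prompt-assembly step; the prompt wording, the function name, and the sample C fragment are all my own illustrative assumptions, and the actual call to whatever LLM API your team uses is deliberately left out:

```python
def build_metadata_prompt(source_code: str) -> str:
    """Assemble a prompt asking an LLM to describe a program's metadata.

    Illustrative only: tune the wording for your model, then send the
    result through your LLM API of choice.
    """
    return (
        "Analyze the following program. State its overall purpose, "
        "the purpose of each function, and a one-line description of "
        "each variable's likely meaning and expected content.\n\n"
        "```\n" + source_code + "\n```"
    )

# A hypothetical uncommented C fragment with descriptive names.
c_source = (
    "double monthly_interest(double principal, double annual_rate) "
    "{ return principal * annual_rate / 12.0; }"
)
print(build_metadata_prompt(c_source))
```

The interesting part is not the plumbing but the review loop: the model’s descriptions become candidate metadata for a human curator, not the system of record.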

Eventually this capability will be used like the other code analysis tools development teams already run as part of the CI/CD pipeline: one set of tools to look for code defects, another to extract and curate metadata. Someone will still have to review the results, but this gets us a long way there.

The other possibility is to analyze the running application to determine expected content. “That’s cheating!” you say, “You’re just looking at the application data and saying that’s the expected content.” Yes, that would be cheating. The idea, though, is to derive meaning from context. Is the data content expected or unexpected within that context? Again, someone will still have to review the results, but this moves a long way forward from the starting point of not doing it at all.
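One way to derive “expected content” from context rather than simply echoing the data back is to infer the dominant shape of a column’s values and flag departures from it. A minimal sketch, assuming a hypothetical US ZIP code column (the sample values are invented):

```python
import re
from collections import Counter

def signature(value: str) -> str:
    """Collapse a value into its shape: digits become 9, letters become A."""
    s = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", s)

# Hypothetical ZIP code column observed in a running application.
zips = ["10001", "94105", "60601", "7302", "94105-1420", "ABCDE"]

# The most common shape becomes the inferred "expected content."
shapes = Counter(signature(z) for z in zips)
dominant, _ = shapes.most_common(1)[0]
anomalies = [z for z in zips if signature(z) != dominant]
print(f"expected shape: {dominant}, anomalies: {anomalies}")
```

Here the truncated ZIP, the ZIP+4, and the alphabetic junk value all fall out as candidates for review; whether each is wrong or just a second legitimate format is exactly the judgment the human reviewer supplies.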

Next week I’ll continue with Data Modeling and Analytics, and explore the democratization of information management activities through the use of AI.

