This is the second installment of a two-part series about the impact of AI on information management. The first covered Data Quality and Metadata Collection.

Data Modeling

Nobody at your company is more passionate about understanding the data than your Data Modelers. Unfortunately, their work products are too often admired by other Data Modelers but largely ignored by everyone else. Yet understanding the data entities and the relationships between them is part of understanding the data. Those relationships are the threads that make up the Data Fabric.

In many organizations, these folks are considered a luxury item and are often jettisoned or reassigned when budgets get tight. This doesn’t have to be the case. Resources, both old and new, can be leveraged to increase the efficiency of your existing modelers. 

Nobody should have to develop a data model from scratch. 

Don’t start over. Leverage resources that you already have at your disposal.

Your company almost certainly has a library of models lying around from various past initiatives. Start there. Company or organization-specific business knowledge will have already been integrated into them. No need to plow the same ground again.

Industry-focused models have been around for decades. Mature models for finance, transportation, telecommunications, retail, and many other industries can be found online or purchased. Because they have been developed in conjunction with a cross-section of companies within each industry, they are almost always very well documented. For the same reason, though, they represent something of a lowest common denominator, trying to be applicable to the widest possible range of organizations. Customization will be necessary.

Large Language Models can already ingest information about the company and/or industry and spit out a model. I recently asked ChatGPT to generate a logical data model for a passenger airline reservation system. In about ten seconds it produced a nicely formatted and documented set of entities, attributes, and relationships. It was mostly right. Mostly.
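
To make that concrete, here is a minimal sketch of that kind of request using the OpenAI Python client; the model name, prompt wording, and output handling are illustrative assumptions, not a recipe.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Generate a logical data model for a passenger airline reservation "
    "system. List the entities, their key attributes, and the "
    "relationships between entities, including cardinality."
)

# Ask the model for a first-draft logical data model.
response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model will do
    messages=[{"role": "user", "content": prompt}],
)

# Whatever comes back is a draft for a modeler to review, not a finished model.
print(response.choices[0].message.content)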

None of these resources, not even AI, will get you all the way there. Eighty percent of the way there, maybe, but not all the way. The deficiencies are apparent if you know the business and you know what you’re looking for.

Company-specific and domain-specific knowledge and context are still needed.

John Ladley and I talked about this with Laura Madsen in the Rock Bottom Data Feed podcast episode, The Fuss About Data Governance Disruption. This company- and domain-specific knowledge is the “secret sauce” that differentiates organizations. Instead of a team of less-experienced modelers whose work is reviewed by a senior modeler, the large language model becomes that team, and the people focus on the details and idiosyncrasies of their organization and their business.

Analytics

The pace of natural language understanding progress has been pretty constant for many years. Recently, Large Language Models have produced incredible improvements.

Large Language Models can be applied in analytics in a couple of different ways. One is to generate the answer solely from the LLM. Start by ingesting your corporate information into the LLM as context. Then ask it a question directly and it generates an answer. Hopefully the correct answer. But would you trust the answer? Associative memories are not the most reliable for database-style lookups. Imagine ingesting all of the company’s transactions and then asking for the total net revenue for a particular customer. Why would you do that? Just use a database. I discussed this scenario last time.
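
For comparison, here is a minimal sketch of the “just use a database” alternative, with SQLite standing in for whatever repository you actually use; the table, columns, and sample rows are hypothetical.

import sqlite3

# A tiny stand-in transactions table loaded with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id TEXT, amount NUMERIC, refunded NUMERIC)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("C001", 1200.00, 0.00), ("C001", 450.00, 50.00), ("C002", 800.00, 0.00)],
)

# Total net revenue for one customer is a single aggregate query:
# exact, repeatable, and auditable.
total = conn.execute(
    "SELECT SUM(amount - refunded) FROM transactions WHERE customer_id = ?",
    ("C001",),
).fetchone()[0]

print(total)  # 1600.0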

Another is for the LLM to generate a SQL query that retrieves the answer from a database or other repository. Here, we begin by ingesting a database structure and metadata. The LLM could be asked the same question, but in this case it generates the SQL query that interrogates the database. Maybe it’ll even run the query for you. The critical difference is that the data from which the results are produced live in a database (or other repository), not in an associative memory. Of course, it’s also important to review the SQL statement itself to confirm that the LLM-generated query is correct.
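
A minimal sketch of this second approach might look like the following, again using the OpenAI Python client; the schema, question, and prompt wording are hypothetical.

from openai import OpenAI

client = OpenAI()

# The schema and metadata become the context; the data itself stays in the database.
schema_ddl = """
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE transactions (txn_id TEXT PRIMARY KEY, customer_id TEXT,
                           amount NUMERIC, refunded NUMERIC, txn_date DATE);
"""

question = "What was the total net revenue for customer C001?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Translate the user's question into a SQL query "
                    "for this schema:\n" + schema_ddl},
        {"role": "user", "content": question},
    ],
)

# The model returns SQL to be reviewed and run against the database,
# not an answer pulled from its own associative memory.
generated_sql = response.choices[0].message.content
print(generated_sql)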

In this scenario, the LLM is a translator and interpreter, discerning what you’re asking from your prompt.

This has long been my vision for analytics interfaces. More than twenty years ago, I proposed a data warehouse interface that was basically a Google search box.

I recently ran this experiment, too, ingesting a database schema into ChatGPT and asking it questions. It was able to handle straightforward queries easily, but as the requests got increasingly complicated, the resulting queries got increasingly incorrect.

Just as AI can only get your logical data models eighty percent of the way there, it can only get your SQL queries that far, too. You still have to understand SQL to confirm and troubleshoot. You still need an understanding of analytical functions and AI algorithms: how to use them, when to use them, what the results mean, and how they can be misused.
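
One practical way to confirm an LLM-generated query is sketched below: run it against a small fixture with hand-checked answers and compare the result to a reference query you trust. The table, data, and queries are hypothetical.

import sqlite3

def check_generated_sql(generated_sql: str) -> bool:
    # A small in-memory fixture with hand-checked data.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE transactions (customer_id TEXT, amount NUMERIC, refunded NUMERIC);
        INSERT INTO transactions VALUES ('C001', 1200, 0), ('C001', 450, 50), ('C002', 800, 0);
    """)

    # A reference query written and verified by someone who knows SQL and the business.
    reference_sql = (
        "SELECT SUM(amount - refunded) FROM transactions "
        "WHERE customer_id = 'C001'"
    )

    expected = conn.execute(reference_sql).fetchone()[0]
    actual = conn.execute(generated_sql).fetchone()[0]
    return actual == expected  # a mismatch means the query needs troubleshooting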

The combination of natural language query and automatic code generation can accelerate ETL development and Data Fabric implementation. I’ve tried this one, too, with similar results. The LLM takes you most of the way, but you still have to validate the application to carry it across the finish line.
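
The same validation habit applies to generated ETL code. As a hypothetical sketch: feed the generated transformation a small, hand-checked input and assert that the output matches what the business rules say it should.

# Imagine this function body was produced by the LLM from a natural
# language description of the transformation.
def generated_transform(rows):
    return [
        {"customer_id": r["customer_id"], "net": r["amount"] - r["refunded"]}
        for r in rows
    ]

def test_generated_transform():
    # Hand-checked input and expected output, however small.
    rows = [{"customer_id": "C001", "amount": 450.0, "refunded": 50.0}]
    expected = [{"customer_id": "C001", "net": 400.0}]
    assert generated_transform(rows) == expected

test_generated_transform()
print("generated transform matches the hand-checked expectation")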

Democratization

In the beginning, reporting and analytics required arcane data repository and mainframe programming expertise. The few employees with those skills were consolidated into an MIS department that received data requests, developed applications, produced results, and returned reports. In the 1990s and 2000s, the Data Warehouse democratized corporate information access by making data available in a central repository, accessible through SQL queries and tools that helped construct those queries. SQL and BI Query were much easier to learn than COBOL.

Over time, as a technology matures, more and more people have access to its benefits and the barrier to entry is lowered.

That continues today. Many of the data and analytics activities that had previously required specialized training, experience, and expertise have now been democratized. Data repositories and tools continue to become more and more intuitive. More and more people can now extract value from corporate information resources.

Remember Data Science unicorns? Those rare individuals who were at the same time Ph.D. statisticians, domain experts, skilled communicators, and ninja application developers. About a decade ago, it seemed that every company was looking for them and every college was establishing a Data Science concentration, certificate, or degree program. When it became apparent that very few of those people actually exist, most companies moved toward Data Science teams having those skills in aggregate. Now, AI is democratizing Data Science even further.

Unicorns are no longer required, but business knowledge is critical. Understanding the data is critical.

As the level of user sophistication decreases, users become more likely to misinterpret or misuse data, especially data that is not well understood. More hand-holding is also needed. A baseline level of business knowledge and resource utilization proficiency is required, but that is only a start.

What happens when complexity or novelty increases? What about when troubleshooting or fine-tuning is required? You need more skill than baseline. Oftentimes much more.

Anyone can take pictures, shoot videos, and record audio with their smartphone. Do you color correct and color grade your videos? Do you equalize and normalize your audio recordings? I’m sure there’s somebody on YouTube who does all of their content creation on their phone, but the difference between amateur and professional is often obvious.

The point is that democratization doesn’t solely mean eliminating jobs. The people will still be necessary. It’s about evolving roles. It’s about the people understanding the data and the business and automating as much of the implementation as possible. This maximizes the value that the organization gets from both the employees and the technology. Working together.

Image Credit: Don DeBold, “NEC 2203 Mainframe Computer,” 2011. Flickr.com. Some rights reserved.

