For those who don’t want to click from page to page, here are all five articles in the Data Chasm series. In it, we explore the question of why we continue to see overwhelming numbers of analytics, artificial intelligence, machine learning, information management, and data warehouse project failures despite the equally overwhelming availability of resources, references, processes, SMEs, and tools…and what can be done about it.
Part 1: Groundhog Day
Data is back in the corporate limelight. Again. Seems we’ve been here before. In years past it’s been data warehousing, metadata, big data, and advanced analytics. “Data Driven” has been the “new” buzzword for more than a quarter-century. Now it’s artificial intelligence and machine learning. Management is recognizing that Data Quality is required to produce quality AI and ML models. For data professionals, this is another opportunity to leverage executive attention to drive Information Management progress.
So, we dutifully revisit our Data Governance and Data Quality plans. We get some new books and read some new articles. We package up a comprehensive step-by-step method, get head nods, and start to implement. But problems emerge almost immediately. It’s too hard. It takes too long. We don’t have the resources. So, we ask the boss to help clear the path. And instead of getting that help, we’re thanked for the good work and told that right now might not be the right time. The plans return to the shelf and the project team is dispersed. Again. We know the benefits that the effort could have. Again. We know the issues that could be resolved. Again. We know that development could be accelerated and errors reduced, if only…
Again.
This lack of progress is certainly not the result of a lack of knowledge or resources. So many experts, instructors, mentors, and practitioners willing to share. We have professional organizations, references, vendors, software products, process templates, subject matter experts, consultancies, dozens of books, thousands of articles and white papers, and innumerable PowerPoint presentations and strategic plans. The technology is getting better. The processes are getting better. AI is being applied.
We know what to do and we know how to do it. Most everybody understands that it’s important and recognizes the value. You would think that Information Management would be thriving everywhere. Yet, that’s not the case. We’re still fighting the same battles and we’re still making the same arguments twenty-five years later. And we’re still seeing the same failure rates.
Why?
Before we can sustainably realize the benefits of Information Management, we must first have a basic understanding of the data. And the most basic understanding requires that we know two things:
1. What the data element means.
2. The values that it’s supposed to contain.
In other words, its definition and its expected content. Without those, you can’t do anything else, or at least not easily, sustainably, or at scale.
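As a concrete illustration, here is a minimal sketch of a data element stated precisely enough to act on. The account table, its account_status column, and the status codes are all invented for the example:

```sql
-- Hypothetical data element, stated precisely enough to act on.
-- Definition:       account_status is the account's current lifecycle stage.
-- Expected content: always populated; exactly one of four agreed codes.
CREATE TABLE account (
    account_id     INTEGER     NOT NULL PRIMARY KEY,
    account_status VARCHAR(10) NOT NULL
        CHECK (account_status IN ('PROSPECT', 'ACTIVE', 'LAPSED', 'CLOSED'))
);
```

Whether a rule like this lives in a database constraint, a test case, or a metadata repository matters less than that it is written down somewhere unambiguous.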
Too often we bypass Data Understanding and jump directly to Data Quality or Data Security or Data Analysis. But Data Quality requires a standard against which to measure variance in the actual data content. Data Security and Data Privacy processes assume that enough is known about the data to assess risk. Artificial intelligence and machine learning are two of today’s most exciting and potentially impactful technologies, yet both take for granted a foundation of Data Understanding that is often not being built. And without that foundation, these efforts are likely to fail. Models trained on misunderstood data will yield unexpected, potentially misleading, and likely incorrect results.
In short, the benefits of Information Management lie at the far side of a chasm which many organizations have yet to cross. The results can be seen in the high failure rates of data warehouse, analytics, and AI/ML projects. It is a spectacularly poor track record which is too easily accepted as “normal.”
I do not believe that we must accept that.
Data Understanding is the key to success. It’s the fuel that accelerates application delivery, business and operational analytics, and AI/ML model development. It enables faster responses to changing market conditions. It facilitates communication between development teams and business units.
A company’s most valuable asset is its understanding of its data.
In this series we will explore the challenges of implementing and widely adopting Information Management despite the comprehensive foundational work and ample reference resources. We will examine several barriers to Data Understanding, along with recommendations for building bridges across the Data Chasm.
Part 2: Conflicting Interests
It all seems so easy.
Companies are recognizing that the accuracy of artificial intelligence and machine learning applications is directly related to the quality of the data used to train the models. Obviously, you want to improve the quality of your company’s decisions, so it makes sense to improve the quality of the data that informs those decisions. Cue Data Quality.
Great!
Even better, everybody knows how to do that: select a data set, examine its contents, and identify any errors and inconsistencies. Myriad tools can run the profiles and report the results. It’s even a great summer intern project.
And companies that have not yet crossed the Data Chasm have already set themselves up for failure.
And nobody has even been asked to fix anything yet.
This small, simple process makes a very large, nontrivial assumption: that we know the expected content of the data.
A simple query can tell you that a data element is populated most of the time and contains the letters ‘A’ through ‘J’ distributed roughly evenly. But that simple query cannot tell you whether those are the values that the data element is supposed to contain. It cannot tell you whether that’s the expected distribution. And it cannot tell you whether the data element must always be populated. Without those details, your Data Quality efforts will be fruitless.
Knowing the expected content of a data element is at least as important as knowing its definition.
This statement might be a little unexpected. Many would assert that the expected content is part of the definition, but I consider it sufficiently critical that it deserves special attention. Specifying the expected content requires a level of precision that is too easily glossed over in a descriptive definition. It is that precision which is required to evaluate Data Quality. (As an added bonus, it is that precision which also better informs application development and test suites, and ultimately results in fewer errors.)
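To illustrate what that precision buys, here is a minimal sketch that reuses the hypothetical ‘A’ through ‘J’ element from above (the enrollment table and grade_code column are invented): once an expected-content rule has been agreed upon, measuring variance against it is a single query.

```sql
-- Variance against a declared expected-content rule (hypothetical names).
-- The rule: grade_code is always populated and contains 'A' through 'J'.
-- Every row this query returns is, by definition, a Data Quality exception.
SELECT grade_code,
       COUNT(*) AS occurrences
FROM   enrollment
WHERE  grade_code IS NULL
   OR  grade_code NOT IN ('A','B','C','D','E','F','G','H','I','J')
GROUP  BY grade_code;
```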
Some of you might recall Extreme Programming. One of its core tenets was that the two authoritative pieces of documentation for an application are its source code and the test cases used to validate it. In this analogy, the source code can be thought of as the data profile and the test cases can be thought of as the expected content. It’s not that other documentation isn’t useful. It can provide context and background to help facilitate understanding and utilization, but at the core there’s what the program does and what the program is supposed to do. What the data is and what the data is supposed to be.
OK. We recognize that the success of our new cutting-edge applications depends upon our understanding of the data content, and we have a simple data profiling process to verify it. Awesome. You’d think, then, that companies would be profiling their data all over the place. History suggests, though, that this is not the case. Remember the big assumption? Typically, only a very, very small fraction of corporate data is understood well enough to be profiled.
More often than not, data profiling efforts fail because nobody can authoritatively say what the data is supposed to contain.
Furthermore, if errors and discrepancies are found, they are very rarely appropriately corrected.
Let’s start with the second point first. Most experts agree that the team responsible for the analytical environment, whether a data warehouse, data lake, or whatever, should not be responsible for cleansing the incoming data. Once you start down that road, you’re going to spend all your time chasing source system errors. You’re going to have to reproduce (or at least validate) the same business rules that were supposed to have been implemented in the source application.
We know that correcting a data problem at its source minimizes the effort required to fix it. All downstream consumers can then benefit from that effort. But when the analytics people show up with a data issue, the response from that development team is often something like, “Thank you very much for your feedback. We will take it under advisement. Put your request into our backlog and we’ll get to it never.”
After all, development teams produce applications that implement business processes and capabilities. The demand backlog is always growing, and managers are pressured to deliver more capabilities in less time with fewer people. Requirements and features are collected. Code is written. Test cases are evaluated. And when everything checks out, the application is released and the team moves on to the next one.
It may truly be an application defect, but if it was known, it wasn’t severe enough to correct before release; and if it is new, it’s apparently not severe enough to impact operations. After all, product is moving out and revenue is flowing in. This data thing is just a distraction. And given the choice between delivering a new business capability or customer feature and remediating a data problem, we’re going to go with the new features.
But could the development teams or their business partners at least tell us what the data is supposed to contain? We get the same answer there, too. And why would they even take the time? At best we’re going to tell the application teams what they already know, or at least believe: that the data is correct. At worst, we’re going to give them more work to do.
The incentives and interests of the application development teams, and often their business partners as well, are completely misaligned with information management.
Not only is there no incentive for an application development team to correct data problems, there’s no incentive to even participate in the discovery process. Both only generate more work and distract from business capability delivery.
So, do we throw up our hands and declare defeat? Of course not. We start to build the bridge across the Data Chasm.
Part 3: Square One
This tale may sound familiar to companies that have not yet crossed the Data Chasm (and maybe even to some that have).
A project team was modernizing a major operational system. One that had been written decades earlier. One where nearly all the developers and subject matter experts with knowledge of the applications, data, and business rules had long since left the company or retired. One where the source data had not been clearly documented.
Nobody knew where the data came from, what it meant, what it contained, or how it was calculated. All the team could do was to try to reverse-engineer the data through the applications that generated it. Such a Data Forensics exercise is tedious, difficult, often aggravating, and always a waste of time.
What message do you suppose the team would send to the previous generation of developers if they could go back in time?
They would impress upon them the importance of documenting the data. (Besides making the present-day modernization faster and easier, it would also have made that data more useful and usable, and thus more valuable, in the interim.)
Yet, despite this hard-earned insight and experience, this generation of developers turned out to be no better at documenting their data than the previous one. They are doing the same thing to the next generation of operational system developers that was done to them.
The cycle of data neglect continues.
Information management is sometimes sold as an “investment”: put in a little extra work today and reap the dividends for years to come. And I do believe that development and business teams recognize the benefits. Yet, as release dates approach, information management is still the first thing jettisoned in the interest of timelines and deliverables.
We therefore cannot simply appeal to logic: the development and business teams already recognize the benefits, and that still isn’t enough to motivate action.
Demand exists for information management, or at least the products of information management. It’s just that nobody wants to do it.
And that’s oddly encouraging.
We see this pattern elsewhere. You want to perform the song, but you don’t want to practice the instrument. You want to lose weight, but you don’t want to change your eating habits. You want to run the race, but you don’t want to train. You want the application to function properly, but you don’t want to spend a lot of time and resources testing.
Speaking of testing, why do we spend so much time and resources on testing?
Nearly all developers spend some non-trivial portion of their time testing, and application teams often have dedicated testers. Some companies have groups or even entire organizations devoted to finding errors in developers’ work products. Testers don’t deliver new features or business capabilities. Seems to me it would be faster and more cost-effective to lose the testers and hire better developers. Ones who don’t make so many errors.
Obviously, I’m being facetious. Companies recognize that it is important for applications to function properly, and no matter how skilled the developers, testing is always necessary. And much of testing is focused on data input and output. Let’s hang on to that one and we’ll come back to it later.
Testing also requires that resources be allocated to the function. As with a healthy diet, exercise, practice, and training, I’m not sure that many people really love testing (although I do know some who do). Why do any of it? Because at some point someone decided that it was necessary and that sufficiently bad things would happen if it wasn’t done. Well, sufficiently bad things are happening today because we don’t understand our data properly (remember those 70–95% project failure rates), and if something doesn’t change, they’re going to get worse.
The first step is to recognize that somebody has to start doing something they’re not already doing.
At companies, especially large ones, one of the hardest things to do is something that’s not already being done. Especially when it doesn’t come with incremental headcount. It usually takes a visionary executive to step up and volunteer to do it.
I’m going to assume that you don’t have a visionary executive (when it comes to data, let’s be clear) or a management mandate. What can be done?
When I was in graduate school, a friend had an issue of Cosmopolitan (I think it was) on the coffee table in her living room. As I recall it was roughly a thousand pages of clothing and perfume advertisements, but the cover teaser for an article caught my attention. The title was something like “How to Get Your Partner to Do What You Want Him to Do.” I considered it reconnaissance behind enemy lines. The gist of the article was that you can’t change your partner to do what you want him to do. All you can do is to change your own behavior and perhaps he will change his in response—but you can’t even count on that.
So, if you don’t have a management mandate and you can’t get the development and business teams to engage, then you have to be the one that does something that you’re not already doing. If you don’t, it is unreasonable to expect that anyone else will.
If nothing else, start profiling some data … any data … and start communicating the results. Today.
There’s no reason to delay. You don’t need to spend a lot of time. You don’t need anything fancy or automated or purchased. That can come later once you have some traction. Pick frequently used tables and critical data elements. Write a program or script or SQL query that does a COUNT and GROUP BY. Publish the results on your departmental website or in your quarterly newsletter. Report them during your next project status meeting and publish them with the minutes.
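A minimal starting point might look something like the following pair of queries, with the table and column names as placeholders for whatever you pick:

```sql
-- 1. Population: how many rows are there, and how often is the element populated?
--    (COUNT(grade_code) counts only the non-null values.)
SELECT COUNT(*)          AS total_rows,
       COUNT(grade_code) AS populated_rows
FROM   enrollment;

-- 2. Distribution: which values occur, and how often?
SELECT grade_code,
       COUNT(*) AS occurrences
FROM   enrollment
GROUP  BY grade_code
ORDER  BY occurrences DESC;
```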
Storytelling with data is extremely important. We’ll talk more about that in a future article. But for now, this is where you have to start telling the story of your data.
Start generating profile data and asking questions. You may discover that everything is great, and that’s great! But experience suggests that you will quickly find something interesting.
Welcome to Square One.
Part 4: Orienting Outward
I guess I should say before continuing that the conundrum presented in this series might not necessarily describe your situation. Many companies already have robust information management programs and understand their data very well. These are also typically the companies that deftly adjust to changing market conditions, quickly release new business capabilities, and effectively monetize their data both internally and externally. Your company may be one of those that has already crossed the Data Chasm. If so, then great! You’re welcome to continue with us, and I’d be interested in hearing about your experience getting to that point.
Moving on.
Many of our teams have already come to the realization that we must proceed on our own. We have already started profiling our data, or documenting our business terms, or organizing our data models. Then what? We continue to try to involve the development teams. We continue to share the results of our data content analyses. We continue to create beautiful data model cubicle posters. And we continue to get the same lack of engagement. And it gets frustrating. It seems we’re right back to where we started 3,200 words ago.
So, we start performing for ourselves.
We form closed groups, usually consisting of data architects, data modelers, DBAs, and data analysts. Sometimes we call them Competency Centers or Capability Centers or Centers of Excellence (CoE). We profile data or document business terms or organize data models. We do a ton of really good work. I think that most data modelers can recall a time when they worked on a model, extracted the entities from the business requirements, debated and decided on the primary keys, associated the attributes, and tied up all the foreign keys so that everything looked and felt just right. It was a work of art. The model was then presented at the next Information Management CoE meeting, and everyone clapped at the end.
Maybe the model was posted on an internal website. Maybe it was printed on E1 plotter paper and hung on the wall. Maybe it was occasionally referenced, but most likely it languished on a digital bookshelf with a million other dusty, unused files. Everyone outside the room could not have cared less.
Information management professionals have been ignored for so long that we seek validation in each other.
But as long as we are focused inwardly within our own echo chamber, we will remain stuck in a cycle of quality deliverables and corporate irrelevance.
We must turn our efforts outward.
But didn’t you just say that the development teams and their business partners won’t engage? That’s true. The team might not engage, but you can find individuals who recognize, or can be convinced of, the benefits of data understanding to their own careers. Sometimes it just takes one.
Back around 1999, a Marketing executive recognized the benefits of having data at his fingertips when he was in senior leadership meetings. He worked with his analytics team to generate reports and extracts that he viewed on his Palm Pilot. That’s right, Palm Pilot. Such limited mobile data availability seems primitive today, but at a time when decisions were made mostly on instinct and experience it was a revolutionary concept. In those meetings he was able to show data that supported his conclusions, ideas, and plans, and more often than not he won the support of his boss if not also his peers. Other executives saw the benefits of supporting their conclusions, ideas, and plans with data and so they worked with their analytics teams to generate reports and extracts for them to view on their Palm Pilots. There was no corporate Palm Pilot mandate. In fact, very few employees outside of the C-suite had them. It happened organically when one person saw an opportunity that benefitted them, and others followed suit for the same reason.
Transform your Information Management Center of Excellence into an Information Management Center of Evangelism, focusing your efforts on finding and cultivating allies in the development and business areas as well as in management.
You probably know some of these allies already, and you can find more. Be creative in your outreach. One idea is to hold an informational webinar about some hot data topic. Talk about the application of prescriptive analytics or deep learning or large language models in your industry. You know what will resonate most strongly with the people in your company. But be sure to link success to data understanding. Then see who seems interested, who stays behind to ask questions, or who wants to dive into the subject more deeply.
Those who recognize the benefits will already be inclined to support you. You need to give them reasons to become more active in their support: to carve off some portion of their time to work with you, time that they will be pressured not to divert from their current activities. Crystallize the benefits and help articulate them to their management.
Nurture these new relationships. Support your new recruits.
Direct your efforts toward their data domains and applications. Those who work regularly with volunteers will tell you that the worst thing that you can do is to not engage someone who shows interest.
Part 5: Touching the Other Side
Relating the challenge of launching a new corporate process to a flywheel is a tired, overused metaphor. Probably because it so often applies. Like here.
It’s been hard getting started. Probably several false starts. Progress has been agonizingly slow. But now you’ve found a new ally. Hopefully several. They showed interest in working with you, allocated some time to pull together some definitions and expected content, and even got sign-off on the hour or two a week from their management. Maybe this time it will take off. The next step is yours:
Focus on consumption, seeking first to facilitate the work of the consumer.
When I was in college, I had the opportunity to participate in the Apple Macintosh II Seed Program. The school got a pre-release model of the Mac II, the first Macintosh with expansion slots, a separate monitor, and color video capabilities. Our job was to write applications for it. Like the previous Macintosh models, Mac II applications were built around asynchronous events and managers. Commonplace today, but at the time it was a revolutionary new programming concept. The introductory chapter of Inside Macintosh, the definitive developer resource, contained a section entitled “A Horse of a Different Color” that began:
On an innovative system like the Macintosh, programs don’t look quite the way they do on other systems. For example, instead of carrying out a sequence of steps in a predetermined order, your program is driven primarily by user actions (such as clicking and typing) whose order cannot be predicted. You’ll probably find that many of your preconceptions about how to write applications don’t apply here.
Seed Program participants from around the country assembled in Cupertino to learn how to write these applications. (No, we didn’t see Steve Jobs while we were there. He was not at Apple at the time.) The first speaker said something that remains in the forefront of my mind all these years later:
Providing a simple user experience is more complicated for the developer.
Too often we look to simplify our own work and then have the nerve to be surprised when our users don’t embrace the complicated or unintuitive solution we’ve presented to them.
Think obsessively about the user experience. Yes, give it that much attention. Create artifacts that serve your users and that make their jobs easier. Understand the questions that consumers will want to answer, when they will want to answer them, and the information they will need to answer them. Understand the details that development teams will have about new data, when those details will be available, and how that information will be used during development and testing. Recognize that information management is inherently tedious and try to make it as frictionless as possible.
Critically review your processes, especially those you’ve already implemented.
I’ve found that we data people also tend to be good at defining process. After all, we are detail-oriented. Yet, too often our attention returns inward and perfecting the process becomes the goal. We want the metadata and models and everything to be complete and accurate and reviewed and polished and tied up with a red bow before anybody else looks at them. We carefully map out every step, and each can be justified in the interest of metadata quality.
But would you want to use your processes? What if reimbursement or travel or procurement subjected you to processes like yours? Have your new friends review them. Would they use them? Are they intuitive? Do they answer the right questions? You may discover that you have to reevaluate the objectives of your processes if they conflict with your users’ needs.
One of the processes that may need to be revisited is the work product review.
Part of turning our focus outward is involving our new customers in our work product reviews. As I mentioned last time, any reviews that we’ve been doing have probably been for our own closed community. They probably involved stepping through a list of definitions or expected values or a data model projected onto the wall. This approach is not sustainable with our new audience. If we try to include them in those kinds of reviews, we’re going to lose them after the first meeting (if not sooner). We need a new way of communicating: storytelling.
Storytelling is the intersection of information, visualization, and narrative.
We have the information and visualization: the data elements or definitions or data models. We lack the narrative. The idea is not to gloss over the details, but to put them into a broader context. To establish a narrative thread that runs through the artifacts. Don’t just read definitions in isolation; relate them to their business purpose. Describe how they can be used to facilitate application integration and testing. It might be a little more work on our part, but it will greatly improve the customer experience. When done well, both you and your audience will have a deeper understanding of the content.
This article is not intended to be a primer on storytelling. There are myriad books, videos, and websites about storytelling with data. Although most are oriented toward reporting and business-side insights, the concepts can be applied to information management work products as well.
Data modeling is particularly suited for storytelling.
Many years ago, I worked with a data modeler who was brilliant at this. He would attend business requirements meetings and just listen. Occasionally he would draw some boxes, scribble some words, and connect up a few arrows. Eventually he would refer to his model and echo back a summary of the conversation. Something like, “So, each campaign can have multiple customers, but a single customer can only be part of one campaign at a time.” His interpretation would either be confirmed or corrected. If the latter, he would scribble out some part of the model, make the change, and try again. This would continue until the data model fully and accurately described the business scenario.
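To see why this works as translation, consider how that single sentence maps onto structure. Here is a sketch of the scribbled model rendered as DDL, with invented names:

```sql
-- One campaign has many customers; each customer is in at most one
-- campaign at a time.
CREATE TABLE campaign (
    campaign_id   INTEGER      NOT NULL PRIMARY KEY,
    campaign_name VARCHAR(100) NOT NULL
);

CREATE TABLE customer (
    customer_id INTEGER NOT NULL PRIMARY KEY,
    campaign_id INTEGER REFERENCES campaign (campaign_id) -- NULL: not in a campaign
);
```

The nullable foreign key is the “one campaign at a time” rule made precise; had the business said “many campaigns at once,” an associative table would have appeared instead.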
One of the greatest benefits of a data model is its use as a translation tool, allowing business and IT to communicate with precision in a way that is meaningful to both sides. Commercial software packages can translate a data model into a story, but it’s better if you do it for yourself for the time being. Don’t get distracted by tools. You’re making great progress and you’re so close to reaching the other side of the Data Chasm.
The key now is to make information management sustainable:
Incorporate information management (at minimum, expected content) into the application requirements.
This is your goal: partnering with your project management leadership and the business to make information management part of their standard operating procedures. Their processes, not yours. Leverage the value you’ve already demonstrated to secure support, especially from management.
It doesn’t matter whether you’re using waterfall or agile. Data modelers can sit with the business analysts as they develop the Business Requirements Specification. User Story acceptance criteria can be explicitly defined using data element expected content. You know your company’s processes. Engage your allies to help you find the best people to work with and the best places to plug in.
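As one hedged sketch of what that could look like in practice (the table, column, and codes are invented), a user story’s acceptance criterion might be expressed directly as expected content:

```sql
-- Hypothetical acceptance criterion, stated as expected content:
-- "order_status is always populated and contains one of the agreed codes."
-- The story is accepted only if this query returns zero rows.
SELECT *
FROM   customer_order
WHERE  order_status IS NULL
   OR  order_status NOT IN ('NEW', 'SHIPPED', 'RETURNED', 'CANCELLED');
```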
Equip your allies to articulate the benefits of information management to their teams.
It’s tempting for you to present the benefits yourself, but remember, to most everybody else you’re just somebody who parachuted in to make more work for them. Co-workers will have more credibility.
The point isn’t necessarily to create the task, “enter the expected content values into the metadata repository” (although that would be a good one), but rather to ensure that expected content is part of the test case definition so that later the details can be extracted and stored in the metadata repository (which, recall, might just be a set of well-organized spreadsheets on a SharePoint site). Eventually, and assuming you’ve designed your metadata repository for consumption, it will become easier to just put the details into the repository in the first place. But you need to make the barrier to adoption super-low.
I always make the following offer whenever I talk about information management to a project team: if you want to write the data element definitions in crayon on a napkin or whatever, I’ll type them into the repository. That’s how far you should be willing to go to facilitate the work of your customer.
Process integration completes your bridge across the Data Chasm. Maybe it’s just a rickety swinging rope bridge, but it’s a start.
And now may be the most precarious point of the whole journey.
You can begin taking advantage of some of the great resources that live on this side of the chasm, but please, please, please don’t rush headlong into new processes, deliverables, requirements, governance councils, and tools. Don’t neglect the connections that you worked so hard to establish.
You may want to use a framework like the DMBOK wheel as a roadmap for building out new information management capabilities, but focus expansion on the groups that are working with you. You don’t get bonus points for filling out all of the pie slices if they are not used (and we know what that feels like and we don’t want that to happen again). Keep doing everything you did to build the bridge. Continue to find and nurture new allies. Success begets success. Like a flywheel (there it is again).
Last week we celebrated the birthday of Martin Luther King, Jr. He concluded his 1960 Founder’s Day address at Spelman College with one of his most famous quotes, “If you can’t fly, run; if you can’t run, walk; if you can’t walk, crawl; but by all means keep moving.” This applies in so many areas of life. It applies here, too.
Never stop. Never lose focus. And in time that rope bridge will become a highway across the Data Chasm.