With increased recognition that the new AI models and applications need high-quality, well understood data, Data Products are enjoying renewed attention. The concept has been fleshed out in recent years with definitions, reference architectures, and platforms. They consist of …. Actually, let’s not worry about what Data Products consist of. At least not right now. We’ll get into all that here at some point. But that’s not the important part. Instead, let’s take it all the way back to the bare walls and start where we should always start: the consumer.

Imagine you’re an analytical data consumer. Maybe an analyst or data scientist or whatever. You have a question to answer and you need data. You want to spend your time generating insights, but too often you end up spending all your time finding, gathering, validating, and cleansing the data first. So much wasted time.

But wasn’t that what Data Warehouses and Data Marts and Data Lakes and Data Lakehouses were for? They certainly help with some of the gathering and finding, but they don’t seem to be working for the validating and cleansing. Many appear to have given up on the problem. Validation and cleansing capabilities have been incorporated into several existing analytical tools and built into their standard workflows. The evidence suggests a pervasive and fundamental lack of trust in the data. And that brings us back to Data Products and the reason for their existence in the first place. What is the difference between a Data Product and a Data Mart, summary, or shared table? What do consumers want?

From the consumers’ perspective, the key differentiator of a Data Product is reliability.

As a purveyor of Data Products, you face a single expectation from your customers: that you provide reliable data. To be successful you must earn their trust, and that’s more than just “quality because I said so.”

Think about how often we take reliability and trust in authority for granted in other areas of life. We don’t give a second thought to doing our own data validation and profiling exercises on the data sets we are considering for our analyses. But do you ever open the box of Corn Flakes you just took down from the grocery store shelf to make sure that it has Corn Flakes in it? Of course not. That’s silly. Now, if the bags, boxes, and cans were unlabeled, you’d have to open each one to see what was in it. Eventually you’d find the Corn Flakes. It’s no different with data. Your users shouldn’t have to open every bag, box, or can of data to discover what it contains.

You are the authority that vouches for the data so that every one of your users doesn’t have to do it themselves individually and repeatedly.

So, from the users’ perspective, what do they expect you to do to ensure that the data is reliable?

They expect the data to be labeled completely and accurately. This is basic Data Understanding. Definitions. Expected content. Lineage. Transformations. This allows them to quickly and easily find the data they need. It’s only minimally useful for a box containing Corn Flakes to be labeled “Cereal” or “Flakes.” 
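
To make the label a little more concrete, here is a minimal sketch in Python of what one might carry. Everything in it is hypothetical and made up for illustration: the `DataProductLabel` class, its field names, and the sample `customer_orders` product are assumptions, not a standard or any particular platform’s schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataProductLabel:
    """Hypothetical label for a Data Product; field names are illustrative only."""
    name: str
    definition: str                    # what the data means, in business terms
    expected_content: dict[str, str]   # column name -> plain-language expectation
    lineage: list[str] = field(default_factory=list)          # upstream sources
    transformations: list[str] = field(default_factory=list)  # steps applied

    def missing_parts(self) -> list[str]:
        """Return the names of the label sections that were left empty."""
        parts = {
            "definition": self.definition,
            "expected_content": self.expected_content,
            "lineage": self.lineage,
            "transformations": self.transformations,
        }
        return [section for section, value in parts.items() if not value]


# The data equivalent of a box labeled only "Cereal".
vague = DataProductLabel(name="Cereal", definition="", expected_content={})

# A label that tells the consumer what is actually in the box.
specific = DataProductLabel(
    name="customer_orders",
    definition="One row per confirmed customer order.",
    expected_content={"order_total": "non-negative amount in USD"},
    lineage=["orders system"],
    transformations=["cancelled orders removed"],
)

print(vague.missing_parts())
print(specific.missing_parts())
```

The point of `missing_parts` is only that an incomplete label is something you can detect and refuse to publish, rather than something your users discover later.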

They expect the data to be accessible. IT and even most data folks approach the question of finding data differently than most business and analytics consumers. Think about your Data Product Marketplace software from the users’ perspective. Talk with them. Conduct focus groups. Make it easy for them to find what they’re looking for, to compare candidate Data Products (like comparing multiple television models), and to access that data through their preferred analytics tools.

They expect the data to always contain its expected content. The most significant distinguishing characteristic of a Data Product is your assurance that its contents are always correct. That you are doing the research and validation and curation so that your users don’t have to. Unless the Data Product is intended to capture the state of the business at a single point in time, the expectation is that it will be kept up to date, reflecting changes in the business as implemented in the source systems and propagated through the data. It requires continuous Data Quality monitoring and continuous maintenance.
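
As a rough illustration of what that monitoring might look like, the sketch below checks a batch of records against declared expectations each time the data is refreshed. The `Expectation` type, the `check_batch` function, and the sample order records are all hypothetical; in practice you would more likely lean on a dedicated data quality tool and a scheduler than on hand-rolled code like this.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Expectation:
    """Hypothetical rule: a column name plus a predicate its values must satisfy."""
    column: str
    description: str
    is_valid: Callable[[Any], bool]


def check_batch(rows: list[dict[str, Any]], expectations: list[Expectation]) -> list[str]:
    """Return one human-readable message per violated expectation."""
    failures = []
    for exp in expectations:
        # Count the rows whose value for this column fails the predicate.
        bad = sum(1 for row in rows if not exp.is_valid(row.get(exp.column)))
        if bad:
            failures.append(
                f"{exp.column}: {bad} of {len(rows)} rows fail '{exp.description}'"
            )
    return failures


# Illustrative expectations for a made-up orders data set.
expectations = [
    Expectation("order_id", "is present", lambda v: v is not None),
    Expectation(
        "order_total",
        "is a non-negative number",
        lambda v: isinstance(v, (int, float)) and v >= 0,
    ),
]

# Illustrative batch: the second and third rows each break one expectation.
batch = [
    {"order_id": 1, "order_total": 25.0},
    {"order_id": 2, "order_total": -4.0},
    {"order_id": None, "order_total": 10.0},
]

failures = check_batch(batch, expectations)
for message in failures:
    print(message)
```

Run on every refresh, a non-empty result is the signal that the contents no longer match the label, and that somebody on your side needs to act before a consumer finds out the hard way.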

You, of course, have more to think about than that. For example, security is extremely important to you and to the company, but maybe not so much to your users. Generally speaking, data consumers only care about security when they can’t get access to the data they need and then get annoyed because they have to wait to finish what they were working on while they get the necessary approvals.

The point is that there’s a lot involved in creating and curating Data Products, as well as in deploying a Data Product Marketplace. But that’s the level of service we should want to provide to our users. It’s the service that will accelerate AI, ML, and advanced analytics delivery and improve the quality of our models. The service that will become a competitive differentiator for our company. It’s what our users expect from us. And it’s not a technical challenge. 

Data Product development is first and foremost a mindset requiring culture and discipline.

Technology can facilitate, but technology alone is not sufficient. I’ve seen the Data Product label on Data Marts, summary tables, and even raw data with none of the curation or monitoring. I wonder how many of us are gaslighting our users by claiming that our Data Products are reliable when we don’t even know what the data is supposed to contain.

We’ll talk about definitions, reference architectures, and platforms soon, but none of it will be successful without a culture that values Data Understanding and the discipline to fully incorporate it into the development processes. If you’re interested in AI (and most everyone is), and you’re interested in using Data Products to train your models (because accurate models require accurate data), then this is where you have to start.

Cover image, “The Flying Caceres” by Arthur T. LaBar at flickr.com. Copyright 2015, some rights reserved.