It’s amazing how quickly Data Product joined Metadata, Data Quality, and Data Governance as a functionally useless term. Consult ten different sources and you’ll get fifteen different definitions. A Data Product is variously defined as:
- The theory of applying a product mindset to data sets.
- Any tool or application that processes data and generates insights.
- A reusable asset that delivers a trusted dataset for a specific purpose.
- Software designed to support data used as a service.
And many, many more. Remarkably, this ambiguity is almost entirely of our own making, and I guess now is as good a time as any to address it.
Somewhere along the line we thought it would be a good idea to differentiate “Data Product” from “Data as a Product” and then use the two interchangeably in practice.
The original definition of Data Product, published in 2012, was: a product that facilitates an end goal through the use of data. I suppose that’s useful in that it provides an umbrella under which one could include everything. Which, I suppose, makes it not so useful. That definition has been refined over the past decade, and today the consensus definition is: an application or tool that uses data to provide a service or to solve a problem. Examples include reports, dashboards, alerts, triggers, health monitoring applications, financial analysis tools, customer analytics platforms, and smart home devices. By this definition, the app on your smart watch that displays the current temperature and the light alerting you to low tire pressure are both Data Products.
On the other hand, the consensus definition of Data as a Product is: treating data as a trusted commodity built and maintained for users, with an explicit emphasis on quality, accessibility, and utility. It’s what I’ve been talking about in previous articles. To co-opt a famous Fred Smith quote, “The information about the data is at least as important as the data itself.”
But is this really a hair we want to split?
Data Product and Data as a Product are two parts of the same whole, and each is incomplete without the other.
The definition of Data Product encompasses the data itself, certainly, but it doesn’t include any requirement for understanding the data, for supporting it, or for maintaining it. This should sound familiar. It’s what we’ve been doing in the analytics space for decades. You didn’t know it at the time, but that COBOL program you (or your parent) wrote in 1992 produced, yup, a Data Product. It’s the old summaries, extracts, models, APIs, and services with a new label. We get to claim to be doing something forward-thinking and modern without having to do anything different.
The customer profile report just handed to you is a “Data Product,” but you have no idea whether there’s any curation, certification, or maintenance behind it. That’s what’s required for it to be considered Data as a Product. So, in that case is it a Data as a Product Data Product? Data as a Product Data?
The vocabulary doesn’t permit differentiation between curated and uncurated Data Products.
This is the most critical distinction. It’s what separates the trustworthy from the unreliable.
Data Products, defined in this way, are not sufficient to sustainably support artificial intelligence, machine learning, and analytics because reliability is not a requirement.
To be useful, Data Products must also be Data as a Products (or is it Datas as a Product, or maybe the singular and plural are the same). A Data Product requires all of the Data as a Product stuff: curation, catalog, certification, and maintenance. Without it, the Data Product is not reliable. It’s just a summary or report or some other seat-of-the-pants artifact. It is disingenuous to call it a Data Product, giving it a moniker that connotes reliability. The label creates a false sense of security and undermines our analytics efforts while pretending to support them.
Therefore, the distinction between Data Product and Data as a Product is unnecessary.
So, do we want to create yet another new term? Maybe something like “Curated Data Product” or “Trusted Data Product” or “Certified Data Product”? I would prefer instead that we make the definition of “Data Product” more useful and more consistent with its current connotation and utilization: that it includes the data and the artifacts and activities that make it reliable and trustworthy.
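For readers who think in code, here is one rough way to picture that definition. This is only a sketch under my own assumptions: the class name, the field names, and the 90-day review interval are all hypothetical, chosen for illustration rather than taken from any standard or tool. The point is simply that the data's location is one field among several, and that the other fields (catalog entry, curating owner, certification date) are what would let us call the thing reliable.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DataProduct:
    """Hypothetical model: the data plus the artifacts that make it trustworthy."""
    name: str
    location: str                          # where the data itself lives
    description: str = ""                  # catalog entry: what the data means
    owner: Optional[str] = None            # who curates and supports it
    certified_on: Optional[date] = None    # when it last passed certification
    review_interval_days: int = 90         # assumed maintenance cadence

    def missing_requirements(self, today: date) -> List[str]:
        """List which of the four requirements are not currently met."""
        gaps = []
        if not self.description:
            gaps.append("catalog")
        if self.owner is None:
            gaps.append("curation")
        if self.certified_on is None:
            gaps.append("certification")
        elif (today - self.certified_on).days > self.review_interval_days:
            # Certified once, but not re-reviewed within the assumed interval.
            gaps.append("maintenance")
        return gaps

    def is_reliable(self, today: date) -> bool:
        """Reliable only when nothing is missing."""
        return not self.missing_requirements(today)


# The customer profile report that was "just handed to you": data, nothing else.
report = DataProduct(name="customer_profile_report",
                     location="reports/customer_profile.csv")
print(report.missing_requirements(date(2024, 1, 1)))
# ['catalog', 'curation', 'certification']

# The same report with the supporting artifacts in place.
curated = DataProduct(name="customer_profile_report",
                      location="reports/customer_profile.csv",
                      description="One row per active customer.",
                      owner="crm-team",
                      certified_on=date(2024, 1, 1))
print(curated.is_reliable(date(2024, 2, 1)))   # True
print(curated.is_reliable(date(2024, 6, 1)))   # False
```

Note that the last check fails even though nothing about the data changed: under this sketch, a certification that is never revisited eventually stops counting, which is the maintenance requirement in miniature.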
If we really need a term that encompasses everything that’s data, and I’m not suggesting that we do, I would advocate for Data Artifact. An artifact is something created or observed. Nothing more. It simply exists. “Artifact” more honestly and accurately describes the lowest common data denominator than “Product.”
But whatever we call it, remember that our users don’t care about Data Products or Data as a Product or whatever. They want the data. They want to be able to find it and use it. And they want the reliability. Period.