I trust that by now you’ve completed Phase Negative One and Phase Zero in your Data Products journey. Congratulations, you’ve reached the starting line. Ready to implement something? Almost.
Let’s recap.
Phase Negative One: Preparation
You have convinced the enterprise, and especially upper management, to pursue Data Products. This may have required some education, both on your part and on management’s. You’ve established a basic infrastructure without getting distracted by technology. You’ve looked at the existing corporate processes and determined what needs to be changed or, if necessary, added to support the delivery of reliable Data Products. Remember, reliability is as important as the data itself, if not more so. Finally, you’ve identified a pilot project partner and something that you are going to deliver together.
Phase Zero: Data Product Charter
I discussed all of the Data Product Charter components here, so I won’t repeat myself, but I do want to underscore a couple. First, domain alignment is the backbone of your Data Product. I can’t overstate the importance of getting that right. Many companies don’t do it at all, and skipping this step guarantees rework later. Also, do everything possible to be consistent across the enterprise. We’re starting to see ontology sprawl in the same way we’ve seen uncontrolled data mart proliferation, data swamps, and now AI Agent sprawl. Your ontology should be as enterprise-wide as possible. It may take some work and some convincing, but it’s worth it to get agreement. Second, establish clear ownership. You’re going to be asking individuals and teams to do things that they haven’t done before. Well, they have been doing them before, just not “on the books” or very efficiently or according to a plan.
OK. Moving on.
As we move closer and closer to implementation and deployment, it becomes increasingly difficult to be prescriptive since each enterprise has to build its own data production and delivery supply chains, oftentimes from scratch. There’s not a one-size-fits-all approach that can be defined, implemented, and replicated across enterprises. Instead, I’ll outline the components of the Data Product delivery pipeline.
Phase One: Data Product Design and Specification
I want to take a minute here and address what may be an elephant in the room. We’ve finished two phases of our Data Product deployment project and we have yet to write any code. Spoiler alert: We don’t write any code in this phase either. But you may be thinking: “Isn’t that why we adopted an agile methodology? So that we can start writing code on day one and, if we guess wrong, fix it along the way?”
This is a common misinterpretation of agile.
Agile doesn’t mean “start coding and we’ll figure it out later.” We don’t ignore the risk of being wrong; we reduce the cost of being wrong. The reality is that “we’ll fix it later” is only cheap for small mistakes. Undoing a flawed model or architecture is much costlier. We don’t skip the thinking; we just shorten the distance between the thinking, the building, and the learning. We’ll get to the code soon enough.
(Second spoiler alert: In the not-too-distant future, with AI, the overwhelming majority of the coding will be done by machine, making the completeness and correctness of these preliminary phases even more important.)
Continuing our annoyingly insatiable appetite for new vocabulary, I’ve been seeing another one: Data Product Contract. Before you ask, no, it’s not the same thing as a Data Contract. Crap. We’ll need to talk about Data Contracts shortly, but tell you what, I’m going to break with tradition and refer to this thing as a Data Product Specification. I understand the use of the term contract since it is an agreement between the Data Product producer and consumer. It includes what is guaranteed and not guaranteed, as well as how and when the guarantees change. Nevertheless, let’s call it a specification to more clearly differentiate it.
The Data Product Specification converts domain understanding into a consumable and testable blueprint.
The idea is to do enough upfront thinking to identify the target state and avoid the obvious dead ends along the way. It exists under change management just like any other project documentation, so existing processes for requesting, reviewing, and approving changes can be leveraged. It’s helpful if all of this is in both human- and machine-readable form, but we’ll talk about that in a future article covering Data Contracts.
Data Product Specification Components:
- Inbound Interfaces: We identified the primary contributing operational systems in Phase Zero. They get formalized here, confirming that they do, in fact, contain the data required for the Data Product.
- Calculations and Aggregations: Define the metrics, granularity, and dimensions. Tie those definitions to the inbound data. This also helps ensure that we’ve got the right source data. At this point, the consumer should have a pretty complete picture of the data that will be available to them.
This is where most planning efforts stop, treating the specification as simply a schema. But not us. Not Data Products. Or at least not reliable ones.
- Data Quality Expectations: This is easy to put off because it’s exacting work and requires an understanding of the inbound data content. Most companies don’t have an understanding of their data content, so that understanding may have to be developed from scratch. Don’t wait to start. Data Quality is the foundation of reliability. You know the dimensions: completeness, validity, timeliness, freshness, etc. Define quality thresholds as well as the processes for measuring, communicating, analyzing, and resolving deviations. Remember, these are operational issues and are not optional.
- Refresh Cadence, Latency, and Availability Expectations: Data is like bread and becomes stale after a while. How and when will the Data Product be updated? Will the whole thing be periodically recalculated or will individual values be continuously updated? Will there be blackout windows? When can new data be expected to be incorporated into the Data Product? Stuff like that.
- History / Change Strategy: A fundamental decision must be made about whether historical data will be available in the Data Product. I’ve seen it done both ways, and differing assumptions about backward compatibility are a common failure mode. Sometimes the business just wants to see the current state and it’s OK to project that current state backward in time. For example, simply updating a customer address in a lookup table will make it appear that the customer had always been there. On the other hand, the previous state may need to be preserved, as with Slowly Changing Dimensions. This applies to metrics as well as dimensions.
- Access Pattern Definition: Will the data be retrieved through ad hoc SQL or through an API or analytics tool? It makes a difference. Streaming or file-based? Will it be used directly or will it be associated with other data? Normalized or denormalized? Indexed, clustered, or partitioned? All of that needs to be specified up front.
- Usage Guidelines: These include how the data should be used and who is allowed to use it. In Phase Zero you identified at least one concrete downstream dependency. You may have attracted additional consumers at this point. Just as important are the anti-patterns, both from the business and technical perspectives. For example, you might not want a particular Data Product to be used for Revenue Reporting because it doesn’t include certain types of transactions. Or, a Data Product should not be used in ad hoc queries because the data is stored on a platform that makes joins very expensive.
- High-Level Logical Data Model: I can hear everybody’s eyes rolling. Panic on the streets of London. We’ve spent decades not doing logical data models. Why should we do them now? And look where we are. Hear me out. I’m not suggesting we do what we usually do, which is to lock some modelers in a little room with a whiteboard and let them fight it out until they emerge six weeks later with a poster full of boxes and lines that gets hung on the wall and never looked at again. We don’t need to get down into every entity and attribute. The purpose here is to articulate the information structure of the Data Product. I refer to it as the Big Box Diagram. Others might call it a Conceptual Data Model. It consists of the domains or subject areas and how they relate to each other. If you’re developing a Foundational Data Product, you’ll probably just have one domain and just one Big Box (although you might want to include other Big Boxes for other domains to illustrate how yours relates to the others). It’s particularly important to work through those relationships when you deploy Composed Data Products that span domains.
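The History / Change Strategy choice above can be illustrated with a minimal sketch (all class, table, and column names here are hypothetical, not a standard): a Type 2 Slowly Changing Dimension preserves prior state by end-dating the current row and inserting a new one, rather than updating in place.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical Type 2 slowly-changing-dimension update. Overwriting a
# customer's address in place would project the new state backward in
# time; instead we close out the current row and append the new state,
# preserving history.

@dataclass
class CustomerRow:
    customer_id: str
    address: str
    valid_from: str          # ISO dates, for simplicity
    valid_to: Optional[str]  # None marks the current row

def scd2_update(rows, customer_id, new_address, as_of):
    """End-date the customer's current row and append the new state."""
    updated = []
    for row in rows:
        if row.customer_id == customer_id and row.valid_to is None:
            # Close out the previously current row.
            updated.append(replace(row, valid_to=as_of))
        else:
            updated.append(row)
    updated.append(CustomerRow(customer_id, new_address, as_of, None))
    return updated

history = [CustomerRow("C1", "12 Old Rd", "2023-01-01", None)]
history = scd2_update(history, "C1", "34 New St", "2024-06-01")
# Two rows now exist: the old address ends where the new one begins.
```

The current-state-only alternative is simply an in-place update, which is cheaper but makes "what did we report last quarter?" unanswerable. That is exactly the trade-off the specification should record.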
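To make the whole component list concrete, here is one possible machine-readable sketch of a specification, expressed as a plain Python dictionary. Every name and value is a hypothetical illustration under the assumptions of this article, not a standard schema.

```python
# A minimal, machine-readable sketch of the specification components
# described above. All field names and values are illustrative.

order_revenue_spec = {
    "name": "order_revenue",
    "inbound_interfaces": ["orders_oltp", "payments_oltp"],  # Phase Zero sources
    "metrics": {
        "gross_revenue": {"grain": "order", "dimensions": ["region", "channel"]},
    },
    "quality_expectations": {
        "completeness": {"order_id": 1.00},  # no missing keys tolerated
        "freshness_hours": 24,               # older data breaches the spec
    },
    "refresh": {"cadence": "daily", "blackout_windows": ["02:00-03:00 UTC"]},
    "history_strategy": "scd2",              # preserve prior state, don't overwrite
    "access_patterns": ["sql", "api"],
    "usage_guidelines": {
        "approved": ["regional sales dashboards"],
        "anti_patterns": ["revenue reporting (excludes refunds)"],
    },
}

def freshness_ok(spec, observed_age_hours):
    """Check one quality expectation against an observed value."""
    return observed_age_hours <= spec["quality_expectations"]["freshness_hours"]
```

Because the expectations are data rather than prose, checks like `freshness_ok` can run in the pipeline itself, which is what turns the specification into something testable rather than shelfware.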
The Definition of Done for Phase One: consumers know exactly what they’re going to get, and how we’re going to ensure that it’s reliable.
I’ll keep repeating it: the “product” in Data Product is reliability. The guarantees are explicit. And reliability isn’t just about correctness today. It’s also about predictability tomorrow.
Next time, we start building.