Through several articles this year, I’ve laid out a process for implementing Data Products, from Phase Negative One to Phase Five. Wouldn’t it be great if projects could be run that way? Everyone is aligned before we start, roles and responsibilities are clearly assigned, we don’t have to untangle dysfunctional processes, and we don’t have to repay data debt. I love the old joke: How was God able to create the heavens and the earth in six days? Because he didn’t have a previous install base. This is a best-case or greenfield scenario, but most of us will never see it.

Nearly all companies already have established analytical environments: mature Data Warehouses, Data Lakes, and Data Lakehouses. Probably done a cloud migration as well. Maybe two. Built using deeply entrenched processes. But experience suggests that curation before delivery and support afterward are bare-bones at best. “Make the data available as quickly as possible,” such as it is. It’s only after the resource is deployed that anyone thinks seriously about management and governance. It’s 180 degrees backwards, but it’s also reality. Governance should precede delivery, but that ship has sailed. Waaaay sailed. As a result, very few existing data artifacts satisfy the Data Product requirements, and very few are sufficiently reliable to be used for AI.

We have the data but not the reliability.

Furthermore, reliability isn’t a function of the data. Let’s say you’re holding a handful of data. Is it a Data Product? You don’t know. In fact, it’s pretty much everything except the data that differentiates a Data Product from a Data Risk Vector. It’s about satisfying the Definition of Done for a Data Product, especially the processes and technology.

I’ve talked a lot about process before so I’m not going to repeat myself, but I do want to clarify something about technology. I know that I said not to worry about technology. That’s true, especially when you’re first establishing roles and responsibilities. Technology may be a secondary consideration, but I never said it was unnecessary. Instead, I warned that it is too often a distraction, diverting focus and draining energy away from working through the more difficult and more important issues.

Besides, if you already have an existing analytical environment, then you already have a technical stack: repositories, transformation pipelines, development and project management platforms, maybe even a metadata repository. It’s OK to start by leveraging whatever existing resources you have, even if it means storing accountability details in notebooks and support contacts in Excel spreadsheets. But eventually you’ll need to move beyond those.

Ensuring reliability requires automation. 

Data preparation processes have long been automated and are, for the most part, reliable. Same for the repositories and data access tools. Support and monitoring can be a mixed bag. But most enterprises have fallen short in automating curation, validation, maintenance, and process accountability.

We have data infrastructure but not reliability infrastructure.

Data Contracts and dbt were created to provide that automated reliability infrastructure. The prescription is simple, but as you might expect, it requires commitment and discipline. Introducing Data Products into an existing analytical ecosystem requires three-track thinking and three-track planning. The first track focuses on leveraging or enhancing the existing processes and infrastructure to provide the foundation upon which Data Products will be built. The second track leverages those processes and infrastructure to opportunistically harden existing artifacts into Data Products.

Complete two gap analyses, then systematically plug the holes.

Start with the template for a greenfield Data Product project. Think of it as a checklist of things that you need to get done, both for the environment as a whole and for an individual Data Product. Some development might be required, especially when it comes to monitoring and data quality, but the hardest parts will be getting agreement and participation for the processes. That’s why you still need to start with Phase Negative One. Keep ticking off items from the list. When everything’s checked, you’ve got a Data Product and a support infrastructure that is sufficiently robust to ensure reliability. 
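To make the two gap analyses concrete, here is a minimal sketch in Python. The checklist items and dataset names are hypothetical placeholders I made up for illustration, not an official Definition of Done; substitute whatever your own greenfield template requires. The same small function covers both analyses: one for the environment as a whole, one for an individual candidate Data Product.

```python
from __future__ import annotations

# Hypothetical checklist items (assumptions for illustration only).
# Replace them with the items from your own greenfield template.
ENVIRONMENT_CHECKLIST = [
    "roles_and_responsibilities_agreed",
    "monitoring_platform_available",
    "data_quality_tooling_available",
]
PRODUCT_CHECKLIST = [
    "owner_assigned",
    "data_contract_published",
    "quality_checks_automated",
    "support_contact_listed",
]


def gap_analysis(checklist: list[str], completed: set[str]) -> list[str]:
    """Return the checklist items not yet completed, in checklist order."""
    return [item for item in checklist if item not in completed]


# Gap analysis 1: the environment as a whole.
environment_done = {"roles_and_responsibilities_agreed"}
environment_gaps = gap_analysis(ENVIRONMENT_CHECKLIST, environment_done)

# Gap analysis 2: one existing dataset being hardened into a Data Product.
orders_done = {"owner_assigned", "support_contact_listed"}
orders_gaps = gap_analysis(PRODUCT_CHECKLIST, orders_done)

print("Environment gaps:", environment_gaps)
print("Orders dataset gaps:", orders_gaps)

# Only when both lists are empty is the dataset a Data Product.
print("Is a Data Product:", not environment_gaps and not orders_gaps)
```

A spreadsheet would do the same job at first. The point is simply that "done" is an explicit list, and the gaps are whatever remains unchecked.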

The third track, then, is to deploy new analytical artifacts as proper Data Products in the first place.

As you go along, be sure to differentiate between Data Products and unreliable datasets (and, yes, I would start referring to them in that way). Store them in a separate database, prefix the name with DP_, or place them in a designated folder within the analytical tool. Assign them to a different security group. After a while, require special permission to access non-Data Product datasets, but that won’t be feasible at first.
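If you go the naming-convention route, separating the two groups is easy to automate. The sketch below assumes the DP_ prefix is applied consistently and is case-sensitive; the catalog entries are invented examples, and in practice the names would come from your own metadata repository or database catalog.

```python
from __future__ import annotations

# Assumption: Data Products are marked with this exact, case-sensitive prefix.
DATA_PRODUCT_PREFIX = "DP_"


def is_data_product_name(name: str) -> bool:
    """True if the dataset name carries the Data Product prefix."""
    return name.startswith(DATA_PRODUCT_PREFIX)


def partition_catalog(names: list[str]) -> tuple[list[str], list[str]]:
    """Split dataset names into (Data Products, unreliable datasets)."""
    products = [n for n in names if is_data_product_name(n)]
    unreliable = [n for n in names if not is_data_product_name(n)]
    return products, unreliable


# Hypothetical catalog entries for illustration.
catalog = ["DP_orders", "customer_extract_v2", "DP_inventory", "tmp_sales"]
products, unreliable = partition_catalog(catalog)

print("Data Products:", products)
print("Unreliable datasets:", unreliable)
```

The resulting lists could feed whatever mechanism you use to assign security groups, so that the stricter permissions on non-Data Product datasets can be switched on later without re-inventorying everything.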

Prioritize your most useful or riskiest datasets. Work with teams that are willing to work with you. Demonstrate success. Reach a critical mass of identified Data Products and the others will follow.

“You fight with the army you have, not the army you wish you had.” The same applies to transitioning an existing analytical environment to support Data Products. After all, the best way to do something that’s worth doing but hard is to:

1) Commit to taking the first step,

2) Commit to taking each subsequent step, one at a time if necessary.
