At this point you are, no doubt, ready to get this thing started and burst out of the gate coding. I know, some of you have already started the project, and I get it. But just one more minute. We need to discuss something I mentioned briefly last time:
Phase 2: Data Contract
More jargon, I know. But you’ll be hearing a lot about these if you haven’t already. They’re also filed under the categories everything old is new again and those who forget history are condemned to repeat it.
Some of you reading this have been around long enough to remember CASE Tools.
The idea was to generate software from structured specifications. The best-known CASE Tool was IBM Rational Rose; PowerDesigner and ERwin could also be considered CASE Tools for data modeling and databases.
CASE Tools defined systems through models and diagrams, creating a single source of truth. The code was then generated from the design. If a change was required in the application or database, the model in the CASE Tool was edited and the code was regenerated to implement the change.
It rarely worked out that way in practice.
Except in the most structured and disciplined organizations, the diagrams and the code diverged almost immediately. Developers frequently bypassed the CASE Tool and updated the applications directly. Worse, if an update was later made within the tool and the code regenerated, the manual change was lost and had to be rediscovered and reimplemented. Preferably the right way the second time.
Data Contracts look very similar.
A Data Contract is a formal, technical agreement between a data producer and a data consumer about the structure and behavior of data, often within the context of a Data Product.
They include detailed schemas, semantics, quality expectations, freshness requirements, delivery mechanism details, and change management rules. All stuff we included in the Data Product Specification. The formal Open Data Contract Standard can be found here. Metadata tools are starting to include Data Contract definition and implementation support.
So why, then, do we need yet another thing? Two reasons. First, the information in a Data Product Contract is a subset of the information collected in the Data Product Specification. Second, Data Product Contracts are stored in machine-readable form (e.g., YAML or JSON) and are used in development and production automation.
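To make "machine-readable" concrete, here is a minimal sketch of what such a contract might look like. The field names are hypothetical, loosely inspired by the kinds of sections the Open Data Contract Standard defines; a real contract would live in a YAML or JSON file under version control, not in application code.

```python
import json

# Illustrative Data Contract as a plain dict. Every field name here is
# an assumption for demonstration; real contracts follow a published
# standard and live in a versioned YAML/JSON file.
contract = {
    "product": "customer_orders",
    "owner": "orders-team@example.com",
    "version": "1.2.0",
    "schema": {
        "order_id": {"type": "string", "required": True},
        "order_total": {"type": "number", "required": True},
        "placed_at": {"type": "timestamp", "required": True},
    },
    "quality": {"null_rate_max": 0.01},
    "freshness": {"max_lag_minutes": 60},
}

# Because the contract is structured data, tooling can serialize it,
# diff versions of it, and validate pipelines against it automatically,
# instead of treating it as documentation that humans may or may not read.
print(json.dumps(contract, indent=2))
```

The point is not the exact shape; it is that the contract is data, so automation can act on it.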
The Data Product Contract is the link between design and implementation as well as the key to automated, decentralized governance that ensures ownership, quality, and accountability.
Consider how disconnected traditional Data Governance is from the running code. Most companies consider it to be documentation and treat it like documentation. In other words, as optional. If the application changes, or the data content changes, or the semantics change, and nobody updates the metadata, does anybody notice? Or care? Outside of process discipline, the only points of intersection are the Data Quality checks, and many (if not most) companies don't even do that. So, how do we keep from going down the same CASE Tool path with Data Contracts? Two ways:
First, they must be owned by the data producers, not centralized data or architecture teams.
Second, they must be integrated into the tooling (both development and production) and enforced at runtime.
Any Data Contract initiative that isn’t executable and enforced through automation will fail for the exact same reasons that CASE tools did.
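"Enforced at runtime" can be as simple as refusing to publish a record that violates the contract. Here is a minimal sketch of that idea; the schema, field names, and error handling are illustrative assumptions, not any particular tool's API.

```python
# Hypothetical runtime enforcement: every outgoing record is checked
# against the contract's schema before it is published downstream.
# The schema here is a deliberately simplified stand-in.
CONTRACT_SCHEMA = {
    "order_id": str,
    "order_total": float,
}

def enforce(record: dict) -> dict:
    """Raise if a record violates the contract; pass it through otherwise."""
    for field, expected_type in CONTRACT_SCHEMA.items():
        if field not in record:
            raise ValueError(f"contract violation: missing field {field!r}")
        if not isinstance(record[field], expected_type):
            raise ValueError(
                f"contract violation: {field!r} should be "
                f"{expected_type.__name__}, got {type(record[field]).__name__}"
            )
    return record

# A conforming record passes through unchanged.
good = enforce({"order_id": "A-100", "order_total": 29.99})

# A non-conforming record is rejected instead of silently flowing on.
try:
    enforce({"order_id": "A-101"})  # missing order_total
except ValueError as exc:
    print(exc)
```

The crucial difference from documentation: a violation stops the pipeline, so the contract and reality cannot silently diverge.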
That’s the difference. With CASE Tools, the design defined the reality. This was great in theory but failed in practice. With Data Contracts, reality must conform to the design or the system fails. It’s like defining Data Quality requirements, but more so. Yes, Data Quality requirements are part of a Data Contract, but so too is the schema, ownership, and all that other stuff.
Finally, Data Contracts are a company’s only defense against the chaos caused by creating new datasets for every report, analysis, model, or question because storage is cheap or we don’t want to bother looking for an existing dataset. I’ll come back to this disturbing trend in a future article.
OK. It’s time to start coding. Where were we?
Phase 3: Build and Integrate
I’m not going to be prescriptive here because everyone has done this part before. Each enterprise has its own development processes and requirements.
Your first project may be challenging, though, because you’ll probably be defining and implementing Data Contract integration at the same time you’re implementing your Data Product. Once that’s done, subsequent projects will go much faster. That’s kind of the way it is with everything.
Implement the Data Product according to the Contract.
You would think this would go without saying, but we've seen it too many times: developers on a project with a whole lot of really great up-front work go off and start doing their own thing once coding begins.
Don’t invent while coding. If a question arises, get an answer. Don’t assume. Don’t guess. The rest is all stuff that we’ve all done before: source extraction, transformation and modeling, data quality rule implementation, monitoring and alert setup, and incremental consumer validation.
Implementation is finished when the Data Product satisfies the Contract under real conditions (not just test data), all of the freshness and quality requirements are met, monitoring and alerting are live, and the application contains no undocumented logic. This last one is how we avoid divergence from the Contract. Leverage AI to confirm this by analyzing the code against the Contract.
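The freshness and quality requirements in that definition of done can themselves be automated checks. The following is a sketch of what such checks might look like; the thresholds and function names are illustrative assumptions drawn from the example contract above, not a prescribed implementation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative "done" gates derived from contract requirements.
# Thresholds are assumptions for demonstration.
MAX_LAG = timedelta(minutes=60)   # contract freshness requirement
NULL_RATE_MAX = 0.01              # contract quality requirement

def freshness_ok(last_loaded_at: datetime, now: datetime) -> bool:
    """True when the most recent load is within the contracted lag."""
    return (now - last_loaded_at) <= MAX_LAG

def null_rate_ok(values: list) -> bool:
    """True when the share of missing values is within the contracted limit."""
    if not values:
        return False
    return values.count(None) / len(values) <= NULL_RATE_MAX

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
assert freshness_ok(now - timedelta(minutes=30), now)       # fresh enough
assert not freshness_ok(now - timedelta(hours=2), now)      # stale
assert null_rate_ok(["a"] * 200)                            # no nulls
assert not null_rate_ok(["a", None, "b"])                   # ~33% nulls
```

Checks like these can run in CI against real data and again in production monitoring, so "satisfies the Contract" is verified continuously rather than asserted once.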
One particular pitfall to avoid here: do not treat prototypes as products. We all want to leverage the work we’ve done and if we’ve got it to the point where it’s useful to someone, why not just let them have it? Because prototypes are almost always fragile and unmaintainable. If it’s worth delivering, it’s worth delivering right. You’ll thank yourself later. And your consumers will thank you.
Phase 4: Consumer Enablement and Adoption
Hooray! The data is getting where it needs to go and appears to contain what it needs to contain. Mission accomplished. Chicken lunches all around!!
Just because we’ve delivered a Data Contract, code, and data doesn’t mean we’re finished. Adoption is a feature. Do not assume the Data Product will be used, or that if it is used it will be used correctly. Consumers need to know when and, perhaps more importantly, when not to use the Data Product. You actually worked this out in an earlier phase. Make sure everyone knows it.
It is the Data Product producer’s responsibility to ensure that the Data Product is used correctly and independently. The independently part is key.
The Definition of Done for this phase is met when at least one consumer depends on the product without hand-holding.
In theory if you’ve done everything else properly there shouldn’t be a problem now, but we all know that there are always problems. Anticipate them. Plan for them. Resolve them. And finally, deprecate duplicate existing systems.
Phase 5: Operationalization and Handoff
The Data Product is implemented. Now, make it sustainable.
You’ve already defined the ownership and escalation paths, feature and defect backlog management processes, and change management processes. Make sure they’re working smoothly. Sand out any rough spots. Monitor usage and quality trends. Your company may require the delivery of a run book to operations.
Here’s the final test:
Can the Data Product survive team turnover and source system change? If a new analytics team joins in six months, could they build this without talking to you?
When the answer is “yes,” then you can celebrate.
And move on to the next one.