Here we go again. We seem to have an irresistible reflex to prepend “data” to every noun, verb, and adjective in the dictionary. Here’s another: Data Contract. 

My first thought, when I heard about Data Contracts, was that we’re coining a new term for something that should already be standard practice, and presenting it as if it were new.

Perhaps. Or perhaps not.

If you’re considering implementing a Data Fabric, especially using Data Products, it won’t be too long before you run across Data Contracts. 

Data Contracts are agreements between data producers and data consumers that specify data structure, semantics, responsibilities, and service-level commitments.

We’ve certainly seen something like this before. After all, interface agreement is required for any automated data exchange, whether internally within a company or externally between companies. Standard APIs or specifications are used in Finance (SWIFT), Healthcare (FHIR), and Retail (EDI), to name just a few.

OK. Sounds like we’re just repackaging again. 

Let’s look more closely at what a Data Contract contains. Most of these are fairly self-evident. A Data Contract includes:

  • Metadata for the dataset itself
  • Schema definition
  • Metadata for the data elements
  • Operational expectations / SLAs
  • Quality and validation rules
  • Access and security controls
  • Consumer obligations

Seem familiar?
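To make those components concrete, here is a minimal sketch of a Data Contract expressed as a Python dictionary. Every field name and value is a hypothetical illustration, not a standard; real contracts are richer and are serialized to a shared format.

```python
# A minimal, illustrative Data Contract as a Python dict.
# All names and values here are hypothetical examples, not a standard.
order_events_contract = {
    "metadata": {                      # metadata for the dataset itself
        "name": "order_events",
        "owner": "orders-team@example.com",
        "version": "1.2.0",
    },
    "schema": {                        # schema + metadata for the elements
        "order_id":   {"type": "string",    "description": "Unique order key"},
        "amount_usd": {"type": "decimal",   "description": "Order total in USD"},
        "placed_at":  {"type": "timestamp", "description": "UTC order time"},
    },
    "sla": {                           # operational expectations
        "freshness_minutes": 60,
        "availability_pct": 99.5,
    },
    "quality": {                       # quality and validation rules
        "order_id":   ["not_null", "unique"],
        "amount_usd": ["not_null", "non_negative"],
    },
    "access": {"classification": "internal", "pii": False},
    "consumer_obligations": ["no re-sharing outside the domain"],
}
```

Each top-level key maps to one of the bullets above; the point is that the whole agreement lives in one structured, machine-readable artifact.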

My second thought was that this looks an awful lot like the key requirements for a Data Product: data, metadata, lifecycle management, and support.

That would explain why Data Contracts are often used to accumulate the details required for creating, executing, maintaining, and supporting Data Products.

The information in a Data Contract is captured in a structured and easily interpreted format, although there’s not yet widespread agreement on what that format should be. Some use JSON. Others YAML or XML. And still others use a variety of lesser-known or proprietary formats. This lack of agreement turns out to be a recurring challenge with Data Contracts.
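Whatever format a team settles on, the goal is the same: a machine-readable artifact. As an illustration (the contract fields here are made up), Python’s standard library can emit a contract as JSON; YAML or XML would simply be a different serialization of the same structure.

```python
import json

# Hypothetical contract fragment; the field names are illustrative only.
contract = {
    "name": "order_events",
    "version": "1.2.0",
    "schema": {"order_id": {"type": "string", "description": "Unique order key"}},
}

# Serialize to JSON. Swapping in a YAML or XML library would change
# only the output format, not the contract's content.
serialized = json.dumps(contract, indent=2)
print(serialized)
```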

I’ve always said that metadata should flow from source to target like the data itself. Define the descriptions, expected content, etc. once at the System of Record and propagate them to wherever that data is used. Don’t create everything from scratch at each destination.

Data Contracts enable that vision.

Furthermore, the schema, content, support, maintenance, SLAs, and more can be validated against that commitment. And much of that validation can be automated.

Not everything can be automated, though. Like any “documentation,” using Data Contracts requires discipline. They require ongoing maintenance, and it’s the responsibility of the data producer to ensure their accuracy. A disconnected, abandoned, or neglected Data Contract is no better than the documentation that we haven’t been producing for decades. Worse, really, since it will inevitably lead to application errors, AI hallucinations, and faulty decisions, and will ultimately undermine trust in the data. And once that trust is lost, it’s extraordinarily difficult to get back.

Today, Data Contracts are most often used to support Data Products and the Data Mesh and Data Fabric architectures built from them.

A Data Contract can be used as the template for all of the details that need to be collected, defined, specified, and aligned for Data Products to be used in production.

Another common current use is to support and automate Data Quality and compliance. The information stored in Data Contracts can drive Data Quality tools, and completeness checks can be incorporated into the CI/CD pipeline.
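As a sketch of what that automation might look like, assuming the contract stores quality rules under made-up names like "not_null" and "unique", a CI step could replay those rules against a sample of the data:

```python
# Hypothetical sketch: driving automated quality checks from a contract.
# The contract shape and the rule names are assumptions for illustration.
def validate(rows, contract):
    """Return a list of rule violations found in `rows`."""
    violations = []
    for column, rules in contract["quality"].items():
        values = [row.get(column) for row in rows]
        if "not_null" in rules and any(v is None for v in values):
            violations.append(f"{column}: null value found")
        if "unique" in rules and len(values) != len(set(values)):
            violations.append(f"{column}: duplicate values found")
    return violations

contract = {"quality": {"order_id": ["not_null", "unique"]}}
rows = [{"order_id": "A1"}, {"order_id": "A1"}, {"order_id": None}]
print(validate(rows, contract))
# → ['order_id: null value found', 'order_id: duplicate values found']
```

In a pipeline, a non-empty violation list would fail the build, so producers learn about a broken commitment before consumers do.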

Unfortunately, tools and frameworks that create, publish, and validate Data Contracts are proprietary, in development, or vaporware. Some practitioners and early adopters have advocated using dbt as a Data Contract framework. dbt (data build tool; the abbreviation is deliberately lowercase) is an open-source command-line tool that streamlines data transformation and modeling within data warehouses. Its models look like a combination of SQL and a templating language. dbt has been around since 2016, and version 1.0 was released in 2021.

With Data Contracts still so new, and with so much attention directed toward AI, vendors have been hesitant to invest in developing GUI Data Contract Management Systems. A few tools are starting to appear, but you don’t have to invest in one yet. Just gathering the information required for your Data Contracts is an excellent, and necessary, first step.

Let’s keep an eye on this one.

As distributed data and analytics architectures become more widely accepted and implemented, Data Contracts are likely to become a key component: the glue holding those architectures together.