A previous article established that the primary expectation of Data Product consumers is Reliability. This week we’ll look more closely at how we can provide that reliability. 

The five responsibilities of a Data Product provider are Content, Curation, Catalog, Certification, and Maintenance.

Content: Data Products are created to satisfy anticipated and/or existing business needs. By sharing data artifacts, marginal cost per query decreases and metric standardization across the enterprise increases. This is the easy part since the physical data asset will always be created without any additional prompting. We’ll talk more about this in a future article. Today, I want to focus our attention on the other, equally important components.

Curation: Data Understanding is the necessary foundation for all Data Products. This includes the usual business, technical, and operational metadata suspects like business description and intended business use, expected content, lineage, calculations and transformations, architecture, and security, privacy, and retention requirements. This seems to be the place where most Data Product initiatives fall apart. It shouldn’t be a surprise. Most of us don’t do a very good job of Data Understanding generally, and we need to do more than just wrap existing (often deficient) processes in a new Data Product label.
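
To make the Curation list a little more concrete, here is a minimal sketch of a metadata record that covers the items named above, plus a helper that reports which ones are still blank. The class name, field names, and the `missing_fields` helper are all illustrative assumptions, not part of any particular catalog tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataProductMetadata:
    """Hypothetical curation record; one field per metadata item in the text."""
    name: str
    business_description: str = ""
    intended_business_use: str = ""
    expected_content: str = ""
    lineage: str = ""
    calculations_and_transformations: str = ""
    architecture: str = ""
    security_requirements: str = ""
    privacy_requirements: str = ""
    retention_requirements: str = ""


def missing_fields(meta: DataProductMetadata) -> List[str]:
    """Return the names of metadata fields that are empty or whitespace only."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name).strip()]
```

A release checklist could refuse to publish a Data Product while `missing_fields` returns anything, which is one simple way to keep an existing, deficient process from being relabeled as a Data Product.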

Catalog: The information about the Data Products collected in the Curation activity is made actionable through the Data Product Marketplace or Catalog. This can be just a publicly posted spreadsheet or document, but we can do better. Users expect to get answers to business questions through a self-contained, self-service interface. Most commercial Data Product Marketplace software allows users to access Data Products directly or through common business intelligence tools. The more accurate and complete the information collected during Curation, the faster and more confidently users will be able to find the data that they need. The Catalog should also contain Service Level Objectives and Service Level Agreements, terms of use, and pricing or chargeback terms. It can also be a repository for data quality statistics and utilization metrics.
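
As a rough illustration of why Curation quality drives findability, the sketch below models a catalog entry and a naive keyword search over it. The `CatalogEntry` fields and the `search` function are hypothetical; commercial Marketplace software will offer far richer search, but the dependency is the same: a user can only find what the description actually says.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CatalogEntry:
    """Hypothetical catalog listing for one Data Product."""
    name: str
    description: str
    service_level_objective: str
    terms_of_use: str
    chargeback_terms: str


def search(catalog: List[CatalogEntry], keyword: str) -> List[str]:
    """Return names of entries whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]
```

An entry with an empty description is invisible to every search except one on its exact name.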

Certification: Tell me if this sounds familiar. The analytics platform team receives a request for data. They work with the source system(s) to establish a new data feed. The data in the feed matches the data in the source system. Move it into production. Next. The only time anybody thinks about it after that is when it’s delayed or the users discover a problem with the data (or worse, an external customer discovers a problem with the data). After all, the platform team is already managing hundreds if not thousands of feeds. This approach is not acceptable for a Data Product. Recall that the hallmark of a Data Product is reliability. What I just described doesn’t sound much like reliability, does it? It’s not enough to compare Data Product content to the source content. It’s not even enough to just profile the data when the Data Product is first released. Certification, or validation, is an ongoing operational activity. Of course, the quality of the Certification directly depends upon the quality of your Curation. After all, if you don’t know what the data is supposed to contain, you can’t assess the Data Quality. Pro tip: include the last audit date in the metadata.
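
One way to picture Certification as a repeatable activity rather than a one-time comparison: express what the data is supposed to contain as named expectations, run them on every load, and record the audit date with the result. This is only a sketch under assumptions. The `Expectation` and `CertificationResult` types, the `certify` function, and the sample rules about `customer_id` and `amount` are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Optional


@dataclass
class Expectation:
    """A named rule, taken from Curation, that the whole data set must satisfy."""
    name: str
    check: Callable[[List[dict]], bool]


@dataclass
class CertificationResult:
    passed: bool
    failures: List[str]
    last_audit_date: date  # the "pro tip": keep this in the metadata


def certify(rows: List[dict], expectations: List[Expectation],
            audit_date: Optional[date] = None) -> CertificationResult:
    """Run every expectation and report which ones failed, with the audit date."""
    failures = [e.name for e in expectations if not e.check(rows)]
    return CertificationResult(not failures, failures, audit_date or date.today())


# Illustrative rules only; real ones come from the curated "expected content".
EXPECTATIONS = [
    Expectation("at least one row arrived", lambda rows: len(rows) > 0),
    Expectation("customer_id is never null",
                lambda rows: all(r.get("customer_id") is not None for r in rows)),
    Expectation("amount is non-negative",
                lambda rows: all(r["amount"] >= 0 for r in rows)),
]
```

Notice that the expectations cannot be written at all without the Curation work: each rule is just a machine-checkable statement of what the data is supposed to contain.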

Maintenance: OK, I couldn’t think of a C-word for this one. Like Certification, this is an ongoing operational activity that contributes to reliability by ensuring the continual availability and applicability of the Data Product. Some of this can be automated and integrated into existing monitoring processes. Data Profiles can be run automatically, with an operator notified whenever an anomaly is discovered. Similarly, the arrival of new data or the calculation of the Data Product can be monitored. There’s a second component as well: keeping up with the business purpose and business meaning of the Data Product, and the data from which it is produced. Changes in source system semantics or content may necessitate changes to the Data Product. Reliability means that Data Products accurately reflect the business purpose for which they were created.
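
The automated half of Maintenance can be sketched in a few lines: compare the latest profile statistic (say, a daily row count) against its recent history, and check that new data arrived within an agreed window. The function names, the three-standard-deviation threshold, and the choice of row count as the profiled statistic are all assumptions made for illustration; a real deployment would hook checks like these into whatever monitoring it already runs.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev
from typing import List


def is_anomalous(history: List[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest profile value if it sits more than `threshold`
    sample standard deviations away from the mean of the history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def is_stale(last_arrival: datetime, now: datetime, max_age: timedelta) -> bool:
    """Flag the Data Product if no new data has arrived within `max_age`."""
    return now - last_arrival > max_age
```

Either flag would trigger the operator notification. Neither can detect the second kind of drift, a change in business meaning, which still takes a person who understands the source systems.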

These five activities are fundamental to effective and successful Data Products, and distinguish Data Products from departmental summaries and Data Marts. In the coming weeks we will continue the conversation about Data Products, covering content, implementation, and management.