Awesome! You’re moving forward with a modern analytics architecture built upon Data Products. Maybe you’re going to implement a Data Mesh and/or Data Fabric. Maybe you’re just looking to make your enterprise analytics more scalable and sustainable. Whatever the reason, you now have a couple of architectural decisions to make; specifically, where the data will be stored and how it will be organized. The choice of one influences the other, but since I can’t talk about both at the same time I’ll pick one to start: storage strategy. And, as with most every architecture question, the answer is “it depends.”

It used to be so easy. For years, analytics storage strategy was pretty straightforward. A centralized data warehouse. Some data marts. Maybe an operational data store if you must. Later, a data lake or lakehouse. But most everybody lived in the same neighborhood. Today, we don’t have to reflexively accumulate data into a centralized repository. Today, analytical data management can (and should) be the responsibility of the source system teams. Today, data can be located anywhere. In other words, Data Product data can be federated or consolidated (or both!).

With a federated architecture, data can be stored on different servers in different physical locations. 

Data is mastered remotely and accessed remotely. ETL tools can reach the data wherever it resides, and database platforms provide connectors that allow SQL query access across those locations. Data Fabric, Data Mesh, and Data Products all appear as emerging approaches on the Gartner Hype Cycle, and many vendors provide tools that support Data Product access within a federated architecture.
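To make that concrete, here’s a minimal sketch of what federated access can look like from a query engine’s point of view. I’m using the Trino Python client purely as an illustration; the host, catalogs, schemas, and table names are hypothetical, and any federated SQL engine with cross-source connectors would look similar.

```python
# Illustrative only: one SQL statement joining data that lives in two different
# source systems, routed through a federated query engine (Trino shown here).
# The host, catalogs, schemas, and tables below are hypothetical.
from trino.dbapi import connect

conn = connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT c.region, SUM(o.amount) AS total_sales
    FROM sales_pg.public.orders o            -- orders live in a Postgres source system
    JOIN crm_sf.contacts.customers c         -- customers live in a separate CRM store
      ON o.customer_id = c.customer_id
    GROUP BY c.region
""")
for region, total in cur.fetchall():
    print(region, total)
```

The consumer writes one query; the engine worries about where each table actually lives.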

Performance and usage patterns are the key factors when laying out the topology.

The most significant consideration with a federated architecture is the time that it takes to move the data between geographical locations. It is axiomatic that data has to be co-located in order to be associated or joined. Performance is overwhelmingly determined by the amount of data that has to be moved and the speed at which it can be moved. Modern databases are getting smarter about pushing processing down to the remote servers to minimize the amount of data crossing the network.
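As a concrete (and hedged) example of push-down, here’s roughly what it looks like when Spark reads a remote table over JDBC: the filter and column selection get translated into the remote database’s WHERE clause and projection, so only the reduced result crosses the network. The connection details and column names are made up for illustration.

```python
# Sketch: Spark pushes the filter and column pruning down to the remote database,
# so only matching rows and the two selected columns travel across the network.
# The JDBC URL, credentials, and schema below are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db.example.com:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "analyst")
          .option("password", "***")
          .load())

recent = (orders
          .filter(F.col("order_date") >= "2024-01-01")   # becomes a WHERE clause remotely
          .select("customer_id", "amount"))              # becomes column projection remotely

recent.explain()   # the JDBC scan shows PushedFilters in the physical plan
```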

My first encounter with data federation was in the mid-1990s. Back then, though, network bandwidth limited its usability. Consolidating the data was the only practical option. Now, networks carry data orders of magnitude faster, and keeping the data closer to the source system (or wherever) is more feasible. In fact, federation is commonplace in the cloud. Most of it happens behind the scenes, managed by the cloud providers. Data regularly moves between multiple data centers in different parts of the United States and around the world. I’ve long said that the separation of storage and processing makes it possible to use the same data for both operational and analytical purposes, but I’m not sure you want to do that. Not yet, at least.

The next consideration is how the data will be used. Self-service is a given. Data Scientists and more advanced users will access the data wherever it’s located and materialize data sets for use in subsequent analyses and/or model training. On the other hand, businessfolks prefer to use query, reporting, and visualization tools. When the data is federated, accessing, transferring, and composing it on the fly may take too long, consume too many resources, and result in excessive costs. In that case, materializing joins (which we’ll talk more about in Part 2) and consolidating data from multiple sources into the same repository may be preferable. Which brings us to the other end of the continuum.

With a consolidated architecture, data is stored together in (roughly) the same physical location.

This is what we’re all familiar with. Data mastered remotely and accessed locally. A Data Warehouse, preferably composed of Data Products. Until recently, “locally” meant within the same repository, but not necessarily anymore. The data could be in different repositories on the same server or in the same cloud. It could be stored in files or in some flavor of database (e.g., relational, columnar, graph, etc.). I suppose if you want to be pedantic about it, “consolidated” data would all be in the same repository, but I would extend that to include data on the same server or the same cloud platform in a single geographic location. And as with most every architecture answer, the boundaries are fuzzy. The point is that the data doesn’t have to travel very far, and certainly not across a wide area network.

The most obvious benefit of a consolidated architecture is that query performance will almost always be better. I’m sure that someone could come up with a contradictory use case, but that would be the exception. The most obvious cons of a consolidated architecture are that the data will have to be transported to the central repository, those transport processes will have to be implemented and managed, and the data will have to be replicated (again, we’ll talk more about materialization in Part 2). This may result in more work for the source system teams. Work that they’re probably not super-excited to be doing in the first place. One option for incentivizing this approach is for the enterprise analytics team to pay the storage, processing, and network costs associated with preparing and transporting the data, rather than leaving those costs with the source system team.
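For what it’s worth, the transport work itself doesn’t have to be exotic. Here’s a hedged sketch of the kind of incremental load a source system team might own: pull the rows that changed since the last run and land them where the consolidated platform can pick them up. The connection string, watermark handling, and storage path are all hypothetical.

```python
# Sketch of an incremental transport job: extract rows changed since the last
# load and land them in the central repository. Names and paths are hypothetical;
# a real job would read and update the watermark from a load-control table.
import pandas as pd
from sqlalchemy import create_engine, text

source = create_engine("postgresql://loader:***@source-db.example.com/sales")
last_watermark = "2024-06-01T00:00:00"

changed = pd.read_sql(
    text("SELECT * FROM orders WHERE updated_at > :ts"),
    source,
    params={"ts": last_watermark},
)

# Land the increment where the consolidated platform ingests it.
changed.to_parquet(
    f"s3://analytics-landing/orders/updated_after={last_watermark}.parquet",
    index=False,
)
```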

All that said:

From the Data Consumer’s perspective, it shouldn’t matter. 

The Data Product Catalog simply presents a palette of assets. The query language, data preparation tool, AI modeling workbench, or analytics tool handles the access details. It is useful, though, for the Data Consumer to be aware of federated data access to avoid unpleasant surprises when their analytics consumption bill arrives.

From the Data Producer’s perspective, it shouldn’t matter. 

As the enterprise transitions to a modern analytics architecture, one change is that the enterprise analytics team is no longer responsible for data preparation, but for providing the tools and processes used by the source system teams that prepare, move, and curate the data. The enterprise team then keeps track of where the data is, centralizes the information about that data, and makes it available for use. When determining the physical implementation of Data Products, consider using open table formats (like Iceberg or Delta), but that shouldn’t matter to the Data Producer either.
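If it helps to picture it, here’s a minimal sketch of publishing a Data Product’s physical table in an open table format, using Delta with PySpark (Iceberg would be analogous). It assumes a Spark session configured with the delta-spark package; the paths and names are illustrative.

```python
# Minimal sketch: writing a curated Data Product table in an open table format
# (Delta shown; Iceberg is analogous). Assumes the delta-spark package is on the
# classpath; storage paths are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("publish-data-product")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

orders = spark.read.parquet("s3://landing/sales/orders/")   # curated by the source team

(orders.write
       .format("delta")
       .mode("overwrite")
       .save("s3://data-products/sales/orders"))   # any Delta-aware engine can read this
```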

Finally, chances are that you’re not starting with a clean slate. You probably already have an established data warehouse, data lake, and/or data lakehouse. Consider building up your Data Product inventory incrementally, starting fresh with new data and creating Data Products from existing assets when the opportunity presents itself. But be intentional about when to federate and when to consolidate. Don’t just put the data somewhere else to put it somewhere else. There are performance consequences. And don’t just reflexively put everything into a central location either. There can be data freshness and process overhead consequences. As always, “it depends.” Understand the trade-offs and be sensitive to cost, effort, and user experience in your decision-making process. Architecture is equal parts art and science.

In Part 2 we’ll look at the other consideration: how the data content is organized.

