Part 1 of this Data Product Architecture series covered the “where” of Data Product data implementation: federated or consolidated or some combination of the two. Here in Part 2, we’ll get into the “how” of Data Product data implementation: virtual or materialized or some combination of the two. Which to select? Again, as with most every architecture question, the answer is “it depends.”
The difference between virtual and materialized comes down to when the calculation, association, and transportation are done. The key decision factors are expected resource consumption and performance requirements.
In a virtual data architecture, complexity is abstracted behind a view, and association, calculation, and federation are accomplished at execution time.
Let me start by saying that “virtual” is a bit of a misnomer. If you have a suggestion for a better term, please let me know. After all, it would be logical to consider access through views as the distinguishing characteristic of a virtual data architecture, but that’s not the case. Some repositories (e.g., Teradata) recommend that all data be accessed through views, even when the view is defined as “SELECT * FROM table.”
The differentiator isn’t the view itself, but rather what happens within the view. Common metric calculations, table associations, and remote table access are incorporated into the view. Those details are abstracted from the user, and as a result, they don’t have to type them in every time (or click or drag or you know what I mean).
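For instance, a single view might bundle all three. The sketch below uses invented table and column names and PostgreSQL-style syntax (the remote table is assumed to be exposed through something like postgres_fdw):

```sql
-- Hypothetical virtual Data Product. The consumer queries one object,
-- but every execution recomputes the metric, performs the join, and
-- reaches across the network to the remote orders table.
CREATE VIEW vw_sales_margin_by_region AS
SELECT r.region_name,
       SUM(o.revenue - o.cost) AS total_margin  -- common metric calculation
FROM   remote_db.orders o                       -- remote table access
JOIN   regions r ON r.region_id = o.region_id   -- table association
GROUP BY r.region_name;

-- What the consumer sees: a simple query against a single object.
SELECT region_name, total_margin
FROM   vw_sales_margin_by_region;
```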
Factors to consider when contemplating a virtual data architecture:
Pros:
- Abstracted complexity
- Definitional and implementation consistency
- Access to the most currently available data
- No need for additional ETL processes
Cons:
- Potential for unexpectedly enormous resource consumption
- Database optimizers often have a hard time efficiently executing queries built on complex views
- Sub-optimal performance
Since the work of calculating, associating, and accessing the data is hidden from the user, and must be repeated every time the query is executed, the potential for query execution cost sticker shock is very high. What looks like a simple query accessing a single object could end up consuming most of this year’s cloud budget (and some of the next). After all, the consumer doesn’t necessarily have any insight into the resources required. And why should they? The whole point is to hide that complexity.
Nevertheless, it’s still important for the users to be aware of what they’re about to do. Perhaps when a virtual data architecture is employed, the user should be given a rough idea of cost and/or complexity prior to execution.
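One lightweight way to do that, sketched below against the hypothetical view from earlier and assuming a PostgreSQL-style engine, is to surface the optimizer’s estimate before the real query runs:

```sql
-- EXPLAIN returns the query plan and the optimizer's cost estimate
-- without executing the statement, giving the consumer a rough
-- preview of what the "simple" view will actually cost.
EXPLAIN
SELECT region_name, total_margin
FROM   vw_sales_margin_by_region;
```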
Lastly, in some companies, the only available metadata is the table name. When there’s no other option, consider including some indication of complexity in the table name. I’m not a big fan of embedding metadata in this way, but sometimes you have to do the best you can with what you have to work with.
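Something like the following, where the __hc (“high cost”) suffix is a purely invented convention:

```sql
-- The name itself carries the warning: __hc flags a view that hides
-- expensive calculations, joins, or remote access.
SELECT region_name, total_margin
FROM   vw_sales_margin_by_region__hc;
```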
In a materialized data architecture, complexity is abstracted behind a dataset, and association, calculation, and federation are accomplished prior to execution time.
Materialized Data Products can be thought of as materialized views or data mart tables in a traditional analytics architecture. The metrics are calculated, the tables are joined, and the data is consolidated in anticipation of future use.
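Continuing the earlier sketch (same invented names, PostgreSQL-style syntax), the materialized version computes the metric and performs the join once, at refresh time, rather than at every query:

```sql
-- Hypothetical materialized Data Product: the same metric, join, and
-- remote access as the virtual version, but performed in advance.
CREATE MATERIALIZED VIEW mv_sales_margin_by_region AS
SELECT r.region_name,
       SUM(o.revenue - o.cost) AS total_margin
FROM   remote_db.orders o
JOIN   regions r ON r.region_id = o.region_id
GROUP BY r.region_name;

-- Consumers read precomputed rows; the heavy lifting happens here,
-- on whatever schedule the owning team chooses.
REFRESH MATERIALIZED VIEW mv_sales_margin_by_region;
```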
Factors to consider when contemplating a materialized data architecture:
Pros:
- Abstracted complexity
- Definitional and implementation consistency
- Performance and resource consumption optimization
Cons:
- Generally does not access the most currently available copy of the data
- Requires additional ETL processes
- Process ownership assignment and overhead
Materialization trades process overhead and data replication for performance optimization and lower recurring resource consumption. Performing complex calculations, associating datasets, and moving data, especially across the wide area network, takes time and consumes resources. Materialization is particularly useful when consumers repeatedly employ the same calculations and/or access and associate the same datasets (especially when they are federated). Assuming you’ve chosen wisely, the initial investment will quickly pay dividends in lower resource consumption.
Materialization, of course, comes at the cost of replicated data, additional processes, and the assignment of people to support and maintain those processes. Securing organizational ownership can also be a challenge.
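To make that overhead concrete, here is a sketch of the kind of recurring job someone has to own; it assumes the pg_cron extension and the hypothetical materialized view from above:

```sql
-- Hypothetical nightly refresh scheduled with pg_cron. Somebody must
-- own this job: monitor it, handle failures, and revisit the schedule
-- as consumption patterns and freshness requirements change.
SELECT cron.schedule(
  'refresh-sales-margin',                                -- job name
  '0 2 * * *',                                           -- daily at 02:00
  'REFRESH MATERIALIZED VIEW mv_sales_margin_by_region'  -- command
);
```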
A modern analytics architecture is likely to include both virtual and materialized data, as well as federated and consolidated repositories. It is the responsibility of the enterprise analytics team to understand the trade-offs and constraints along each dimension to come up with the best combination.
Someday, and I believe it will be sooner rather than later, artificial intelligence engines will analyze resource utilization, determine the optimal topology, and automatically locate and aggregate the data. We’ve already seen this kind of utilization analysis and optimized data placement in advanced relational database engines. It’s not a stretch to imagine it on a wider scale across the enterprise analytics estate. I’m really looking forward to seeing that one.