In an earlier article, I highlighted the approach that some companies are taking of allowing, and in some cases encouraging, uncontrolled data replication.
Replication may be an intentional executive strategy, although experience suggests it might not be the smartest strategy. Replication may also be a byproduct of repository selection. File-based repositories, especially the raw object stores used in most Data Lakes, struggle with WHERE clauses and joins. So instead, specific result sets are stored. Files for each report, dashboard, data mart, and application. And why not? Cloud processing is expensive, disk space is practically free, and besides, my results are already there when I need them. We’ve already seen why not.
While replication usually begins as a way to accelerate deployment and reduce costs, it eventually fails to sustain either.
This week we’ll look at what we can do about it.
In many cases, the answer will appear to be “nothing,” or at least very little. We might see the challenges that accompany uncontrolled replication, but the inevitable early successes will very visibly contradict our warnings.
This leaves us with two choices. The first is to salute and execute. Let the pieces fall where they may. Of course, we’ll have to clean up the mess later. Don’t expect credit for having foreseen the problem. You may have been correct, but at most companies it’s career-limiting to remind management of the fact.
The second is to:
Look for opportunities to influence architecture and process in ways that limit the damage, even if it is only around the edges for the time being.
Here are a few you can start with.
1. Sandbox Management
A mature enterprise analytics architecture will include independent repositories created as sandboxes for exploring new use cases and prototyping. This is a good thing. Accommodate them. Plan for them. Encourage them. But be sure to manage them and to isolate them.
Do not allow sandboxes to become permanent and do not allow sandbox content to become public.
The processes to request and create sandboxes should be as fast and as frictionless as possible. The same for promotion to production, but proper curation is required. (If you’ve got ongoing uncontrolled replication you probably aren’t pursuing a Data Product strategy, so curation, at minimum, must include definitions, expected content, authoritative sources, and security and privacy requirements.) Of course, demand for even the slightest curation will probably be considered an insurmountable obstacle with the attendant crying, wailing, and threats of missed deadlines. Hold your ground if at all possible.
In response, many teams will say, “Fine, you need curation to move this into production. I don’t want to take the time to do that so I’m just going to continue to use it as it is, where it is.” This is where having hard sandbox expiration dates becomes critical. Otherwise, you’re going to have uncontrolled sandbox replication. Not an improvement.
Establish the expiration date when the sandbox is created, and delete it on that date. Automate the process so nobody has to remember to do it, or risk being talked into an extension. You’re going to find out how committed your management is to a rational information architecture the first time somebody wants an extension. And then another. And another. And another.
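A minimal sketch of that automation might look like the following. It assumes a hypothetical in-memory registry; the names `Sandbox`, `create_sandbox`, and `expire_sandboxes`, the 30-day default lifetime, and the `delete` callback are all illustrative and not taken from any particular platform. The point is that the date is fixed at creation and there is deliberately no function for extending it.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)  # frozen: the expiration date cannot be edited later
class Sandbox:
    name: str
    owner: str
    expires_on: date


def create_sandbox(registry, name, owner, today, lifetime_days=30):
    """Register a sandbox with an expiration date set at creation time."""
    sandbox = Sandbox(name, owner, today + timedelta(days=lifetime_days))
    registry[name] = sandbox
    return sandbox


def expire_sandboxes(registry, today, delete):
    """Delete every sandbox whose expiration date has arrived.

    `delete` is a caller-supplied callback that drops the actual storage
    (hypothetical here; what it does depends on the platform).
    """
    expired = [s for s in registry.values() if s.expires_on <= today]
    for sandbox in expired:
        delete(sandbox)
        del registry[sandbox.name]
    return [s.name for s in expired]
```

Run `expire_sandboxes` from a daily scheduled job so that deletion never depends on anyone's memory or goodwill.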
Sandbox content must also be isolated. Even if the corporate data bloodstream is contaminated with uncontrolled data from everywhere else, you’re just trying to keep it from getting worse. Isolation also prevents these sandboxes from becoming permanent. As soon as one downstream process becomes dependent upon sandbox content, the sandbox becomes infinitely more difficult to expire. That’s often the point. It’s called “burrowing.” Increase dependency to ensure permanence. Recognize that objective and don’t let it happen.
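One way to catch burrowing early is to scan known dependencies for anything outside a sandbox that reads from one. The sketch below assumes you can obtain (consumer, source) pairs from job configurations or a lineage tool, and that sandbox objects share a naming prefix. Both the `sandbox.` prefix and the function name `find_burrowing` are assumptions made for illustration.

```python
def find_burrowing(dependencies, sandbox_prefix="sandbox."):
    """Return (consumer, source) pairs where a non-sandbox consumer reads sandbox data.

    `dependencies` is an iterable of (consumer, source) name pairs.
    Sandbox-to-sandbox reads are not flagged; only reads that cross
    the isolation boundary are.
    """
    violations = []
    for consumer, source in dependencies:
        reads_sandbox = source.startswith(sandbox_prefix)
        consumer_is_sandbox = consumer.startswith(sandbox_prefix)
        if reads_sandbox and not consumer_is_sandbox:
            violations.append((consumer, source))
    return sorted(violations)
```

Any pair this returns is a dependency that will later be used to argue against expiration, so it is worth surfacing while it is still cheap to remove.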
2. Foundational Data Products
Even in an uncontrolled environment, Data Products can be incrementally introduced to begin to improve consistency. I’ve talked a lot about Data Products elsewhere so I won’t spend much time here. Suffice it to say, start with Foundational Data Products to establish a clean layer of standardized, validated data. Introduce quality measurement to increase reliability. More importantly, begin to demonstrate the benefits of proper management and draw a contrast between the controlled and uncontrolled data.
Next, layer on canonical metrics in Composed Data Products with shared definitions to prevent conflicting numbers. Emphasize to the development teams that:
Everybody can build their own pipelines, but nobody can define their own meaning of data.
Of course, none of this will work with heavy, centralized approval processes. Leverage existing distributed processes as much as possible. Automate as much as possible. Make doing things the right way the easy way.
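The "own pipelines, shared meaning" rule can be enforced with very little machinery. As a rough sketch, assume a hypothetical `MetricRegistry` that any team can write to without an approval queue, but that refuses a second, conflicting definition of an existing metric name. The class, its methods, and the idea of storing definitions as plain text are illustrative assumptions, not a description of any specific catalog product.

```python
class MetricRegistry:
    """Holds one shared definition per canonical metric."""

    def __init__(self):
        self._definitions = {}

    def register(self, name, definition, owner):
        """Add a metric; refuse a conflicting definition of an existing name."""
        existing = self._definitions.get(name)
        if existing is not None and existing["definition"] != definition:
            raise ValueError(
                f"'{name}' is already defined by {existing['owner']}; "
                "reuse it or propose a change to the shared definition"
            )
        # Re-registering an identical definition is harmless and keeps the first owner.
        self._definitions.setdefault(name, {"definition": definition, "owner": owner})

    def lookup(self, name):
        """Return the shared definition that every pipeline should use."""
        return self._definitions[name]["definition"]
```

The check is automatic and immediate, which keeps the right way the easy way: teams are blocked only when they try to redefine something that already has a meaning.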
3. Communication
I’ve said it many, many times:
The best and sometimes only leverage that an enterprise analytics team has is communication.
Identify the intersection of metrics that will resonate with your audience and metrics that will drive progress toward more rational management of the analytics environment.
Perhaps display on your website the counts of core, application, and user tables/files as well as the disk space consumed. Make the costs visible. Be sure to include the people required by each team to manage and support the repositories and pipelines. After all, each team may only be consuming a relatively small amount, but it all adds up. Make the list of data tickets/issues/questions public. Quantify the expense associated with data reconciliations and delayed delivery through after-action investigations. Stop wasting money without even knowing it. At least know it.
Now, you’re going to get pushback. Expect it. Be prepared for it. Nobody likes having their sins exposed.
When you start publishing metrics, don’t make it personal. Just give aggregate results. Just show the totals.
Maintain the detail behind the scenes but don’t publicize it. Eventually, somebody’s going to want to see the results by organization or individual. You can reluctantly agree. After all, you don’t want to call anybody out specifically. You just want what’s best for the enterprise. “But if you want that information out there, then I will do that for you.”
Many organizations approach establishing this kind of communication from the other direction, with individuals and departments on Hall of Fame or Hall of Shame rankings. Starting this way increases the likelihood that you’re going to get shut down right out of the gate by a disgruntled Hall of Shame inductee who complains to management.
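Publishing totals while keeping the detail in reserve is straightforward to sketch. The example below assumes per-team records with the hypothetical keys `team`, `category`, `objects`, and `gigabytes`; the function name `aggregate_footprint` is likewise illustrative. The published output contains no team names at all.

```python
from collections import Counter


def aggregate_footprint(records):
    """Collapse per-team detail into enterprise totals suitable for publishing.

    Each record is a dict with the hypothetical keys 'team', 'category'
    ('core', 'application', or 'user'), 'objects' (table/file count),
    and 'gigabytes'. The 'team' key is ignored on purpose.
    """
    objects = Counter()
    gigabytes = Counter()
    for record in records:
        objects[record["category"]] += record["objects"]
        gigabytes[record["category"]] += record["gigabytes"]
    return {
        category: {"objects": objects[category], "gigabytes": gigabytes[category]}
        for category in sorted(objects)
    }
```

Keep the underlying records so the breakdown by organization is ready when somebody asks for it, but let the request come from them.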
4. Reward Sharing
This also falls under the heading of communication, but deserves its own bullet point. Most companies incentivize empire-building. Create the new widget and you get kudos and chicken lunches. Leverage something someone else built, delivering in a fraction of the time, and it goes unnoticed.
Do not reward the creation of something new before asking why something existing wouldn’t work.
Incentivize reuse. Incentivize sharing. Incentivize contributing improvements. If the culture rewards speed above all else, you’ll get uncontrolled replication no matter how good the architecture is.
Expose duplication and promote sharing.
All of this requires culture change, and that may be the hardest thing of all to influence. Do what you can (which is what you’re already doing). Set a good example. Decentralized execution requires centralized standards and a culture that enforces discipline. Unfortunately, most organizations don’t have either. In fact, most have the opposite: decentralized standards and no data discipline.
Analytical replication driven purely by demand drifts into chaos.
Teams need autonomy, but within governed systems. Without it, you’re wasting money and buying yourself a future reconciliation and consolidation project. Get it right the first time.
Decentralization has been successful in organizations where replication happens at the edges, but the core data is still standardized. Duplication is intentional, not accidental or reflexive. Inconsistency is identified, described, and quantified, and never simply ignored or tolerated.