Big Little Lies in Persistent Data Orchestration

October 14, 2020 / October 13, 2020 · Douglas Fallstrom · data orchestration, data storage, kubernetes, metadata

When vendors scramble to solve intractable technology problems, they sometimes make claims that have, let’s say, a tenuous relationship with reality. Take the subject of data orchestration for container-based applications. As more developers use Kubernetes, they keep hitting a wall when it comes to working with persistent data at scale. Storage vendors claim they have a solution. Pull back the curtain, though, and what they’re proposing doesn’t make much sense.

The truth is, the biggest barriers to data orchestration in Kubernetes can’t be solved by new storage solutions. We need to fundamentally reimagine how we manage data—how we think about data itself.

The Problem of Persistent Data

Kubernetes has grown so quickly for a simple reason: It makes developers’ lives easier. By building applications as collections of lightweight containers that can be instantiated anywhere, developers no longer have to think about IT infrastructure. Kubernetes handles all the details—instantiating application resources, protecting them, scaling them, doing everything else needed to run in an IT environment—without developers having to spell out how.

It’s a great model for ephemeral applications that spin up some compute, do something and then go away. But what about applications that work with persistent enterprise data, the kind you do need to back up, secure and keep compliant? There’s no good way to bring persistent data into the Kubernetes world, at least not without forcing developers to learn all the gory details of how storage technologies work.

To get persistent data to work like compute—so it’s just there when applications need it, with all the right IT considerations automatically baked in—we need to solve two big problems:

Applications are portable, but data is siloed. In Kubernetes, compute can be treated as simple, portable and disposable, but persistent application data cannot. “Data has gravity,” to use a cliché. There are all sorts of requirements around it, and it’s usually kept in a storage silo—one vendor’s proprietary infrastructure, which doesn’t play well with other vendors’ infrastructures by design. So, even if the rest of your application is portable, your data is not. You still need specialized knowledge of those proprietary storage infrastructures to get to it and then work with IT to move it.
Performance, reliability and manageability suffer at scale. In large production environments, you can scale up compute from tens of containers to tens of thousands almost effortlessly. But, to feed all that compute as it scales, your storage needs to copy, manage and protect application data just as quickly. Your application ends up handcuffed to your storage infrastructure, which just can’t keep up.

Storage Vendors to the Rescue?

It’s only natural for the industry to turn to storage vendors to help solve this problem. They’re the data experts, right? So far though, they haven’t cracked the code. Worse, they make claims about the role of storage in Kubernetes that don’t stand up to scrutiny. Claims such as:

“You need purpose-built storage for containers.” Up to now, storage vendors’ solutions to the data gravity problem have always involved—you’ll never guess—buying new storage infrastructure. Vendors say this “container-native” storage integrates seamlessly with Kubernetes. But, developers still need to understand and code for how these specialized infrastructures work. In a container-based world, you should be able to think about storage the same way you think about compute: It should be disposable, unnamed and anonymous. If you have to think about a vendor’s storage infrastructure at all when building applications, you’re missing the whole point of Kubernetes.
“New storage solutions provide data orchestration.” Storage vendors like to talk about “data orchestration,” but it’s almost always just a euphemism for storage. The reality is, storage infrastructure doesn’t scale well. The latest techniques for provisioning storage may be better than they used to be, but they’re still fundamentally about managing infrastructure, not orchestrating data.
“Storage vendors are the only ones who can solve this problem.” Here, we get to the core of the issue. Storage vendors have spent lots of energy retrofitting their solutions for containers. Fundamentally though, they’re still stuck in a model of replicating volumes and managing silos. To solve the data gravity problem, we need to find a way to orchestrate data that doesn’t depend on any particular infrastructure. Yet, we’re asking companies whose entire businesses are built around selling specialized storage infrastructure to figure that out for us.

Set Data Free

The reason we’re still grappling with the persistent data problem in Kubernetes is that we’ve been thinking about it all wrong. This is not a storage problem. It’s a data problem. To solve it, we need to turn to the same model that Kubernetes uses to free applications from compute infrastructure. We need to untether data from storage infrastructure. This entails:

Disaggregating metadata: If data’s gravity is stopping us from making it available everywhere easily, what can we use instead that has no mass? The answer is metadata. With a very small amount of information, you can describe and present all your enterprise data—petabytes of it. That lightweight metadata can move anywhere, just as quickly as compute. Just as important, it can stay synchronized everywhere, all the time, without having to constantly copy data volumes. Suddenly, your data is portable. And the data gravity problem disappears.
Orchestrating data: Once you’ve disaggregated metadata from data, you can use that metadata to present a single, global file system to applications. That file system can span geographies, technologies and vendors. Data stored in the cloud, on-premises, with a service provider, in any vendor’s storage: it all now looks like one logical system to applications. Data management works entirely through metadata, independent of the underlying infrastructure. And, you can manage your data at the level of files instead of volumes. You can do real, legitimate data orchestration, not just storage provisioning.
Tying enterprise data services to data: Once you can manage the control plane (the metadata) and data plane (the data itself) independently, the system can do all sorts of smart things in the background. It can move just the right amount of data where it needs to be to serve an application, automatically take actions to meet compliance requirements and ensure reliability and availability. All your IT data management services are now defined within and tied to the metadata, not whichever storage appliance the data happens to live on at any given moment.
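
To make the three steps above a little more concrete, here is a minimal sketch in Python of what a metadata-driven control plane might look like. It is an illustration only, not any product's API: the names FileRecord and MetadataCatalog, the backend labels such as "onprem-nas", and the "allowed_locations" policy key are all hypothetical. The sketch assumes each file is described by a small record holding its logical path, its current location and the policies attached to it.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Metadata for one file: small, and independent of where the bytes live."""
    path: str          # logical path that applications see
    size_bytes: int
    location: str      # hypothetical backend label, e.g. "onprem-nas"
    policies: dict = field(default_factory=dict)  # data services tied to the file


class MetadataCatalog:
    """Hypothetical control plane: one global namespace over many backends."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.path] = record

    def resolve(self, path):
        """Return the backend that currently holds the bytes for a logical path."""
        return self._records[path].location

    def move(self, path, new_location):
        """Record a relocation. A real system would also copy the bytes in the
        background; the logical path applications use does not change."""
        self._records[path].location = new_location

    def violations(self):
        """Paths whose current location breaks their own placement policy."""
        return sorted(
            r.path for r in self._records.values()
            if "allowed_locations" in r.policies
            and r.location not in r.policies["allowed_locations"]
        )


# Example: two files on different backends, presented under one namespace.
catalog = MetadataCatalog()
catalog.register(FileRecord(
    "/projects/genomics/run1.bam", 50 * 10**9, "onprem-nas",
    {"allowed_locations": {"onprem-nas", "cloud-eu"}},
))
catalog.register(FileRecord("/projects/genomics/notes.txt", 2048, "cloud-us"))

# Relocating a file changes only its metadata record, not its logical path.
catalog.move("/projects/genomics/run1.bam", "cloud-us")
print(catalog.resolve("/projects/genomics/run1.bam"))
print(catalog.violations())
```

The point of the design is that the policy check never touches a storage system: it works at the level of individual files, using only the metadata, so the same check applies whichever vendor's storage holds the bytes.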

By taking these steps, we can break the handcuffs keeping persistent application data shackled to IT infrastructure. Metadata can now go anywhere applications go, just as quickly. Data management services are now tied to the data, not the storage it’s living on, and they run automatically without developers having to worry about how.

If all this sounds familiar, it’s because that’s just how Kubernetes works for everything else. It’s time to bring the same speed, simplicity and automation to data.