Workload Mobility in a Service Mesh World
Service mesh has been on an interesting journey (service mess joke here). If you look back at its history, you’ll see a huge focus on the “zero-trust security” aspects of mesh. Effortless, automatic encryption between services has a great ring to it. We build on the foundation that the micro-segmentation vendors laid by telling a story of securing service-to-service communication. For my own team, this was especially important, since we’ve chosen to focus on a multi-runtime use case (Kubernetes, virtual machines, other container workloads, etc.). Being able to deliver a “platform” that enables that security functionality out of the box, without the complexities of overlay networking, feels like a great path.
The thing about it, though, is that zero-trust security is a fairly well-understood problem in the service mesh space. I’d argue that, with the exception of a few specific features that have come out along the way, the true differentiation is really just the user experience of consuming service mesh in this way. So, beyond zero-trust, what’s the next chapter in service mesh?
Same Problem, Different Data Center
If there’s one consistent trend that’s followed me through my tech career, it’s that there’s always a migration. It’s either happening now, or it’s happening soon. In the early days, it was migration between application stacks and life-cycling an application for the next version or a new version entirely. From there, it was data center migrations and failover scenarios. Then, it became migrating to the cloud (which is just someone else’s data center, anyway). Somewhere mixed in is moving from hardware to virtual machines, to containers, to Kubernetes and whatever else is next.
We started this journey building our “house” on a secure foundation. This foundation allowed us to ensure that workloads that lived within our walls were secure with minimal friction. That security comes by way of the automatic mTLS capabilities within the service mesh and traffic permission policies that allow us to control which traffic we pass through to services. With this foundation in place, we can start to return to another constant – creating workload mobility.
Creating Environment Consistency
The challenge we’ve always seen with workload mobility is the lack of consistency between environments. Sometimes this inconsistency is created by the administrative teams that own the platforms. Other times, it’s more architectural, like how the environment was designed. Inconsistencies make migrations a lot harder. Conversely, if you’re migrating workloads from an on-premises Kubernetes cluster to a cloud-based Kubernetes service, your migration is much simpler because the runtime platform is consistent between the environments.
One of the understated outcomes of a service mesh is establishing a shared control plane between environments that are a part of that mesh. We can use this shared control plane to act as a consistency layer between environments that are often anything but consistent. This control plane allows us to apply policies (more on that in a bit) that, in turn, are translated to all the workload platforms that have joined the control plane. In the case of Kuma (the Envoy-based service mesh donated to the CNCF), this includes workloads that operate across several different runtime environments such as bare metal, virtual machine, Kubernetes and even platforms like Amazon ECS.
Speaking directly to workload migration – if we’ve created a common place for communication-based policy to live, we can use that control plane to orchestrate the way traffic routes between these environments. These policies are most commonly referred to as traffic routing policies and cover concepts such as routing, splitting and failover between environments.
Progressive Delivery, Migration and Controlling Flow
Stopping the flow of a river and changing its path is actually a pretty challenging task. Conversely, if you can start water flowing down a new path, gradually allowing more water to flow down that second channel makes it easier to “migrate” to the new stream. This same concept applies to workload migration and mobility.
In a service mesh, workloads register themselves via some form of service discovery mechanism, which makes the control plane aware of the paths that exist to reach them. From here, we as operators or developers can use that control plane to influence the routing decisions. A practical example of this might be that api_default_svc_5000 represents an API service, but ultimately two sets of services have registered to that name, with different metadata. The service mesh can use that metadata, combined with a traffic routing policy, to control the share of traffic sent to each set of applications registered under that service name.
This example holds true when we apply the same logic to workload mobility, specifically moving from a virtual machine to a Kubernetes environment. The traditional virtual machine-based application might register as apitier_default_svc_5000, with a metadata tag of kuma.io/vm-env, while the Kubernetes-based application registers under the same name with a metadata tag of kuma.io/kube-env. In this case, we can leverage the service mesh to gradually shift traffic between these environments until the control plane is directing all of it to the new location in Kubernetes.
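To make the splitting idea concrete, here is a minimal sketch in Python of how a weighted split between the two tagged sets of instances could behave. It is not Kuma's policy format or API: the Instance and Split types, the pick_instance function and the addresses are all hypothetical, and tags are simplified to bare labels using the names from the example above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One registered workload (hypothetical type, not a Kuma object)."""
    address: str
    tags: frozenset


@dataclass(frozen=True)
class Split:
    """Relative share of traffic for instances carrying a given tag."""
    tag: str
    weight: int


def pick_instance(instances, splits, rng):
    """Choose a tag by weight, then any instance that carries that tag."""
    # Ignore splits with zero weight or with no registered instance behind them.
    active = [
        s for s in splits
        if s.weight > 0 and any(s.tag in i.tags for i in instances)
    ]
    if not active:
        raise LookupError("no instance matches any weighted split")
    chosen = rng.choices(active, weights=[s.weight for s in active], k=1)[0]
    candidates = [i for i in instances if chosen.tag in i.tags]
    return rng.choice(candidates)


# Two sets of instances registered under the same service name
# (apitier_default_svc_5000); the addresses are made up for illustration.
INSTANCES = [
    Instance("10.0.0.5:5000", frozenset({"kuma.io/vm-env"})),
    Instance("10.1.0.9:5000", frozenset({"kuma.io/kube-env"})),
]

# Send roughly 90% of requests to the VM and 10% to Kubernetes.
SPLITS = [Split("kuma.io/vm-env", 90), Split("kuma.io/kube-env", 10)]
```

Calling pick_instance(INSTANCES, SPLITS, random.Random()) repeatedly should land on the Kubernetes instance for roughly one request in ten. Migrating is then a matter of changing the weights, not the callers.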
From a progressive delivery standpoint (shout-out to James Governor at RedMonk), we would start this migration slowly, allowing only a small amount of traffic initially and gradually ramping it up. We take this approach to limit the blast radius of a service change in the environment. As the service mesh is acting as our central control point, we have a common control center to manage this “flow” and back out the changes (blocking the channel) as needed.
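The ramp-and-back-out loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular mesh implements it: the ramp function, the step percentages and the is_healthy callback are hypothetical stand-ins for whatever applies a routing policy and checks your own health signals.

```python
def ramp(steps, is_healthy):
    """Shift traffic to the new destination one step at a time.

    steps      -- strictly increasing percentages (1..100) for the new destination
    is_healthy -- callback taking the current percentage, returning True or False

    Returns the (old_weight, new_weight) pairs that were applied, in order.
    On the first failed health check, the last entry is the rollback (100, 0).
    """
    previous = 0
    for pct in steps:
        if not previous < pct <= 100:
            raise ValueError("steps must be strictly increasing and within 1..100")
        previous = pct

    applied = []
    for pct in steps:
        applied.append((100 - pct, pct))  # open the new channel a little more
        if not is_healthy(pct):
            applied.append((100, 0))      # block the channel: all traffic back to the old path
            break
    return applied
```

For example, ramp([5, 25, 50, 100], check) would apply 95/5, then 75/25, and so on, where check is your own health probe. If the probe fails at any step, everything goes back to the original destination.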
Establishing service mesh in this pattern provides you with a different foundation than zero-trust security. It gives you a foundation of self-service management of traffic routing. We can use this model to create intelligent failover scenarios, migrate workloads or enable cross data center traffic patterns. Historically, these were problems solved by the network team with complex routing and switching patterns. Service mesh evolves this model by providing an API-driven, self-service way to manage application connectivity.
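As a rough illustration of the failover case, the Python sketch below prefers healthy endpoints in the local environment and only falls back to other environments, in a stated order, when none are available. The Endpoint type, the zone names and the failover_candidates function are hypothetical, and real meshes layer load balancing and health checking on top of a decision like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A destination for a service (hypothetical type for illustration)."""
    address: str
    zone: str
    healthy: bool


def failover_candidates(endpoints, local_zone, fallback_zones):
    """Return healthy endpoints from the first zone that has any.

    The local zone is tried first, then each fallback zone in the given order.
    An empty list means no zone currently has a healthy endpoint.
    """
    for zone in [local_zone, *fallback_zones]:
        healthy = [e for e in endpoints if e.zone == zone and e.healthy]
        if healthy:
            return healthy
    return []
```

The ordering of fallback_zones is the policy here: changing it changes where traffic goes during an outage, without touching the applications themselves.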
The Next Generation of Service Mesh
Zero-trust in service mesh is quickly becoming table stakes. When we look at common problems in customer environments, we see plenty of struggles with managing existing traffic patterns. Beyond these existing traffic patterns, we see new avenues into observability and how we connect environments. This need will continue to grow as users look to solve new versions of old problems in workload migration and mobility.