Reducing Kubernetes Costs With Chaos Engineering

Managed Kubernetes clusters make it easy for teams to quickly deploy and run containerized workloads at scale. The pay-as-you-go pricing model of cloud Kubernetes is great for scaling workloads, but costs can climb quickly as your cluster grows.

Many Kubernetes cloud providers will cover the costs of running the Kubernetes control plane, which leaves consumers to pay for the worker nodes and other cloud resources. The benefit is that consumers have direct control over the price of the cluster by adjusting node capacity and node count. The downside is that if left unchecked, these costs can quickly add up to an unexpectedly large bill.

Let’s look at some of the top cost drivers for managed Kubernetes services and how you can reduce them using chaos engineering.

What is Chaos Engineering?

Chaos engineering is the science of performing intentional experimentation on a system by injecting precise and measured amounts of harm. This allows you to observe how the system responds for the purpose of improving its resilience. The goal of chaos engineering isn’t to create chaos, but to prepare our systems to withstand conditions that would otherwise cause degraded performance, instability or failure.

Think about what happens in a Kubernetes cluster when a worker node is shut down. We’d expect Kubernetes to shut down any pods running on the node and schedule them somewhere else. But how confident are we that this will actually happen? Have we tested it before? Do we know how long it takes to shut down existing containers and spin up new copies? Can we redirect traffic to replicas running on other nodes without interrupting users’ sessions? Do we even have enough replicas running on other nodes to handle this extra load while the new replicas start?

To answer these questions, we could just wait for a node to shut down automatically in production, but that’s risky. Instead, we can use chaos engineering to run this as an experiment in a safe and controlled way so that we can validate the behavior of our Kubernetes cluster even before we deploy to production.
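One building block worth having in place before running an experiment like this is a PodDisruptionBudget, which tells Kubernetes how many replicas of a workload must stay available during voluntary disruptions such as node drains. Here is a minimal sketch; the "checkout" name, labels and counts are placeholders for illustration:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2            # never let voluntary evictions drop below two ready pods
  selector:
    matchLabels:
      app: checkout          # placeholder: match your Deployment's pod labels

Keep in mind that a PodDisruptionBudget only governs voluntary disruptions like drains and rolling node upgrades; an abrupt node shutdown is exactly the kind of failure a chaos experiment helps validate.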

But how does this translate to cost savings? Some of the biggest cost savings we can achieve on Kubernetes involve:

  • Provisioning the smallest number of nodes with the fewest computing resources possible.
  • Optimizing our deployments and using techniques like autoscaling to make the best use of available resources.
  • Using low-cost deployment models and services, like preemptible virtual machines.

Here’s how chaos engineering helps with each of these.

Capacity Planning/Right-Sizing

The biggest cost driver for a cluster is its size. The more nodes in your cluster and the more computing resources allocated to each node, the more expensive it is. Managed Kubernetes services give you near-complete control over capacity, but they also expect you to know your requirements. This includes:

  • How much total CPU and RAM your containers need and how much overhead you want to leave for traffic spikes.
  • High availability (HA) requirements, including pod replicas.
  • Communication with external services like storage, DNS and load balancing.
  • Access to node-specific hardware like GPUs.

When determining your capacity requirements, start by looking at how much CPU, RAM and disk space your production applications are currently using. This gives you your minimum required capacity. Kubernetes automatically balances pods across nodes based on available resources, but two resource-intensive pods can still end up scheduled on the same node, so add a little extra capacity to absorb situations like that. Multiply this by the number of replicas you want running (both for redundancy and for scaling), add overhead for additional replicas and traffic growth, and you have a good starting cluster size.
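As a concrete illustration, suppose kubectl top pods shows a service averaging around 200 millicores of CPU and 256MiB of memory. You might capture that observation, plus some headroom, as requests and limits on a hypothetical Deployment (all names and numbers here are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                          # hypothetical service
spec:
  replicas: 3                        # redundancy plus room for growth
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: example.com/web:1.0   # placeholder image
        resources:
          requests:
            cpu: 250m                # observed usage (~200m) plus headroom
            memory: 320Mi
          limits:
            cpu: 500m                # cap a runaway pod before it starves the node
            memory: 512Mi

Summing the requests across every replica of every workload gives a rough lower bound on the node capacity you need to provision.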

Of course, you might not need expensive, high-capacity nodes. If your workloads run fine even when nodes are close to their capacity, you could get away with less expensive nodes. Before downscaling, though, you'll want to make sure your workloads really can tolerate running with less headroom. For that, we can use chaos engineering.

Using Chaos Engineering to Plan Capacity

With chaos engineering, we can simulate the effects of a crowded or lower-capacity worker node. We can do this by using resource experiments to stress-test CPU, RAM and disk usage. For example, we could run a CPU experiment to consume 80% of the total CPU capacity on a node that our application is running on, then monitor our application for any negative effects. If there are no noticeable slowdowns, no crashes and no errors, then we know CPU isn’t a constraint and we can probably choose a less expensive node.
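If you're not using a dedicated chaos engineering tool, one rough way to approximate this experiment is to pin a stress workload to the node under test. The sketch below assumes a container image that bundles the stress-ng utility; the image and node names are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-experiment
spec:
  nodeName: worker-node-1                 # placeholder: the node under test
  restartPolicy: Never
  containers:
  - name: stress
    image: example.com/stress-ng:latest   # placeholder image providing stress-ng
    command: ["stress-ng"]
    args: ["--cpu", "0", "--cpu-load", "80", "--timeout", "300s"]   # ~80% load on every core for five minutes

While it runs, watch the application's latency, error rate and restarts. If nothing degrades, the node has more CPU headroom than you're paying for.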

Likewise, if we run a RAM (or memory) experiment that uses 20% to 50% of all available memory on a node and none of our pods are evicted or crash, then we can confidently reduce the memory capacity of our cluster and save on our monthly bill. Disk capacity is a slightly different story, since we're probably using a separate storage solution instead of our worker nodes' disks, but we can still use chaos engineering to prepare by running a disk experiment that consumes a certain amount of storage space. For example, if one of our applications is a database that's expected to grow by 100GB within the next month, we can use a disk experiment to consume 100GB in our storage solution and confirm that it can handle that amount of data. If it can't, we may need to revisit our storage choices or add an intermediate layer such as a cache.
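For the storage example, a throwaway Job that writes a large file onto the same class of volume the database uses can stand in for a disk experiment. The claim name and size below are hypothetical, and the claim should be created on the database's storage class:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-experiment
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fill
        image: busybox:1.36
        # Write roughly 100GB, then pause so the usage can be observed.
        command: ["sh", "-c", "dd if=/dev/zero of=/data/fill.bin bs=1M count=102400 && sleep 600"]
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: disk-experiment-pvc   # hypothetical claim on the database's storage class

Delete the Job and the claim afterward to release the space.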

Autoscaling

Autoscaling is an automated process that adjusts the resources available to an application or to a cluster based on demand. Kubernetes environments support two common forms: horizontal pod autoscaling and cluster (node) autoscaling.

Horizontal pod autoscaling is when a pod is replicated, resulting in two or more identical pods. This adds redundancy in case one pod fails while also letting you place replicas on other nodes to use their resources. It's handled by the horizontal pod autoscaler (HPA), which is built into Kubernetes and uses Metrics Server to compare the resource usage of your pods against a configurable target. If usage passes that target, the HPA raises the replica count so additional pods are scheduled to absorb the load.
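Here's what a minimal HPA might look like for a hypothetical web Deployment, using the autoscaling/v2 API and a CPU utilization target:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                     # hypothetical Deployment to scale
  minReplicas: 2                  # keep redundancy even at low load
  maxReplicas: 10                 # cap spend during spikes
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # add replicas when average CPU passes 70% of requests

Utilization is measured against the pods' CPU requests, which is one more reason the requests you set while right-sizing need to be accurate.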

Cluster autoscaling is when an additional worker node is added to the cluster. This increases the total amount of resources available to all workloads. Cluster autoscaling is typically handled by the cloud provider the cluster runs on, and it relies on scheduling and resource data reported by Kubernetes: nodes are added when pods can no longer be scheduled and removed when nodes sit underutilized. As with horizontal pod autoscaling, we can configure when to scale, how much to scale by and the maximum and minimum number of nodes we want provisioned at any one time.
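On managed services, you usually set the node minimum and maximum on the node pool itself through the provider's console or CLI. If you run the open source cluster autoscaler yourself, the equivalent settings are flags on its Deployment; a fragment might look roughly like this, with the node group name as a placeholder:

containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3   # match your Kubernetes minor version
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:10:my-node-group                 # min:max:node-group-name (placeholder)
  - --scale-down-utilization-threshold=0.5     # remove nodes running below 50% utilization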

Autoscaling is closely aligned with right-sizing. When we right-size our cluster, we’re optimizing it for our specific workloads and demand expectations. But demand can fluctuate wildly—just ask any online retailer during Black Friday. If we can’t respond to these fluctuations quickly, we can end up with either too much capacity (overpaying) or too little capacity (slow performance, potential failures and angry customers).

Validate Autoscaling Using Chaos Engineering

Autoscaling is a response to changes in demand. In the case of Kubernetes, our HPA monitors pod resource consumption to determine when to add replicas, while our cloud provider's cluster autoscaler watches cluster capacity to determine when to add nodes. With chaos engineering, we can deliberately increase resource consumption to trigger our autoscaling rules and validate that they work as intended.

With horizontal autoscaling, we can consume resources on a per-pod basis. We can:

  • Define the autoscaling thresholds for our ReplicaSet.
  • Run a chaos experiment to consume CPU, memory or disk space for a specific pod.
  • Monitor the HPA to ensure a new pod gets provisioned and successfully starts.

Likewise, with cluster autoscaling, we can repeat this process on a node to ensure that our cloud platform successfully adds a new node to our autoscaling group. This is important for making sure that new nodes can be allocated, set up and joined to the cluster automatically. It's also important for scaling down: When the experiment ends and we're no longer consuming resources, we should see consumption return to normal levels and the extra node automatically removed once it's no longer needed.

Preemptible Instances

A preemptible instance is a cloud compute instance that can be terminated and repurposed by the cloud provider at any time. Preemptible instances are provisioned from extra, unclaimed capacity that the provider has available. Unlike a regular instance, which continues running until you terminate it, a preemptible instance can be stopped whenever the provider needs the capacity back, giving you as little as 30 seconds to a couple of minutes to migrate your workloads and data off of the instance before it's destroyed.

Note that we’re not using “preemptible instance” in reference to any specific product or service. The term covers preemptible VMs on Google Kubernetes Engine, Spot Instances on Amazon EKS and Azure Spot node pools on Azure Kubernetes Service.

The main benefit of preemptible instances is that they’re usually much cheaper than regular instances, sometimes as much as 90% less. The tradeoff is there’s no guarantee that they’ll be running when you need them. This is especially true if you have long-living or persistent workloads like databases, where losing a node unexpectedly could mean data loss or noticeable interruptions in service. Therefore, preemptible instances work best when:

  • You have short-lived workloads, or
  • You have non-persistent workloads that you can reschedule on another node, or
  • You can automatically and reliably migrate data from a terminating node to another node.
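Once a workload meets one of these criteria, you typically steer it onto the cheaper capacity with a node label and a matching toleration in the pod template. A fragment might look like the sketch below; the label and taint keys are generic placeholders, since each provider documents its own (EKS, for example, labels managed spot nodes with eks.amazonaws.com/capacityType: SPOT):

spec:
  nodeSelector:
    node-pool: preemptible                  # placeholder label applied to the spot/preemptible node pool
  tolerations:
  - key: preemptible                        # placeholder taint that keeps other workloads off these nodes
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: worker
    image: example.com/batch-worker:1.0     # placeholder image for a short-lived, restartable workload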

Preparing for Preemptibility with Chaos Engineering

How do you know if your applications are ready for preemptible instances? You can use chaos engineering to simulate the behavior of a preemptible instance, monitor your application and use your observations to determine whether you’re ready to deploy to a preemptible instance. Since preemptible instances terminate at essentially random times, we can use a shutdown experiment to issue a shutdown command to the instance.

Some chaos engineering tools can even schedule experiments to run at random times, recreating the unpredictability of a preemptible instance termination. This way, we can recreate the conditions of a preemptible environment on our own terms, test our applications and deploy to preemptible instances in production with confidence.
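Part of passing that test is making sure pods shut down cleanly within the short preemption notice. A pod template fragment like the one below, with an illustrative grace period and preStop hook, gives the application time to finish in-flight work before the node disappears:

spec:
  terminationGracePeriodSeconds: 30         # stay within the preemption notice window
  containers:
  - name: web
    image: example.com/web:1.0              # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]  # illustrative: give the load balancer time to drain connections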



Andre Newman

Andre is a technical writer for Gremlin where he writes about the benefits and applications of Chaos Engineering. Prior to joining Gremlin, he worked as a consultant for startups and SaaS providers where he wrote on DevOps, observability, SIEM, and microservices. He has been featured in DZone, StatusCode Weekly, and Next City.
