Chaos Engineering for Stateful Kubernetes

April 6, 2021

by Uri Zaidenwerg

Stateful Kubernetes (K8s) is gaining market traction. According to the latest CNCF survey, 55% of respondents use stateful applications in containers in production. Another 12% are evaluating them, and 11% plan to use them in the next 12 months. Those numbers point to a promising future for stateful containerized applications. The drivers of this adoption are K8s’ simplicity and resiliency, as well as how well it handles ephemeral, stateless applications.

To level the playing field for stateful workloads, StatefulSets and the Container Storage Interface (CSI) were created. With these tools and others, K8s is ready for stateful applications. Well, almost. You see, storage is a challenge. It is complicated, geographically restricted and vendor-defined. It requires infrastructure, and someone has to learn how to operate that infrastructure.

But most people who choose K8s do so for other reasons, such as how easily it deploys stateless applications, and they have more pressing concerns than learning the wonders of storage.

As with other platforms, if there is a demand, solutions will quickly follow. So, in the last few years, the market has adopted managed services and vendor-agnostic, software-defined storage (SDS) solutions that give stateful applications access to data by taking it out of the container’s ephemeral life cycle and making it available (highly available, if possible) and resilient. But is your K8s stateful deployment ready for production?

You Can Run Stateful Containers! Congrats!

If you’re running a single cluster on a single cloud in a single region, give yourself a pat on the back; you don’t need to read any further, and I wish you all the best.

But if the words ‘vendor lock-in,’ ‘application mobility,’ ‘data gravity,’ ‘multi-cluster,’ ‘cross-region’ or ‘sync-replication’ exist on your road map, you should keep reading, and make sure your stateful applications are ready to run anywhere, anytime.

Orchestrating Chaos For Good

When I first learned about chaos engineering, it just made sense. As cloud services stack on top of each other and the layers of abstraction multiply, it becomes increasingly difficult to predict every scenario that might affect operations. Since you can’t predict the future, applying chaos engineering principles to test for infrastructure, network and application failures is the practical alternative.

What is Chaos Engineering?

Chaos engineering is the process of experimenting on a software system in production to build confidence in the system’s ability to withstand failures, condition changes and changes in different services. It’s not chaotic in the least; in fact, chaos engineering is a monitored experimental process that has clear goals, including:

Identifying systemic weaknesses in production environments before they cause outages
Testing fallback mechanisms like replication
Preparing for failure
Building confidence in a system to assure service-level agreements (SLAs) are met
Understanding dependencies
Predicting performance consistency
Creating a recovery process and policies
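
As a rough illustration of what a "monitored experimental process" might look like, here is a minimal Python sketch of one experiment: check a steady-state hypothesis, inject a fault, check the hypothesis again, and always roll back. Everything in it (`run_experiment`, the `replicas` dictionary, the "at least two replicas ready" hypothesis) is a hypothetical stand-in for a real cluster, not the API of any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_after: bool
    ran_fault: bool

    @property
    def passed(self) -> bool:
        # The hypothesis must hold both before and after the fault.
        return self.steady_before and self.steady_after


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Do not inject faults into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(False, False, False)
    try:
        inject_fault()
        steady_after = steady_state()
    finally:
        # Always undo the fault, even if the check itself raised.
        rollback()
    return ExperimentResult(True, steady_after, True)


# Toy system (assumption): three database replicas, all ready.
replicas = {"db-0": True, "db-1": True, "db-2": True}


def at_least_two_ready() -> bool:
    return sum(replicas.values()) >= 2


def kill_db_1() -> None:
    replicas["db-1"] = False


def restore_db_1() -> None:
    replicas["db-1"] = True


result = run_experiment(at_least_two_ready, kill_db_1, restore_db_1)
print(result.passed)
```

In a real setup the three callables would talk to the cluster and its monitoring; the point of the sketch is only the order of the steps and the guaranteed rollback.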

You are probably asking yourself, “Why? I can create a resilient cluster, add a couple more pods for resiliency’s sake and K8s will take care of the rest!” This is mostly true for stateless applications, but stateful applications require testing for persistent storage, networking and replication (and, preferably, data integrity and consistency) to gain the same level of application resiliency assurance.
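
To make the data integrity part concrete, the following Python sketch records a checksum for every acknowledged write, simulates losing the primary copy, and then verifies that the surviving replica still holds each acknowledged record intact. `ToyReplicatedStore` is a hypothetical in-memory stand-in for a synchronously replicated volume; a real test would write through the application and read back from the storage layer after an actual failover.

```python
import hashlib
from typing import Dict, List


def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


class ToyReplicatedStore:
    """Hypothetical in-memory model of a synchronously replicated volume."""

    def __init__(self) -> None:
        self.primary: Dict[str, bytes] = {}
        self.replica: Dict[str, bytes] = {}
        self.replication_up = True

    def write(self, key: str, payload: bytes) -> bool:
        # Synchronous replication: acknowledge only if both copies are written.
        if not self.replication_up:
            return False
        self.primary[key] = payload
        self.replica[key] = payload
        return True

    def lose_primary(self) -> None:
        # Simulated failure: the primary copy is gone, only the replica remains.
        self.primary = {}


def verify_survivor(store: ToyReplicatedStore, acknowledged: Dict[str, str]) -> List[str]:
    """Return the keys that are missing or corrupted on the surviving replica."""
    damaged = []
    for key, expected in acknowledged.items():
        payload = store.replica.get(key)
        if payload is None or checksum(payload) != expected:
            damaged.append(key)
    return damaged


store = ToyReplicatedStore()
acknowledged: Dict[str, str] = {}

for i in range(5):
    key, payload = f"order-{i}", f"payload-{i}".encode()
    if store.write(key, payload):
        acknowledged[key] = checksum(payload)

# Break replication: this write is rejected, so it is never recorded as acknowledged.
store.replication_up = False
rejected = store.write("order-5", b"payload-5")

store.lose_primary()
damaged = verify_survivor(store, acknowledged)
print(len(acknowledged), rejected, damaged)
```

The design choice worth noting is that only acknowledged writes are checked: the test asserts what the system promised, not what the client attempted.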

Production Grade Stateful Kubernetes

Kubernetes is probably the best container orchestration system you can use, and there are decent Kubernetes storage solutions and integrations out there. However, when it comes to stateful distributed applications, a lot can go wrong. To truly know if you have a production-grade Kubernetes deployment, you must test it live. Chaos engineering not only monitors and logs how your system reacts to failures, it also injects those failures randomly. You’re ‘blindfolded,’ and since you don’t know what will break next, the experiment mimics real-life scenarios and prepares you for anything.
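
One way to keep the ‘blindfolded’ quality while still being able to replay a failing run is to draw faults from a seeded random generator and log the seed. The Python sketch below assumes illustrative fault labels and pod names (`FAULTS`, `TARGETS`) that do not come from any specific tool; it only plans experiments and does not touch a cluster.

```python
import random
from typing import List, Tuple

# Illustrative labels only (assumptions), not identifiers from a real chaos tool.
FAULTS = ["pod-kill", "network-partition", "disk-io-delay"]
TARGETS = ["db-0", "db-1", "db-2"]


def plan_experiments(rounds: int, seed: int) -> List[Tuple[str, str]]:
    # A dedicated generator keeps the plan reproducible for a given seed.
    rng = random.Random(seed)
    return [(rng.choice(FAULTS), rng.choice(TARGETS)) for _ in range(rounds)]


seed = 2021
plan = plan_experiments(rounds=4, seed=seed)

# Logging the seed lets the exact same sequence be replayed after a failure.
print(f"seed={seed}")
for fault, target in plan:
    print(f"inject {fault} into {target}")
```

Operators running the experiment still do not know in advance what will break, but a surprising result can be reproduced by rerunning with the logged seed.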

To implement these kinds of tests, there are many new tools available, like Chaosk8s, Chaos Mesh and the Litmus framework. These tools are widely adopted across the industry, and can help you run and monitor chaos engineering experiments.

The Goal: Sleeping Well at Night

A wise database analyst once told me, “A backup is only truly reliable once you’ve successfully used it to restore your database.” The same goes for storage, orchestration and other high-availability solutions you’d use to keep your system up and running. Even if you are using the industry’s leading infrastructure solutions and services, the only way to be confident they work and keep your production environment alive is to test them frequently. Applying chaos engineering will dramatically increase the variety of test scenarios and decrease the potential for encountering unexpected, unknown risks. It will also decrease the chances that a midnight page will disturb your sleep.