Chaos Engineering for Stateful Kubernetes

Stateful Kubernetes (K8s) is gaining market traction. According to the latest CNCF survey, 55% of respondents run stateful applications in containers in production, another 12% are evaluating them, and 11% plan to use them in the next 12 months. Those numbers suggest a promising future for stateful containerized applications. The drivers of this adoption are K8s’ simplicity and resiliency, along with how well it handles the ephemeral life cycle of stateless applications.

To level the playing field for stateful workloads, StatefulSets and the Container Storage Interface (CSI) were created. With these tools and others, K8s is ready for stateful applications. Well, almost. You see, storage is a challenge. Storage is complicated, geographically restrictive, and vendor-defined. It requires infrastructure, and it requires you to learn how to operate it.

But most people who choose K8s do so for other reasons, like the ease of deploying stateless applications, and they have more pressing concerns than learning the wonders of storage.

As with other platforms, where there is demand, solutions quickly follow. So, in the last few years, the market has adopted managed services and vendor-agnostic, software-defined storage (SDS) solutions that give stateful applications access to data, taking it out of the container’s ephemeral life cycle and making it available (highly available, if possible) and resilient. But is your K8s stateful deployment ready for production?

You Can Run Stateful Containers! Congrats!

If you’re running a single cluster in a single cloud in a single region, give yourself a pat on the back; you don’t need to read any further, and I wish you all the best.

But if the words ‘vendor lock-in,’ ‘application mobility,’ ‘data gravity,’ ‘multi-cluster,’ ‘cross-region’ or ‘sync-replication’ exist on your road map, you should keep reading, and make sure your stateful applications are ready to run anywhere, anytime.

Orchestrating Chaos For Good

When I first learned about chaos engineering, it just made sense. As cloud services stack on top of each other and the layers of abstraction multiply, it’s increasingly difficult to predict every possible scenario that might affect operations. Since you can’t predict the future, using chaos engineering principles to test for infrastructure failures, network failures, and application failures is the sensible response.

What is Chaos Engineering?

Chaos engineering is the process of experimenting on a software system in production to build confidence in the system’s ability to withstand failures, changing conditions, and changes in the services it depends on. It’s not chaotic in the least; in fact, chaos engineering is a monitored experimental process that has clear goals, including:

  • Identifying systemic weaknesses in production environments before they cause failures
  • Testing fallback mechanisms like replication
  • Preparing for failure
  • Building confidence in a system to assure service-level agreements (SLAs) are met
  • Understanding dependencies
  • Predicting performance consistency
  • Creating a recovery process and policies

You are probably asking yourself, “Why? I can create a resilient cluster, add a couple more pods for resiliency’s sake and K8s will take care of the rest!” This is mostly true for stateless applications, but stateful applications require testing for persistent storage, networking and replication (and, preferably, data integrity and consistency) to gain the same level of application resiliency assurance.
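To make the stateful case concrete, here is a minimal sketch of such an experiment in Python. It does not touch a real cluster: `Replica` and `ReplicatedStore` are hypothetical in-memory stand-ins for a synchronously replicated volume, and the replica names are invented. The point is only the shape of the test: record checksums in steady state, take a replica down, then check that every record is still readable and unchanged.

```python
import hashlib
import random


class Replica:
    """Hypothetical in-memory stand-in for one storage replica."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}


class ReplicatedStore:
    """Toy synchronous replication: every write goes to all live replicas."""

    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]

    def write(self, key, value):
        for replica in self.replicas:
            if replica.alive:
                replica.data[key] = value

    def read(self, key):
        # Serve the read from the first live replica that holds the key.
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


def checksum(value):
    return hashlib.sha256(value).hexdigest()


def run_experiment(store, rng):
    # 1. Steady state: write records and remember their checksums.
    expected = {}
    for i in range(100):
        key, value = f"order-{i}", f"payload-{i}".encode()
        store.write(key, value)
        expected[key] = checksum(value)

    # 2. Inject a failure: take down one randomly chosen replica.
    victim = rng.choice(store.replicas)
    victim.alive = False

    # 3. Verify: every record must still be readable and unchanged.
    corrupted = [
        key for key, digest in expected.items()
        if checksum(store.read(key)) != digest
    ]
    return victim.name, corrupted


store = ReplicatedStore(["zone-a", "zone-b", "zone-c"])
victim, corrupted = run_experiment(store, random.Random(7))
print(f"killed {victim}; corrupted or missing records: {len(corrupted)}")
```

In a real deployment, the write and read steps would go through the application, and the failure would be injected by a chaos tool rather than by flipping a flag, but the three-step structure stays the same.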

Production Grade Stateful Kubernetes

Kubernetes is probably the best container orchestration system you can use, and there are decent Kubernetes storage solutions and integrations out there. However, when it comes to stateful distributed applications, a lot can go wrong. To truly know if you have a production-grade Kubernetes deployment, you must test it live. Chaos engineering tests not only monitor and log how your system reacts to failures, they also inject those failures randomly. You’re ‘blindfolded,’ and since you don’t know what will break next, the tests mimic real-life scenarios and prepare you for anything.
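The ‘blindfolded’ idea can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the fault and target names are invented, and `inject` and `is_healthy` are placeholder callbacks standing in for whatever your chaos tool and monitoring provide. The sketch picks a random fault and target for each round, logs the outcome, and uses a seed so that a surprising run can be replayed afterwards.

```python
import random

# Hypothetical fault catalog and targets; not tied to any specific tool.
FAULTS = ["kill-pod", "detach-volume", "partition-network", "fill-disk"]
TARGETS = ["db-0", "db-1", "db-2"]


def plan_experiments(rounds, seed):
    """Pick a random (fault, target) pair for each round.

    The seed makes a run reproducible once you know which seed was used.
    """
    rng = random.Random(seed)
    return [(rng.choice(FAULTS), rng.choice(TARGETS)) for _ in range(rounds)]


def run(plan, inject, is_healthy):
    """Inject each planned fault, then log whether the system recovered."""
    log = []
    for fault, target in plan:
        inject(fault, target)
        log.append({"fault": fault, "target": target, "recovered": is_healthy()})
    return log


# Stub callbacks so the sketch runs on its own.
injected = []
plan = plan_experiments(rounds=5, seed=42)
report = run(
    plan,
    inject=lambda fault, target: injected.append((fault, target)),
    is_healthy=lambda: True,
)
for entry in report:
    print(entry)
```

Keeping the plan separate from the execution is a deliberate choice: the same seed yields the same sequence of faults, which helps when you need to reproduce a failure the random run uncovered.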

To implement these kinds of tests, there are many new tools available, like Chaosk8s, Chaos Mesh, and the Litmus framework. These tools are widely adopted across the industry, and can help you run and monitor chaos engineering experiments.

The Goal: Sleeping Well at Night

A wise database analyst once told me, “A backup is only truly reliable once you’ve successfully used it to restore your database.” The same goes for storage, orchestration and other high-availability solutions you’d use to keep your system up and running. Even if you are using the industry’s leading infrastructure solutions and services, the only way to be confident they work and keep your production environment alive is to test them frequently. Applying chaos engineering will dramatically increase the variety of test scenarios and decrease the potential for encountering unexpected, unknown risks. It will also decrease the chances that a midnight page will disturb your sleep.

Uri Zaidenwerg

Uri Zaidenwerg is the DevOps Lead at Replix, a multicloud data service for Kubernetes. A seasoned DevOps engineer, Uri’s expertise covers various stacks of applications along with a deep strategic understanding of technology and processes, both in development and production. He is an alumnus of the Israeli military’s elite Center of Computing and Information Systems unit, MAMRAM.
