3 CNCF Tools For Cloud-Native Chaos Engineering

When software is released into the world, it takes on a life of its own. It’s often hard to predict how people will use it, and even harder to predict how they will abuse it. One could say that the only thing you can count on is … chaos!

Chaos engineering is the practice of intentionally putting software systems through the wringer to see how they respond to unforeseen events. For example, you might flood your APIs with malformed requests to see what fails. Or, perhaps you push your server resources to the very limit. You could introduce latency, detach core dependencies or hammer your site with sudden traffic surges to see what crashes.
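To make the idea concrete, here is a minimal Python sketch of fault injection at the function level. It is not tied to any of the tools below, and the names `with_chaos`, `failure_rate`, `max_delay` and `fetch_price` are made up for illustration. It simply wraps a dependency so that calls are randomly delayed or fail, which is roughly what the platforms in this article do at the infrastructure level.

```python
import random
import time


def with_chaos(func, failure_rate=0.2, max_delay=0.05, rng=None):
    """Wrap func so calls are randomly delayed or made to fail.

    All names and defaults here are hypothetical, for illustration only.
    """
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # Inject latency: sleep for a random time up to max_delay seconds.
        time.sleep(rng.uniform(0, max_delay))
        # Inject a failure: raise instead of calling the real function.
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    # Stand-in for a real dependency such as a downstream API call.
    return {"item": item, "price": 42}


def count_failures(calls=100, seed=1):
    """Call the wrapped dependency repeatedly and count injected faults."""
    flaky = with_chaos(fetch_price, failure_rate=0.3, max_delay=0.001,
                       rng=random.Random(seed))
    failures = 0
    for _ in range(calls):
        try:
            flaky("widget")
        except ConnectionError:
            failures += 1
    return failures


if __name__ == "__main__":
    print("injected failures:", count_failures())
```

The interesting part of a real experiment is what the caller does when the fault fires: retry, fall back, or crash.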

Vigorous testing is crucial for cloud-native architectures and microservices-based applications running on platforms like Kubernetes, where infrastructure is supposed to dynamically recover after failure. And thankfully, there are some great tools out there to help you invoke chaos engineering without altering your production state. Below, we’ll look at three open source chaos engineering projects hosted by CNCF that you can use to quickly run experiments on your cloud-native architecture.

Chaos Mesh

A chaos engineering platform for Kubernetes.


Want to test the limits of your Kubernetes deployment? Look no further than Chaos Mesh to perform chaos engineering on your production Kubernetes clusters. Chaos Mesh defines its experiments as Kubernetes CustomResourceDefinitions (CRDs) and can be installed with a single script, so you can get started quickly.

curl -sSL https://mirrors.chaos-mesh.org/v2.3.0/install.sh | bash

Using Chaos Mesh, operators can perform fault injection on the network, disk, file system, operating system and other areas. Experiments can either be created in a user-friendly GUI or initiated using a YAML file.

For example, you could use Chaos Mesh to simulate a stress test inside containers. The configuration below defines a sample StressChaos experiment that continually reads and writes memory, consuming up to 256MB. Fields can easily be changed to adjust the duration, pod, size and other factors.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-example
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      'app': 'app1'
  stressors:
    memory:
      workers: 4
      size: '256MB'
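Since the manifest is plain data, those adjustments can also be scripted. The following Python sketch rebuilds the manifest above with different values and prints it as JSON, which kubectl accepts alongside YAML. The helper name `stress_chaos` and the example values (`512MB`, `30s`) are assumptions for illustration, and `duration` is treated here as an optional field; check the Chaos Mesh documentation for the fields your version supports.

```python
import json


def stress_chaos(name, app, size="256MB", workers=4, duration=None,
                 namespace="chaos-testing"):
    """Return a StressChaos manifest as a dict, mirroring the YAML above."""
    spec = {
        "mode": "one",
        "selector": {"labelSelectors": {"app": app}},
        "stressors": {"memory": {"workers": workers, "size": size}},
    }
    if duration is not None:
        # Assumed optional field: how long the stress should last.
        spec["duration"] = duration
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    # Hypothetical variation: more memory, limited to 30 seconds.
    manifest = stress_chaos("memory-stress-example", "app1",
                            size="512MB", duration="30s")
    print(json.dumps(manifest, indent=2))
```

Saved to a file, the output could be applied with `kubectl apply -f` like any other manifest.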

What’s cool is that you can use Chaos Mesh to schedule recurring experiments. For example, this YAML snippet from the documentation demonstrates how to configure Chaos Mesh to run a NetworkChaos experiment five minutes past every hour. On each run, this particular experiment adds 10ms of network latency for 12 seconds.

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: schedule-delay-example
spec:
  schedule: '5 * * * *'
  historyLimit: 2
  concurrencyPolicy: 'Allow'
  type: 'NetworkChaos'
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'web-show'
    delay:
      latency: '10ms'
    duration: '12s'
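The `schedule` value above is written in five-field cron syntax (minute, hour, day of month, month, day of week), so '5 * * * *' fires whenever the minute equals 5. The Python sketch below illustrates that for this one simple case. `next_runs` is a hypothetical helper that only handles a fixed minute with wildcards in every other field; it is not a general cron parser and is not part of Chaos Mesh.

```python
from datetime import datetime, timedelta


def next_runs(schedule, start, count=3):
    """List the next `count` run times for a 'M * * * *' style schedule."""
    minute, *rest = schedule.split()
    if rest != ["*", "*", "*", "*"] or not minute.isdigit():
        raise ValueError("this sketch only handles 'M * * * *' schedules")
    minute = int(minute)
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")

    # First candidate: minute M of the current hour.
    run = start.replace(minute=minute, second=0, microsecond=0)
    if run <= start:
        # Already passed this hour, so move to the next one.
        run += timedelta(hours=1)

    runs = []
    for _ in range(count):
        runs.append(run)
        run += timedelta(hours=1)
    return runs


if __name__ == "__main__":
    for run in next_runs("5 * * * *", datetime(2024, 1, 1, 9, 30)):
        print(run.isoformat())
```

Starting from 09:30, the script lists 10:05, 11:05 and 12:05 on the same day.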

Using Chaos Mesh, there’s no need to change your deployment logic to perform chaos experiments. You can observe the behavior in real-time and, if it’s really going haywire, you can quickly roll back failures. The platform also supports RBAC as well as blacklisting and whitelisting to help protect the experimentation process itself from abuse. At the time of writing, Chaos Mesh is an open source incubating project with the CNCF.

Litmus

Helps SREs and developers practice chaos engineering in a cloud-native way.


Litmus is an open source chaos engineering project aimed at SREs who want to push their cloud-native architecture to the limits. Compared to Chaos Mesh, Litmus is a bit larger in scope, enabling developers to perform tests on many environments, including the Kubernetes platform, Kubernetes apps, cloud platforms, bare metal, legacy applications and virtual machines.

Litmus is easy to install using Helm:

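helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace=litmus --create-namespace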

Once installed, engineers can choose a chaos scenario from a number of pre-defined Litmus Workflows. ChaosHub is an open marketplace hosting many Litmus experiments to run chaos on various infrastructures. Litmus can also chain experiments into sequences, so you can wreak as much havoc as you like.

For example, the documentation showcases using the Litmus user interface to install an application, perform a chaos experiment on it, uninstall the application and revert the chaos.

Using Litmus, engineers can also create custom workflows and schedule them to run on a regular basis. For a free, open source tool, Litmus is surprisingly comprehensive, offering a feature-rich platform with a SaaS-like console.

ChaosBlade

A powerful chaos engineering experiment toolkit.


ChaosBlade is another toolkit that can help DevOps engineers and SREs perform chaos experiments on their cloud-native systems. Originally produced at Alibaba and open sourced in 2019, ChaosBlade joined the CNCF in 2021 and is currently a sandbox project. The package includes two main components: the chaos engineering experimental tool, ChaosBlade, and a chaos engineering platform, ChaosBlade-Box.

Using ChaosBlade, engineers can perform experiments through a unified interface. The platform brings an assortment of features to help experiment with resource fluctuations pertaining to the CPU, memory, network, disk, process, kernel or files. Like the tools above, ChaosBlade also supports running chaos experiments on an automated, recurring basis.

Revel In Chaos

Unexpected, turbulent conditions are bound to arise from time to time. With that in mind, it’s best to be prepared. Chaos engineering brings many benefits to modern cloud-native operators, helping expose bugs or bottlenecks in a system. By testing your architecture early on, your team can also practice how you respond to unforeseen problems.

Above, we’ve covered three impressive chaos engineering platforms, which are all free and open source, hosted under the CNCF. Of course, these are not the only chaos engineering options out there. Some other open source chaos engineering projects include Chaos Toolkit, chaoskube and PowerfulSeal.

Bill Doerrfeld

Bill Doerrfeld is a tech journalist and analyst. His beat is cloud technologies, specifically the web API economy. He began researching APIs as an Associate Editor at ProgrammableWeb, and since 2015 has been the Editor at Nordic APIs, a high-impact blog on API strategy for providers. He loves discovering new trends, interviewing key contributors, and researching new technology. He also gets out into the world to speak occasionally.
