Best of 2019: 5 Ways to Chaos Test Kubernetes

As we close out 2019, we at Container Journal wanted to highlight the five most popular articles of the year. Following is the fourth in our weeklong series of the Best of 2019.

One of the hottest trends in the software industry is Kubernetes, a container orchestration platform that has changed how teams think about online architecture and application deployment. In the early days of the internet, organizations had to run a separate web server for each website. Then came virtualization, where a single piece of hardware could host multiple operating systems and deployments.

With containers, the architecture dependencies are simplified: containers share the host’s operating system kernel, so many isolated applications can run on one machine without each needing its own operating system or virtualized hardware. Kubernetes then schedules, scales and restarts those containers across a cluster of machines. Companies across the globe are seeing the benefits of this approach to cloud computing, especially when it comes to testing and improving their software quality.

But downtime, outages and vulnerabilities are still possible in a Kubernetes environment because modern microservice-based applications are complex. Conventional quality assurance testing is good at catching bugs and product deficiencies, but it can’t always predict the kinds of incidents that will occur in production. That’s where chaos engineering adds real value: by deliberately injecting failures, it exposes weaknesses before they cause real incidents.

In this article, we’ll offer five ways to start using chaos testing within a Kubernetes environment.

1. Introduce Network Failure

In many cases, a Kubernetes cluster is hosted by a cloud provider that manages its own data centers. This makes your containers heavily dependent on a complex network infrastructure. You will want to ensure that if a piece of network equipment fails or is misconfigured, your applications recover in a timely manner.

To launch the chaos test, identify a network node that feeds into your Kubernetes environment and then either shut it down or change its IP address so that it becomes unreachable. Then watch how your containers and applications react: which requests time out, how long retries take and whether dependent services degrade gracefully. Use the data you gather to set sensible timeouts, retries and fallback behavior for your pods so they recover cleanly when the network fails.
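Shutting down physical network equipment is not always possible on a managed cloud cluster. One substitute, swapped in here rather than taken from the steps above, is a Kubernetes NetworkPolicy that denies all outbound traffic from a labelled set of pods. The sketch below builds such a manifest in Python and prints it as JSON, which `kubectl apply -f` accepts. The policy name, namespace and `app` label are hypothetical, and the policy only takes effect if your cluster’s network plugin enforces NetworkPolicy.

```python
import json


def deny_egress_policy(name: str, namespace: str, app_label: str) -> dict:
    """Build a NetworkPolicy that blocks all outbound traffic from pods
    labelled app=<app_label>, simulating a network partition for them."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Only pods carrying this label are cut off.
            "podSelector": {"matchLabels": {"app": app_label}},
            # Listing Egress with no egress rules denies all outbound traffic.
            "policyTypes": ["Egress"],
            "egress": [],
        },
    }


if __name__ == "__main__":
    # Hypothetical names: replace with the workload you want to isolate.
    policy = deny_egress_policy("chaos-partition", "staging", "checkout")
    print(json.dumps(policy, indent=2))
```

Save the output to a file, apply it with `kubectl apply -f`, observe, then end the experiment with `kubectl delete -f`. Because the policy blocks all egress, it also blocks the affected pods’ DNS lookups, so expect that failure mode to show up in the results as well.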

2. Max Out Resources

One of the big benefits of a Kubernetes architecture is resource isolation: applications running in separate containers share a node’s computing power, but resource requests and limits keep their performance predictable. That changes if the hardware behind the cluster starts to fail or runs out of resources.
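The scheduler places a pod on a node only if the CPU requested by the pods on that node, including the new one, fits within the node’s allocatable CPU; limits then cap how much a container may actually consume. The simplified sketch below shows that arithmetic for CPU quantities such as `250m` (a quarter of a core). It handles only plain and millicore values and ignores memory and every other scheduling rule, so treat it as an illustration rather than the scheduler’s real logic.

```python
from decimal import Decimal


def parse_cpu(quantity: str) -> Decimal:
    """Convert a Kubernetes CPU quantity such as "250m", "1" or "0.5" to cores.

    Only plain numbers and the "m" (millicore) suffix are handled here.
    """
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def fits_on_node(allocatable: str, requests: list[str]) -> bool:
    """Return True if the summed CPU requests fit within the node's allocatable CPU."""
    total = sum((parse_cpu(r) for r in requests), Decimal(0))
    return total <= parse_cpu(allocatable)


# A 2-core node can hold pods requesting 500m + 1 + 500m, but not 1500m + 600m.
print(fits_on_node("2", ["500m", "1", "500m"]))  # True
print(fits_on_node("2", ["1500m", "600m"]))      # False
```

Decimal arithmetic keeps millicore sums exact, so a node that is filled to exactly its allocatable CPU is still reported as a fit.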

For this chaos test, simulate an individual server reaching CPU exhaustion and then becoming unavailable for a period of time. Containers that hit their CPU limit are throttled, and once the node itself becomes unavailable, Kubernetes eventually reschedules pods managed by a controller (such as a Deployment) onto healthy nodes. When you analyze the test results, it is a good time to set up Kubernetes health checks and resource alerts that warn you before resource exhaustion turns into an outage.
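As a sketch of those health checks, the Python below builds a Pod manifest with resource requests and limits plus liveness and readiness probes. The image, the `/healthz` and `/ready` paths, port 8080 and all numeric values are hypothetical placeholders. Probes restart containers or stop routing traffic to them rather than send alerts, so pair them with whatever monitoring you already use for warnings about resource exhaustion.

```python
import json


def app_pod(name: str, image: str) -> dict:
    """Pod manifest with resource bounds and health probes (values are placeholders)."""
    container = {
        "name": name,
        "image": image,
        "resources": {
            # Requests drive scheduling; limits cap usage (CPU is throttled,
            # a container exceeding its memory limit is OOM-killed).
            "requests": {"cpu": "250m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
        # Failing liveness checks cause the kubelet to restart the container.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": 8080},
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
        # Failing readiness checks remove the pod from Service endpoints
        # until it recovers, so it stops receiving traffic.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": 8080},
            "periodSeconds": 5,
            "timeoutSeconds": 2,
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {"containers": [container]},
    }


if __name__ == "__main__":
    # Hypothetical workload name and image.
    print(json.dumps(app_pod("checkout", "registry.example.com/checkout:1.0"), indent=2))
```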

3. Simulate Global Traffic

One mistake that many software developers make is testing their applications only with local, internal traffic. This is not realistic: when a product goes live on the internet, it becomes accessible to people all over the world with varying internet speeds.

For this reason, when running load tests on a Kubernetes cluster, route them through a virtual private network (VPN) so that you can choose a region or country for each test and obtain an IP address from that area. While there are several providers to choose from, consider a VPN service based on the OpenVPN protocol. OpenVPN is fairly straightforward to implement, can run inside a Kubernetes cluster and, if configured properly, will let you reach pods over a private network.

From there, look for a provider that has servers in many different countries. This lets you track performance across the globe and decide whether you should replicate your Kubernetes configuration to multiple data centers. With a VPN installed, switching the source region and IP address between tests is straightforward.
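Once load tests have run from several VPN regions, a short script can flag the regions where latency is worst. The sketch below computes a nearest-rank percentile per region and lists the regions whose 95th-percentile latency exceeds a target. The region names, the 95th-percentile choice and the target are assumptions for illustration; the latency samples would come from your own load-testing tool.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def regions_over_target(results: dict[str, list[float]], target_ms: float,
                        pct: float = 95) -> list[str]:
    """Return the regions whose pct-th percentile latency exceeds target_ms."""
    return sorted(region for region, samples in results.items()
                  if percentile(samples, pct) > target_ms)
```

A region that keeps appearing in this list is a candidate for replicating your cluster configuration closer to the users it serves. Percentiles are used instead of averages because a few very slow requests can hide behind a healthy-looking mean.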

VPNs can also increase the security of your clustered environment. Access to the Kubernetes API and other configuration endpoints should be restricted to a specific group of users and IP addresses, reducing the chance that an attacker can infiltrate your systems and launch an attack.
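Restricting access by IP address comes down to checking a client address against a list of allowed CIDR ranges. The sketch below shows that check with Python’s standard `ipaddress` module; the ranges are hypothetical examples. In Kubernetes itself you would normally express the same rule declaratively, for instance with the `loadBalancerSourceRanges` field on a LoadBalancer Service where the platform supports it, or with the API server access restrictions offered by many managed Kubernetes services.

```python
import ipaddress


def is_allowed(client_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR ranges.

    An IPv4 address never matches an IPv6 range (and vice versa).
    """
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical allowlist: an office range and the VPN's address range.
ALLOWED = ["198.51.100.0/24", "10.8.0.0/16"]
print(is_allowed("10.8.3.14", ALLOWED))    # True
print(is_allowed("203.0.113.7", ALLOWED))  # False
```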

4. Block DNS

Nearly every web request depends on the Domain Name System (DNS), which translates hostnames into IP addresses. Inside a cluster, Kubernetes runs its own DNS service (typically CoreDNS) so that pods can find Services by name, which means most traffic between microservices, as well as many outgoing requests, depends on successful lookups.
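Cluster DNS gives each Service a name of the form `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain defaults to `cluster.local`. Pods using the default DNS policy also get search domains and an `ndots:5` option, so short names are expanded before being sent to DNS. The sketch below models that expansion in simplified form (it ignores any search domains inherited from the node), which helps explain why a single DNS outage can break so many lookups at once.

```python
def lookup_candidates(name: str, namespace: str,
                      cluster_domain: str = "cluster.local", ndots: int = 5) -> list[str]:
    """Names a pod's resolver would try, in order, for `name` (simplified model)."""
    if name.endswith("."):
        return [name]  # A trailing dot marks an absolute name: no search list.
    search = [f"{namespace}.svc.{cluster_domain}",
              f"svc.{cluster_domain}",
              cluster_domain]
    expanded = [f"{name}.{domain}" for domain in search]
    # Names with at least `ndots` dots are tried as-is first, others last.
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]


print(lookup_candidates("db", "prod")[0])
# db.prod.svc.cluster.local
print(len(lookup_candidates("api.example.com", "prod")))
# 4: up to three search-domain expansions are tried before api.example.com itself
```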

Simulating an internal DNS failure can be a valuable chaos test, as it shows how your Kubernetes cluster reacts when DNS lookups are unavailable. You may see a sudden drop in incoming traffic, and new pods may fail to start cleanly if they cannot resolve the services they depend on. As a takeaway, add DNS logging to your environment to help with debugging in the future.
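DNS logging can start at the application level. The sketch below wraps a lookup so that every failure is logged with its duration, and falls back to the last address that resolved successfully. The cache-based fallback is an assumption added for illustration, not something Kubernetes DNS does for you, and serving a stale address can itself cause errors if the target’s IP has changed. For cluster-wide visibility, CoreDNS also offers a `log` plugin that records queries.

```python
import logging
import socket
import time
from typing import Callable

log = logging.getLogger("dns")


def resolve(host: str, cache: dict[str, str],
            resolver: Callable[[str], str] = socket.gethostbyname) -> str:
    """Resolve host, logging failures and falling back to the last known address."""
    start = time.monotonic()
    try:
        addr = resolver(host)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        elapsed_ms = (time.monotonic() - start) * 1000
        log.warning("DNS lookup for %s failed after %.1f ms: %s", host, elapsed_ms, exc)
        if host in cache:
            log.warning("using last known address %s for %s", cache[host], host)
            return cache[host]
        raise
    log.debug("resolved %s to %s", host, addr)
    cache[host] = addr
    return addr
```

During the chaos test, pass a resolver that raises `socket.gaierror` to confirm that the warnings appear and that the fallback behaves the way you expect.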

5. Overload Storage

A Kubernetes workload typically relies on redundant local drives or on networked storage attached through persistent volumes. Pods can be load-balanced across nodes, but a single drive failure can still cause problems for every workload that depends on the same hardware.

To simulate overloaded storage in a chaos test, use a tool that adds latency to local drives so that the cluster sees its storage as slow and unreliable. The results may show a need for dynamic volume provisioning, which lets Kubernetes create storage on demand when a workload requests it through a PersistentVolumeClaim, instead of waiting for an administrator to provision it during an incident.
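To check whether injected latency is actually visible to an application, you can time small synchronous writes on the volume before and during the experiment. The sketch below writes a few kilobytes repeatedly, calling `fsync` after each write so the data must reach the device, and reports the median and maximum latency. The `PROBE_DIR` environment variable and the default write count and size are hypothetical choices; point the probe at the mount path of the volume under test.

```python
import os
import statistics
import tempfile
import time


def fsync_latencies_ms(directory: str, writes: int = 20, size: int = 4096) -> list[float]:
    """Time `writes` synchronous writes of `size` bytes in `directory`, in milliseconds."""
    payload = os.urandom(size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # Force the data to the device so injected latency shows up.
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.remove(path)  # Leave no probe files behind on the volume.
    return latencies


if __name__ == "__main__":
    target = os.environ.get("PROBE_DIR", tempfile.gettempdir())
    samples = fsync_latencies_ms(target)
    print(f"median {statistics.median(samples):.2f} ms, max {max(samples):.2f} ms")
```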

It’s also important to test storage redundancy if you operate your Kubernetes cluster across multiple regions. In the event of a natural disaster or major cloud outage, your load balancers should automatically reroute traffic to a healthy redundant host so that the impact on external users is minimal.
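Rerouting of this kind usually lives in a global load balancer or DNS failover service, but its core decision is simple: send traffic to the highest-priority region that passes its health check. The sketch below models that decision so a chaos test can assert on it. The region names and the health-check callable are hypothetical, and real load balancers add details such as consecutive-failure thresholds to avoid flapping between regions.

```python
from typing import Callable


def pick_region(priority: list[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first region in priority order whose health check passes."""
    for region in priority:
        if is_healthy(region):
            return region
    raise RuntimeError("no healthy region available")


# Simulate an outage in the primary region and confirm traffic moves.
down = {"us-east"}
print(pick_region(["us-east", "eu-west", "ap-south"], lambda r: r not in down))
# eu-west
```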

Final Thoughts

Embracing the Kubernetes architecture model can be a huge win for software teams of all sizes. It can greatly simplify your deployment and integration processes, but you can’t forget about the complex environment underneath. Kubernetes is designed to react to certain issues, but it can’t anticipate everything.

The principles of chaos engineering urge developers to introduce unexpected changes to an environment and observe how the system reacts. Running these types of tests with Kubernetes is a great way to understand weaknesses in your network, compute and storage layers and to find better ways to configure your cluster for resilience and performance. With the right setup, unplanned downtime can be greatly reduced.

Sam Bocetta

Sam Bocetta is a freelance journalist specializing in U.S. diplomacy and national security, with emphases on technology trends in cyberwarfare, cyberdefense, and cryptography.
