I recently came across a compilation of Kubernetes failure stories, a public list maintained by Henning Jacobs, a senior principal engineer at ZalandoTech. This community-driven project provides a comprehensive view of Kubernetes anti-patterns and hard-won lessons on how not to run Kubernetes.
The stories on k8s.af are written by the engineers and implementers who lived them, and describe many unfortunate situations: CPU limits that cause high latency, IP ceilings that prevent autoscaling, missing application logs, killed pods, 502 errors, slow deployments, production outages; the list goes on.
Hopefully, by analyzing these failure stories, others can learn how to better configure and improve their K8s environments. Here are some of the stories.
1. CPU Limits Cause High Latency
Setting CPU limits is a double-edged sword. Without limits, a single container can exhaust all available CPU on a node; with artificial limits in place, aggressive throttling can trigger a cascading series of events that brings performance to a grinding halt and starves other components.
To enforce those limits, Kubernetes relies on the Linux Completely Fair Scheduler quota (CFS quota) to throttle containers that exceed their CPU allowance. Unfortunately, aggressive throttling can lead to serious performance issues.
The story of Buffer is one example. After experiencing poor performance due to artificial throttling, the infrastructure team eventually decided to remove CPU limits and throttling for user-facing instances, instead assigning the correct CPU on a per-node basis with a margin of more than 20%. By doing so, the team reduced container latency significantly across the board; the main landing page ended up 22 times faster.
“Our transition to a microservices architecture has been full of trial and error. Even after a few years running k8s, we are still learning its secrets,” wrote Eric Khun, an infrastructure engineer at Buffer.
Removing CPU limits should nonetheless be approached with caution. Khun recommends “upgrading your kernel version over removing the CPU limits. If your goal is low latency, remove CPU limits, but be really mindful when doing this.” He also recommends setting proper CPU requests and adding monitoring with a solution such as Datadog.
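A minimal sketch of this advice might look like the pod spec below (all names are hypothetical, not from Buffer's setup): the CPU limit is omitted so the container is never CFS-throttled, while the CPU request still tells the scheduler how much capacity to reserve.

```yaml
# Illustrative pod spec: explicit CPU/memory requests, no CPU limit.
# Dropping the CPU limit avoids CFS throttling; the request still lets the
# scheduler place the pod on a node with enough spare capacity.
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend            # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:latest # hypothetical image
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          memory: "256Mi"       # keep a memory limit; only the CPU limit is dropped
```

Keeping the memory limit is deliberate: memory, unlike CPU, is not a compressible resource, so an unbounded container can trigger node-level out-of-memory kills.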
2. Missing Application Logs
Logging is essential for diagnosing errors and remediating issues. But what happens if your application is not producing logs?
PrometheusKube shares the story of an odd outage: one day, for no obvious reason, a specific node stopped shipping logs. The team was using fluent-bit to ship logs, and noticed Elasticsearch was failing certain requests.
After turning on debug logging, the team decided to deploy Fluentd, slowly rolling it out to replace fluent-bit node by node. “It’s impressive how Kubernetes allows you to iterate on deploying new software quickly,” the team says. After one more configuration change, they were finally able to send logs without errors.
PrometheusKube recommends monitoring based on traffic, using a black-box monitoring method, to catch similar situations before they become outages.
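As a sketch of what such traffic-based monitoring could look like, a Prometheus alerting rule might fire when an instance ships no log records for a sustained period. The metric name here is an assumption (fluent-bit exposes per-output counters when its Prometheus endpoint is enabled); adapt it to whatever your shipper actually exports.

```yaml
# Illustrative Prometheus alerting rule: page when a log shipper goes quiet.
# fluentbit_output_proc_records_total is assumed; check your shipper's metrics.
groups:
  - name: logging
    rules:
      - alert: NodeStoppedShippingLogs
        expr: rate(fluentbit_output_proc_records_total[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has shipped no log records for 15m"
```

The point of the black-box framing is that the alert watches observed throughput rather than the shipper's own health checks, so it catches "process up, logs silently dropped" failures like the one described above.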
3. Autoscaling Blocked Due To IP Ceiling
What’s great about cloud-native architecture is the ability to scale — rapidly and efficiently. Elastic computing models help an application respond to new demands automatically. If a computing environment cannot create new IP addresses, however, autoscaling isn’t possible.
This was the situation Dmitri Lerko, head of DevOps at Love Holidays, described on his personal blog. The Love Holidays team learned of the issue after receiving reports of slow deployments. An application that would typically take minutes to deploy was taking hours. Half of the pods in a cluster were serving traffic as usual, but the other half were stuck in a pending state. How had they run out of IP addresses?
It turns out that, by default, Google Kubernetes Engine (GKE) uses far more IP addresses than anticipated. “GKE allocates 256 IPs per node, meaning that even large subnets like /16 can run out pretty quickly when you’re running 256 nodes,” Lerko says. To avoid similar issues, Lerko recommends reducing the maximum number of pods per node, using subnet expansion to increase the available IP range, or increasing the size of existing nodes.
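The arithmetic behind the quote can be sketched in a few lines, assuming GKE's documented sizing rule: each node is reserved the smallest power-of-two address range that holds at least twice its max-pods setting (the default of 110 pods per node yields a /24, i.e. 256 addresses).

```python
# Back-of-the-envelope math for GKE pod IP allocation (illustrative).
import math

def node_range_size(max_pods: int) -> int:
    """Addresses reserved per node: smallest power of two >= 2 * max_pods."""
    return 2 ** math.ceil(math.log2(2 * max_pods))

def max_nodes(subnet_prefix: int, max_pods: int = 110) -> int:
    """How many nodes a pod subnet of the given prefix length can support."""
    total_addresses = 2 ** (32 - subnet_prefix)
    return total_addresses // node_range_size(max_pods)

print(max_nodes(16))               # default 110 pods/node -> /24 per node -> 256 nodes
print(max_nodes(16, max_pods=32))  # 32 pods/node -> /26 per node -> 1024 nodes
```

This makes Lerko's first recommendation concrete: lowering max pods per node from 110 to 32 shrinks the per-node reservation from 256 addresses to 64, quadrupling how many nodes the same /16 can support.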
4. Misconfigured Load Balancer Causes Total Outage
Total or even partial production outages can greatly affect user experience and inhibit business growth. Marcel Juhnke, writing for DevOps Hof, describes how a misconfiguration resulted in a total ingress outage in a particular cluster while migrating workloads from one node pool to another in GKE. Remedying the situation was as simple as deleting the old nginx-ingress-controller pods. Nevertheless, Juhnke advises, “Before doing any changes that might touch any traffic, look twice into the documentation.”
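Assuming labels similar to a standard nginx-ingress deployment (the namespace and selector here are illustrative, not taken from the incident), the fix amounted to something like:

```shell
# Delete the stale controller pods; their Deployment recreates healthy ones.
# Namespace and label selector are hypothetical; check yours with
#   kubectl get pods --all-namespaces -l app=nginx-ingress-controller
kubectl -n ingress-nginx delete pod -l app=nginx-ingress-controller
```

Because the pods are owned by a Deployment, deleting them is non-destructive: the controller immediately schedules fresh replacements that pick up the current configuration.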
5. Cryptominer Caught on Kubernetes Development Clusters
With cryptocurrency rising in value, hackers are on the lookout for vulnerable compute power to hijack for mining cryptocurrencies such as Monero. This is what happened to JW Player, writes Brian Choy of the JW Player DevOps team.
After receiving numerous automated alerts about increased load, the DevOps team dug deeper and found a highly suspicious process running at 100% CPU utilization. In short, the attacker had exploited Weave Scope, a Kubernetes monitoring tool whose dashboard was reachable through a public-facing load balancer and a permissive security group. From there, the attacker was able to gain access to the root directory on JW Player’s Kubernetes nodes.
While this particular breach did not affect any production services, it did waste compute power; a breach of this magnitude is alarming, to say the least. The team immediately remediated the situation by removing the Weave Scope deployment, making updates and improving RBAC permissions to restrict Weave Scope’s access.
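A sketch of what tighter, read-only RBAC for a monitoring tool can look like follows; all names are hypothetical, and the real Weave Scope manifests differ.

```yaml
# Illustrative read-only RBAC for a monitoring tool's service account.
# Names are hypothetical; the point is granting list/watch access only,
# with no verbs that allow exec-ing into or mutating workloads.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-readonly
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: monitoring-readonly
subjects:
  - kind: ServiceAccount
    name: monitoring            # hypothetical service account
    namespace: monitoring       # hypothetical namespace
roleRef:
  kind: ClusterRole
  name: monitoring-readonly
  apiGroup: rbac.authorization.k8s.io
```

Scoping the tool to read-only verbs means that even if its dashboard is exposed again, an attacker inherits observation rights rather than control of the cluster.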
The team is also considering adopting more insightful monitoring for behavioral analyses and anomaly and intrusion detection. Choy reports they are also looking into service mesh options, such as Linkerd and Istio, to protect end-to-end traffic.
“A postmortem describes a production outage or paging event including a timeline, description of user impact, root cause, action items and lessons learned,” writes David N. Blank-Edelman, in Seeking SRE.
We’ve only scratched the surface of many documented Kubernetes postmortems. Hopefully, by studying some of these situations, others can avoid the same fate. From the five stories we analyzed above, some takeaways include:
- High latency can easily be caused by CPU limits and aggressive CPU throttling.
- Monitoring based on traffic can help spot logging issues.
- Understand how your setup handles IP addresses, and plan accordingly.
- Watch for outages during migration attempts.
- Insecure defaults from managed services and third-party tools remain a common vulnerability, so limit what you expose publicly. More advanced behavioral monitoring and a service mesh could help identify anomalies such as cryptojacked clusters.
10 More Weird Ways to Blow Up Your Kubernetes, presented by Airbnb at KubeCon NA 2020, is another roundup of common Kubernetes pitfalls. To get your own Kubernetes into top-notch working order, you can review these and other failure stories on k8s.af.