Using Machine Learning and Kubernetes Logs to Automate Security Threat Detection

The audit log is a rich source of information on activity within your Kubernetes environment and a valuable tool for detecting threats

Kubernetes is quickly consolidating its place as the leading container orchestration platform for cloud-native applications, with adoption at 59% among enterprise IT professionals as of March. But while Kubernetes delivers agility, flexibility and scalability for DevOps teams, it also creates complexity that can be an enigma for SecOps teams—especially when something goes wrong.

When it comes to detecting threats and tracking down breaches in Kubernetes, security teams’ key asset is the Kubernetes API server audit log. The audit log captures every successful or unsuccessful API server call, whether made by human users, automation pipelines, controllers, operators or system components, including operations performed by principals identified as service accounts. In other words, the audit log is the single source of truth for what’s happening inside a Kubernetes infrastructure.

But just because the audit log tracks all the activity in your Kubernetes environment doesn’t make it easy to find what you’re looking for. The larger your infrastructure, the more events there will be to parse. The Kubernetes audit log is also extremely verbose, capturing a number of identifying attributes for each action, such as which resource was accessed, who accessed it and what IP address the request came from. When the Kubernetes infrastructure involves multiple clusters, which is the most common use pattern, placing effective security monitoring controls across all of them becomes a daunting task, and manual audit log monitoring becomes impractical.
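To make those identifying attributes concrete, here is a minimal sketch that reduces one JSON audit event to the who, where and what described above. The field names (user.username, sourceIPs, verb, objectRef, responseStatus.code) follow the audit.k8s.io/v1 event format; the sample event, its values and the summarize_event helper are invented for illustration, and real events carry many more fields.

```python
import json

# Hand-written sample event for illustration only; all values are hypothetical.
SAMPLE_EVENT = json.dumps({
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "stage": "ResponseComplete",
    "verb": "get",
    "user": {"username": "system:serviceaccount:payments:billing-job"},
    "sourceIPs": ["192.0.2.17"],
    "objectRef": {"resource": "secrets", "namespace": "payments",
                  "name": "db-credentials"},
    "responseStatus": {"code": 200},
})


def summarize_event(line: str) -> dict:
    """Reduce one JSON audit event to the attributes used for monitoring."""
    event = json.loads(line)
    ref = event.get("objectRef") or {}  # absent for non-resource requests
    status = event.get("responseStatus") or {}
    return {
        "who": event.get("user", {}).get("username", "unknown"),
        "from": event.get("sourceIPs", []),
        "verb": event.get("verb"),
        "resource": ref.get("resource"),
        "namespace": ref.get("namespace"),
        "name": ref.get("name"),
        "status": status.get("code"),
    }


summary = summarize_event(SAMPLE_EVENT)
print(summary)
```

Even this reduced view shows why volume is the problem: every API call in every cluster produces a record like this one.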

One simple approach to automating Kubernetes audit log monitoring, which is quite useful for compliance purposes, is to use static policies that trigger alerts when a violation occurs. For example, an alert could be triggered if a human user accesses a workload that processes sensitive data. This approach is useful when you have a precise notion of which use patterns are allowed, as in cases of regulatory compliance. However, in real-life deployments, adversaries never limit themselves to what is allowed.
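A static policy of this kind can be sketched in a few lines. The example below assumes that sensitive workloads live in a known set of namespaces (the "payments" namespace is hypothetical) and that any username outside the reserved "system:" prefix belongs to a human; both are simplifying assumptions, not a rule Kubernetes itself enforces.

```python
# Hypothetical set of namespaces that hold workloads processing sensitive data.
SENSITIVE_NAMESPACES = {"payments"}


def is_human(username: str) -> bool:
    # Assumption: service accounts and system components use the reserved
    # "system:" prefix, so any other non-empty name is treated as a human.
    return bool(username) and not username.startswith("system:")


def violates_policy(event: dict) -> bool:
    """Return True if a human user touched a resource in a sensitive namespace."""
    username = event.get("user", {}).get("username", "")
    namespace = (event.get("objectRef") or {}).get("namespace")
    return is_human(username) and namespace in SENSITIVE_NAMESPACES


# Two hand-written events (hypothetical names) exercising the rule.
human_event = {
    "user": {"username": "alice@example.com"},
    "verb": "get",
    "objectRef": {"resource": "pods", "namespace": "payments", "name": "ledger-0"},
}
robot_event = {
    "user": {"username": "system:serviceaccount:payments:billing-job"},
    "verb": "get",
    "objectRef": {"resource": "pods", "namespace": "payments", "name": "ledger-0"},
}

print(violates_policy(human_event), violates_policy(robot_event))
```

The rule is easy to audit, but it only catches the exact pattern it names; an adversary acting through a service account passes it untouched.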

Another approach is threshold-based security monitoring, which captures when API events, or combinations of events, rise above a certain threshold. However, this tactic can be easily bypassed by an adversary simply by controlling the rate of compromise attempts, spreading the attack out over time to avoid crossing the threshold.
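The evasion is easy to demonstrate. The sketch below uses a sliding-window counter with an assumed limit of five events per 60 seconds (both numbers are arbitrary) and feeds it the same ten attempts twice: once as a burst and once spread out.

```python
from collections import deque


class SlidingWindowThreshold:
    """Fire when more than `limit` events fall within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.times = deque()

    def observe(self, t: float) -> bool:
        """Record an event at time t (in seconds); return True if the alert fires."""
        self.times.append(t)
        # Drop events that have aged out of the window.
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()
        return len(self.times) > self.limit


def any_alert(timestamps, limit=5, window_seconds=60.0) -> bool:
    detector = SlidingWindowThreshold(limit, window_seconds)
    # Evaluate every event so the detector sees the full sequence.
    return any([detector.observe(t) for t in timestamps])


burst = [float(i) for i in range(10)]      # 10 attempts, one per second
slow = [float(i * 15) for i in range(10)]  # the same 10 attempts, one every 15 seconds

print(any_alert(burst), any_alert(slow))
```

The burst trips the alert on its sixth event, while the slowed-down sequence never has more than four events in any window and goes unnoticed.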

Threshold-based monitoring is also prone to false positives that can wear out security teams and create alert fatigue. For example, periodic batch processing workloads, which create a frenzy of API calls, would break through pre-configured thresholds but aren’t actually a threat. To separate threats from routine deployment events, you need to understand the context, something threshold-based monitoring alone cannot provide.

With significant maintenance effort, and at the cost of potentially high rates of false alerts, threshold-based monitoring can help security or operations teams catch known threats and vulnerabilities, but it won’t catch unknown ones. Again, to identify a threat, you need to be able to understand the context of any anomalous behavior: who is making the API calls, from where, and which resources they are accessing. You also need to observe this data over time to establish what normal behavior looks like so you can differentiate it from anomalous behavior.

As a powerful, elastic and flexible cloud-native application infrastructure that drives many automatic processes, Kubernetes is inherently complex. To detect a security threat in Kubernetes, you need an adaptive security monitoring solution powered by machine learning. By observing, over time, the activity of the various actors within a Kubernetes cluster as recorded in the API server audit log, machine learning can learn what normal looks like and detect anomalous patterns in actor activity that simple tools will miss.

Learning the behavior of your clusters is no trivial task. Profiles for different users, components and automated services need to be built over time, and with so many moving parts, this is impractical to do manually. At the same time, new vulnerabilities are constantly coming to light, and there will always be vulnerabilities that are waiting to be uncovered. You can set alerts for vulnerabilities you know, but to catch unknown threats you need a strong baseline of activity in your cluster to monitor against, something that the right machine learning algorithms can readily capture and adapt to over time.

While machine learning is a broad topic, it is quite clear that the traditional approach of building supervised or unsupervised models offline would not yield satisfying detection results, simply because different environments are accessed and behave differently. There has to be a component that learns and adapts to the specific Kubernetes environment being monitored.
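As a rough illustration of a component that learns in place, the sketch below keeps a running mean and variance of API call volume per actor (Welford's online method) and flags a sample whose z-score exceeds a threshold. This is a deliberately simple stand-in chosen for the example, not the algorithm any particular product uses; the actor names, the threshold of 3, the minimum sample count and the standard deviation floor are all assumptions.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online mean and variance; no history is stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


class ActorBaseline:
    """Per-actor baseline of API calls per interval, learned as data arrives."""

    def __init__(self, z_threshold=3.0, min_samples=5, min_std=1.0):
        self.stats = defaultdict(RunningStats)
        self.z_threshold = z_threshold
        self.min_samples = min_samples
        self.min_std = min_std  # floor so near-constant actors do not divide by ~0

    def observe(self, actor: str, calls: float) -> bool:
        """Return True if `calls` is anomalous for this actor; otherwise learn from it."""
        s = self.stats[actor]
        anomalous = False
        if s.n >= self.min_samples:
            z = abs(calls - s.mean) / max(s.std(), self.min_std)
            anomalous = z > self.z_threshold
        if not anomalous:
            s.update(calls)  # only fold normal samples into the baseline
        return anomalous


baseline = ActorBaseline()
for calls in [500, 510, 495, 505, 490, 500]:  # a chatty batch job (hypothetical)
    baseline.observe("system:serviceaccount:etl:nightly", calls)
for calls in [3, 4, 2, 3, 4, 3]:              # a quiet human user (hypothetical)
    baseline.observe("alice@example.com", calls)

batch_flag = baseline.observe("system:serviceaccount:etl:nightly", 505)
user_flag = baseline.observe("alice@example.com", 500)
print(batch_flag, user_flag)
```

The same volume of roughly 500 calls is routine for the batch job and highly unusual for the user, which is the context a single global threshold cannot express.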

Here are two security use cases that machine learning can reveal through the analysis of the Kubernetes audit log stream:

Stolen Service Account Tokens and User Credentials

Machine learning-based audit log monitoring tools can alert security teams to suspected credential theft. For example, if the same cluster credentials are reused from multiple different geographic or network locations within a short period of time, this can be flagged as anomalous behavior and, if combined with additional abnormalities, could indicate stolen credentials.

By tracking and learning the geographical attributes and the access patterns of individual users or principals, machine learning can detect this anomalous behavior and flag it for security teams. In response, the security team can take actions to reduce or eliminate the blast radius—such as limiting access to the API server from the specific suspected IP address or modifying the Kubernetes role-based access control (RBAC) policy to reduce access privileges for the specific account.
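The credential-reuse signal can be sketched without any learning at all, which helps show what a learned model would then refine. The example below tracks the distinct source IPs seen per principal in a sliding window and flags a principal seen from more than two sources in five minutes. The window, the limit and the principal name are assumptions, the addresses come from the reserved documentation ranges, and a real system would map IPs to geographic or network locations and learn each principal's usual set rather than use a fixed count.

```python
from collections import defaultdict, deque


class SourceSpreadDetector:
    """Flag a principal seen from too many distinct source IPs in a short window."""

    def __init__(self, window_seconds=300.0, max_sources=2):
        self.window = window_seconds
        self.max_sources = max_sources
        self.recent = defaultdict(deque)  # username -> deque of (time, ip)

    def observe(self, username: str, source_ip: str, t: float) -> bool:
        """Record one request at time t (seconds); return True if suspicious."""
        seen = self.recent[username]
        seen.append((t, source_ip))
        # Forget requests that have aged out of the window.
        while seen and seen[0][0] <= t - self.window:
            seen.popleft()
        distinct_sources = {ip for _, ip in seen}
        return len(distinct_sources) > self.max_sources


detector = SourceSpreadDetector()
token = "system:serviceaccount:payments:billing-job"  # hypothetical principal

flags = [
    detector.observe(token, "192.0.2.10", 0.0),
    detector.observe(token, "198.51.100.7", 20.0),
    detector.observe(token, "203.0.113.5", 40.0),  # third distinct source in 40 s
]
print(flags)
```

A flag from this check is only a hint; as noted above, it becomes meaningful when combined with other abnormalities for the same principal.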

Misconfigured Kubernetes RBAC

In addition to detecting threat actors that attempt to pivot within the Kubernetes cluster, machine learning can also leverage the audit log to detect attempted exploits of known or unknown vulnerabilities if the RBAC permissions are misconfigured or over-permissive. On the flip side, once RBAC policies are properly configured, reducing RBAC privileges to the required minimum, machine learning can alert when unauthorized users make unsuccessful API calls to access sensitive resources, indicating a failed exploit attempt.
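The second half of that paragraph, failed calls against sensitive resources, can be illustrated with a simple filter over audit events. The sketch assumes that an RBAC denial appears as HTTP status 403 in responseStatus.code, and it treats a small, hypothetical set of resource types as sensitive; a learned model would go further and weigh these denials against each principal's normal behavior.

```python
from collections import Counter

# Hypothetical choice of resource types considered sensitive.
SENSITIVE_RESOURCES = {"secrets", "rolebindings", "clusterrolebindings"}


def denied_sensitive_access(events) -> Counter:
    """Count forbidden (403) requests to sensitive resources per principal."""
    denied = Counter()
    for event in events:
        code = (event.get("responseStatus") or {}).get("code")
        resource = (event.get("objectRef") or {}).get("resource")
        if code == 403 and resource in SENSITIVE_RESOURCES:
            denied[event.get("user", {}).get("username", "unknown")] += 1
    return denied


def make_event(username, resource, code):
    # Helper for building minimal hand-written events in this example.
    return {
        "user": {"username": username},
        "objectRef": {"resource": resource, "namespace": "default"},
        "responseStatus": {"code": code},
    }


web = "system:serviceaccount:default:web"  # hypothetical principal
events = [
    make_event(web, "secrets", 403),
    make_event(web, "secrets", 403),
    make_event(web, "pods", 403),      # denied, but not a sensitive resource
    make_event(web, "configmaps", 200),
]

denied = denied_sensitive_access(events)
print(denied)
```

A workload account that suddenly accumulates denials on secrets it never needed before is a reasonable candidate for investigation.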

Ultimately, the audit log is an incredibly rich source of information on activity within your Kubernetes environment and a valuable tool for both detecting threats and vulnerabilities and forensically tracing breaches. Although the complexity prohibits manual monitoring, machine learning can give security teams real-time observability into security incidents within their Kubernetes environments and enable informed investigations and audits that get to the root of problems more quickly.