Best Practices for Kubernetes Incident Response

Kubernetes is the world’s most popular container orchestrator, used to manage large-scale applications running on container engines like Docker. Containers are rapidly replacing virtual machines as the go-to choice for workload deployment, and Kubernetes is becoming a critical part of IT infrastructure, due to capabilities like multi-cloud support and autoscaling.

Organizations are running everything from web applications to distributed batch jobs to mission-critical enterprise applications on Kubernetes. Any system that runs critical applications becomes a target for attacks, and Kubernetes is no exception. However, Kubernetes raises new security challenges. Containerized environments are characterized by high complexity, a large number of moving parts and low visibility. This makes it difficult for security teams to detect, not to mention respond to, attacks on the Kubernetes control plane and individual pods and containers. 

In this article, I’ll explain how Kubernetes fits into the incident response puzzle, and how to improve your organization’s ability to respond to attacks on containerized infrastructure. 

What Is Incident Response?

Incident response is a set of actions implemented in the immediate aftermath of an event that compromises the security of a system, such as data loss, a service shutdown or a cyberattack.

Most organizations create an Incident Response Plan (IRP) to provide guidelines and instructions for teams to respond to incidents. This makes it possible to detect events in a timely manner, remediate security issues and restore systems. The end goal is to minimize the damage to the system and the duration of service disruption.

If you don’t have a properly thought out incident response strategy in place, you will be much slower to respond to incidents, and may miss critical security incidents altogether. This results in the common phenomenon of attackers penetrating a system and then “dwelling” in it for weeks, months or even years. This can create enormous cumulative damage, hurt business operations, damage reputation and result in legal liabilities. 

Here is an incident response process you can use in a containerized environment:

  • Identify—fast and accurate incident identification is the key to robust and effective event management. The focus of this step is to monitor security events to detect and report events in Kubernetes clusters that have security significance.
  • Coordinate—after an incident is reported, the response team reviews and evaluates the nature of the event to determine whether it represents a potential security incident and initiates the incident response process.
  • Resolve—the focus of this phase is to investigate the root cause, limit the impact of the incident, address imminent security risks, implement remediation measures such as necessary fixes and restore affected systems, data and services. 
  • Improve—as new incidents occur, new insights can be gained to help improve tools, training and processes.

How Kubernetes Changes Incident Management

The impact of Kubernetes on incident management can be broken down into two categories: ways in which Kubernetes makes life easier for site reliability engineers (SREs) and incident response teams, and problems or complexities that Kubernetes introduces to incident management.

Opportunities

Kubernetes can make it easier for incident response teams and SREs to carry out their objectives. From the perspective of an SRE, one of the main benefits of Kubernetes is the introduction of automation to help make the application environment more reliable.

Orchestration, which is the primary focus of Kubernetes, and reliability are closely intertwined. Kubernetes can perform tasks such as automatically restarting failed services and moving workloads from a failing host server (or node) to another. As of Kubernetes 1.8, you can also use the Cluster Autoscaler to automatically adjust the number of nodes in your Kubernetes environment according to the requirements of the application.

Challenges

Incident response in Kubernetes is difficult because the data required to assess how an event may impact your system is often hard to retrieve.

In a distributed environment with volatile containers (many containers are ephemeral and have a life cycle of only a few minutes), it is difficult to conduct investigations and forensic analysis. Logs locally stored on cluster components like nodes and pods are eliminated when those elements are shut down or regenerated. 

Any element that developers have not instrumented up front will not be reflected in centralized logs—and thus cannot be used for incident response investigations. In a traditional environment with VMs and servers, you could SSH into a system to investigate it, but this is usually not possible in containerized environments, because resources are ephemeral. 

To summarize why investigation of incidents is difficult in a containerized environment:

  • Short life span: containers can be spun up and down within seconds. If an attacker exfiltrates data and the container is then shut down, any record of the attack that was stored only inside the container disappears with it.
  • Communication: containers communicate over ad-hoc networks defined by virtualization, with frequently changing internal IP addresses.
  • Separation from workloads: applications and their interactions with the infrastructure are separated, with the Kubernetes control plane isolated from workloads. Information on changes to orchestration, access and privileges is provided by Kubernetes API audit logs, but there is limited visibility into the containers themselves.

Without deliberate preparation, it is very hard to get an understanding of all changes made in your cluster. If you cannot map system activity to services or users, the security team cannot identify malicious behavior or misconfigurations in Kubernetes. The information provided by existing tools is often not correlated, so the security team is limited in its ability to determine who did what in a Kubernetes cluster.
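As a starting point for mapping activity to users, here is a minimal Python sketch that groups Kubernetes API audit events by user. It assumes audit logging is enabled and writes one JSON event per line, and it relies only on fields from the audit.k8s.io/v1 Event format (stage, verb, user.username, objectRef, responseStatus.code, requestReceivedTimestamp). The function name summarize_audit_log is made up for this illustration.

```python
import json
from collections import defaultdict


def summarize_audit_log(lines):
    """Group Kubernetes audit events by user.

    `lines` is an iterable of JSON strings, one audit event per line.
    Returns {username: [(timestamp, verb, object_path, status_code), ...]}.
    """
    actions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines instead of aborting
        if not isinstance(event, dict):
            continue
        # A request can be logged at several stages; keep only the final one.
        if event.get("stage") != "ResponseComplete":
            continue
        user = (event.get("user") or {}).get("username", "<unknown>")
        ref = event.get("objectRef") or {}
        parts = (ref.get("namespace"), ref.get("resource"), ref.get("name"))
        path = "/".join(part for part in parts if part)
        status = (event.get("responseStatus") or {}).get("code")
        actions[user].append(
            (event.get("requestReceivedTimestamp"), event.get("verb"), path, status)
        )
    return dict(actions)
```

Calling summarize_audit_log(open(path)) on an audit log file returns a per-user list of actions, which is a rough but useful first answer to "who did what".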

Security Controls and Forensic Analysis for Kubernetes

In light of the challenges raised above, here are a few ways to put the appropriate controls in place to enable effective Kubernetes incident response.

Leverage Kubernetes Monitoring Tools

Kubernetes monitoring tools enhance visibility into containers, pods and clusters. They help ensure reliability, troubleshoot problems as they arise, tune performance, reduce costs and—most importantly—identify and shed light on security issues.

When selecting monitoring tools for Kubernetes, consider tools that are designed to deal with the threats affecting Kubernetes and are able to operate in a containerized environment. A good example is Prometheus, an open source monitoring tool that uses service discovery to identify elements in your Kubernetes clusters and relies on exporters to pull metrics from those elements into a central time-series database. Because Prometheus collects metrics rather than logs, pair it with a centralized logging system.
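To illustrate how such metrics can feed incident detection, the following Python sketch builds a Prometheus instant-query URL and picks out containers with many restarts from the JSON response. It assumes kube-state-metrics is deployed (it exposes the kube_pod_container_status_restarts_total counter) and that the labels namespace, pod and container are present. The threshold of 5 is an arbitrary example value, and a high restart count is only a hint that something deserves a closer look.

```python
import json
from urllib.parse import urlencode

# Assumption: kube-state-metrics is installed and exposes this counter.
RESTART_QUERY = "kube_pod_container_status_restarts_total"


def build_query_url(base_url, query=RESTART_QUERY):
    """Return the instant-query URL (/api/v1/query) for a Prometheus server."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def containers_over_threshold(response_text, threshold=5):
    """Return (namespace, pod, container, restarts) rows above `threshold`."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # An instant-vector sample value is [timestamp, "value as string"].
        restarts = float(sample["value"][1])
        if restarts > threshold:
            rows.append(
                (labels.get("namespace"), labels.get("pod"),
                 labels.get("container"), int(restarts))
            )
    # Most restarts first.
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

The response body can be fetched with any HTTP client, for example urllib.request.urlopen(build_query_url("http://prometheus:9090")), where the host name is a placeholder for your own Prometheus address.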

Make It Clear When to Escalate

Before launching code into a production cluster, it is important to understand the application and infrastructure security model and make it clear to DevOps teams what constitutes a security incident and when they should call in security specialists or request other assistance. 

For example, an incident response might begin with the operations team submitting a potential incident and then classifying it as a security incident. Operations then assigns the incident to the appropriate security team member. An incident response plan defines when to call in external security experts and how to engage them. Developing these processes is critical to ensure effective incident response, because the first person who will likely notice or identify a security issue is a cluster administrator. 
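One way to make escalation criteria unambiguous is to write them down as data that both operations and security can review. The Python sketch below is purely illustrative: the signal names, the Report fields and the "two signals in production" rule are hypothetical and would be replaced by whatever your incident response plan defines.

```python
from dataclasses import dataclass

# Hypothetical signals a team might agree to treat as security relevant.
SECURITY_SIGNALS = {
    "unexpected_privileged_pod",
    "unknown_image_registry",
    "secret_read_by_unusual_user",
    "exec_into_production_pod",
}


@dataclass
class Report:
    summary: str
    signals: set
    production: bool


def triage(report):
    """Return who should handle a report under the example rules."""
    matched = report.signals & SECURITY_SIGNALS
    if not matched:
        return "operations"
    # Example rule: several signals in production go straight to external experts.
    if report.production and len(matched) >= 2:
        return "external-experts"
    return "security-team"
```

Keeping the rules in one small, versioned place means a cluster administrator does not have to guess whether an observation counts as a security incident.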

Container Forensics

When a security incident happens, how can DevOps and security teams investigate the issue? Here are three ways to gather forensics information about a container that appears to have been compromised.

1. Collect logs

Obtain logs for the underlying cloud infrastructure, Kubernetes audit logs related to the container and pod, application logs and operating system logs. User logins, network connections, SSH sessions, and running processes can all be significant for investigation.
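Because pod-level data can vanish quickly, it helps to script the first round of collection. The Python sketch below builds a set of kubectl commands for one suspect pod and saves their output to files. It assumes kubectl is installed and configured for the cluster. The function names and the output file naming are invented for this example, and the command list is a starting point, not a complete evidence set.

```python
import subprocess
from pathlib import Path


def evidence_commands(namespace, pod, node):
    """Return kubectl argument lists that capture state for one suspect pod."""
    logs = ["kubectl", "logs", pod, "-n", namespace, "--all-containers", "--timestamps"]
    return [
        ["kubectl", "get", "pod", pod, "-n", namespace, "-o", "yaml"],
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        logs,
        logs + ["--previous"],  # logs of the previous container instance, if any
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={pod}"],
        ["kubectl", "describe", "node", node],
    ]


def collect(commands, out_dir, run=subprocess.run):
    """Run each command and write its output to a numbered file in `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, argv in enumerate(commands):
        # No check=True: a failing command (e.g. no previous container)
        # should be recorded, not stop the collection.
        result = run(argv, capture_output=True, text=True)
        target = out / f"{index:02d}-{argv[1]}.txt"
        target.write_text(f"$ {' '.join(argv)}\n{result.stdout}{result.stderr}")
        written.append(target)
    return written
```

For example, collect(evidence_commands("prod", "web-1", "node-a"), "evidence/web-1") would write six files, with the namespace, pod and node names here being placeholders.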

2. Create a snapshot of the node

The next step is to take a disk snapshot of the node running the container. Then you can move other workloads, isolate the node, and perform further analysis:

  • Identify the affected node and any disks attached.
  • Make a disk copy.
  • Send copied disk images for analysis.
  • Use a tool such as Docker Explorer to inspect the container file systems on the copied disk offline.
  • Compare binary differences in disk snapshots.
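The last step, comparing snapshots, can be approximated at the file level. The Python sketch below assumes both snapshots have already been mounted read-only as directories (for instance a known-good baseline and the suspect copy) and reports files that were added, removed or changed, based on SHA-256 digests. It is a simple stand-in for a full binary diff, and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each regular file's relative path to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        # Skip directories, symlinks and special files.
        if path.is_symlink() or not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not fill memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        digests[path.relative_to(root).as_posix()] = digest.hexdigest()
    return digests


def diff_snapshots(baseline_root, suspect_root):
    """Return files added, removed or changed between two mounted snapshots."""
    before = hash_tree(baseline_root)
    after = hash_tree(suspect_root)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            path for path in before.keys() & after.keys()
            if before[path] != after[path]
        ),
    }
```

Unexpected entries under system binary or startup script paths in the "added" or "changed" lists are natural candidates for closer inspection.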

3. Correlate with other security data

Kubernetes security is not separate from other layers of your security stack. Correlate forensic data from containers with other, traditional security data:

  • Firewall and intrusion prevention/detection system (IPS/IDS) logs
  • Data from identity and access management (IAM) systems
  • Endpoint security alerts
  • SIEM alerts
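Correlation often comes down to lining up events from different sources around a point in time. The Python sketch below assumes events from each tool have already been normalized into dictionaries with "source", "time" (ISO 8601) and "ip" keys. That schema is an assumption for this example, not something any of the tools above emit directly. Given one anchor event, such as a suspicious Kubernetes audit entry, it returns events from other sources that share the IP address and fall within a time window.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def correlate(anchor, events, window_seconds=300):
    """Return events near `anchor` in time that share its IP, oldest first."""
    center = parse_ts(anchor["time"])
    window = timedelta(seconds=window_seconds)
    related = [
        event for event in events
        if event.get("ip") == anchor.get("ip")
        and abs(parse_ts(event["time"]) - center) <= window
    ]
    return sorted(related, key=lambda event: parse_ts(event["time"]))
```

The five-minute default window is arbitrary. In practice you would widen or narrow it depending on how far clocks drift between systems and how slowly the attack unfolds.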

Conclusion

Kubernetes is changing incident response, and there are a few key areas where you can improve your ability to respond to Kubernetes attacks:

  • Leverage Kubernetes monitoring tools – Traditional monitoring does not work in a containerized environment, so teams should adopt and learn cloud-native monitoring systems that can run on distributed, ephemeral components.
  • Make it clear when to escalate – Kubernetes administrators will be the first to notice something is wrong. They should have a clear incident response process for investigating incidents and involving the security team.
  • Container forensics – Ensure you have the logs and data you need to respond to an incident, and that security teams have access to containerized environments. Make it easy to correlate Kubernetes logs with data from other security tools.

These best practices can help as you build your security strategy for the new cloud-native environment.

Gilad David Maayan

Gilad David Maayan is a technology writer who has worked with over 150 technology companies including SAP, Samsung NEXT, NetApp and Imperva, producing technical and thought leadership content that elucidates technical solutions for developers and IT leadership.
