Hardening Kubernetes Beyond NSA, CISA Guidance

Information security and data privacy in the cloud have an abysmal track record. Hardly a week goes by without a major cloud security incident making the news. We, as a community, need to shape up.

I have worked in cloud computing since 2008, and have both the battle scars and a PhD to show for it. Security has always been a top concern of mine, and in this article, I am giving away my security tips that go beyond what the NSA and CISA recommend in their Kubernetes Hardening Guidance tech report. In my experience, the realities of developing and operating applications in the cloud and Kubernetes require bigger-picture thinking than what those guidelines cover. 

That said, please follow all the recommendations in the NSA/CISA Kubernetes Hardening Guidance and then also do the following.

Prevent Misconfiguration, Don’t Just Check for It 

Role-based access control (RBAC) can determine who gets to do what and in what context. But just because a rule says that Lars gets to “update configuration” in the “production environment” doesn’t mean he should have unfettered access: Lars must also be prevented from making mistakes. After all, two-thirds of all insider threats are due to negligence.

Kubernetes, like most cloud systems, only ships with a system to enforce RBAC, but nothing to enforce reasonable policies that limit what the user can actually do.

Some systems can check for misconfiguration after the fact; AWS Config is gaining traction for this purpose.

But I’d much rather have a system that prevents misconfiguration altogether. Policies should be encoded in an automatically enforceable form. The CNCF project Open Policy Agent (OPA) can do just that. It can act as a Kubernetes admission controller and thereby ensure that policies cannot be violated. OPA is very versatile; you can learn from the official library, or pick ready-made policies from elsewhere and base your own on them.
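To make the admission-control idea concrete, here is a minimal sketch of a validating webhook that rejects any Pod requesting privileged containers. In practice you would express this as a Rego policy and let OPA enforce it; this Python webhook, with its made-up endpoint and port, only illustrates the mechanism.

```python
# A minimal validating admission webhook in the spirit of an OPA policy:
# reject any Pod that asks for privileged containers.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/validate", methods=["POST"])
def validate():
    review = request.get_json()
    pod = review["request"]["object"]
    # Collect the names of all containers that request privileged mode.
    privileged = [
        c["name"]
        for c in pod.get("spec", {}).get("containers", [])
        if c.get("securityContext", {}).get("privileged", False)
    ]
    response = {"uid": review["request"]["uid"], "allowed": not privileged}
    if privileged:
        response["status"] = {
            "message": f"privileged containers are not allowed: {privileged}"
        }
    return jsonify({
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    })

if __name__ == "__main__":
    # The Kubernetes API server only calls webhooks over TLS; certificate
    # setup is omitted to keep the sketch short.
    app.run(port=8443)
```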

Beware: Any Permission Given to an Application is Also Given to Bad Actors

If a bad actor manages to compromise your application, they will have the exact same permissions as the application. Perhaps this seems obvious. But my experience tells me that this is not taken seriously in practice. The bad actor will have all the capabilities within your Kubernetes container platform, within your network, within your cloud, within your third-party SaaS integrations, within your VPN-connected back-office location—all of the capabilities.

This is how bad stuff like ransomware manages to spread, after all. It infects just one point within a networked system and then keeps going. The chain really is only as strong as its weakest link.

If we truly think this through, it seems ridiculous that your REST API component should have any permissions at all except to process requests and send back responses. And you know what? That’s because it is ridiculous, and always has been.

Keep Cloud Resources in Mind, Too

We have all seen the headlines. Whether it’s S3 buckets that have been misconfigured to allow anonymous access or a master key for Microsoft Azure Cosmos DB making it possible to access any customer’s database, the message is clear. Whenever we use cloud resources, we must always keep them and their configuration in mind.

There are various controllers in the Kubernetes ecosystem that make cloud integrations simple, and simplicity is great! But simplicity must never be allowed to compromise security. So you have to make sure that these controllers don’t put naïve security settings on the resources they manage. You should flat-out reject tools that don’t clearly advertise how they manage security. Non-starters include tools that do not specify what IAM permissions they need and those that do not expose a way to configure which permissions they will put in place.
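Auditing what is already out there is a good complement to that scrutiny. Here is a hedged sketch, assuming boto3 is installed and AWS credentials come from the environment, that flags S3 buckets whose public-access block is missing or incomplete:

```python
# Sketch: flag S3 buckets that do not fully block public access.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        conf = s3.get_public_access_block(Bucket=name)[
            "PublicAccessBlockConfiguration"
        ]
        # All four settings must be on for the bucket to be locked down.
        fully_blocked = all(conf.values())
    except ClientError:
        fully_blocked = False  # no public-access block configured at all
    if not fully_blocked:
        print(f"WARNING: bucket {name} does not fully block public access")
```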

Does Your Application Unintentionally Have Permissions in Your Cloud?

Have you given your cloud servers permission to modify your cloud resources? What AWS calls instance profiles (other cloud providers have the same concept under different names) grant permissions to a virtual machine to modify cloud resources. By virtue of running inside that virtual machine, a containerized application can have that, too. All it takes is a few network calls to the cloud’s metadata service, and it has the same level of access you gave the server. Because it is running on the server, the cloud sees it as “the server.”
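To show how low the bar is, here is a sketch of those few network calls against the AWS instance metadata service (IMDSv2). Any process in any container on the instance can do the same unless access to the metadata endpoint is blocked:

```python
# Sketch: fetch the instance profile's temporary credentials from the
# AWS metadata service (IMDSv2), exactly as any compromised process could.
import urllib.request

BASE = "http://169.254.169.254/latest"

# Step 1: obtain an IMDSv2 session token.
token_request = urllib.request.Request(
    f"{BASE}/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
)
token = urllib.request.urlopen(token_request).read().decode()

def metadata(path: str) -> str:
    request = urllib.request.Request(
        f"{BASE}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return urllib.request.urlopen(request).read().decode()

# Step 2: find the role attached to the instance, then grab its keys.
role = metadata("iam/security-credentials/")
print(metadata(f"iam/security-credentials/{role}"))  # AccessKeyId, etc.
```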

What I’ve seen over and over is that people will add a few permissions here and there to the server’s instance profile to make whatever they were trying to do work. Create a load balancer, update some DNS records, modify an autoscaling group; that sort of thing. But they only did it to support their intended use case, without realizing that every unintended use case gets the same permissions.

Regularly Scan All Deployed Container Images

Many container image registries support scanning images when they are pushed. This is great! And the NSA/CISA Kubernetes Hardening Guidance recommends that an admission controller request a scan upon deployment. 

But what if the image stays deployed for weeks or months because the software is stable? The initial scan weeks ago may have been clean, but a new one today would show vulnerabilities. Oops.

Instead, I am a firm proponent of regular scans of all container images that are actively deployed to your Kubernetes container platforms. Automate this check by determining all deployed container image versions and scanning them daily.
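A minimal sketch of such a daily check, assuming kubectl access to the cluster and Trivy as the scanner (any scanner with a CLI works the same way):

```python
# Sketch: list every image currently running in the cluster and scan each.
import json
import subprocess

pods = json.loads(subprocess.check_output(
    ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"]
))

# Deduplicate images across Pods; init containers are left out for brevity.
images = sorted({
    container["image"]
    for pod in pods["items"]
    for container in pod["spec"]["containers"]
})

for image in images:
    print(f"=== scanning {image} ===")
    # --exit-code 1 makes the scan fail loudly when it finds something.
    subprocess.run(["trivy", "image", "--exit-code", "1",
                    "--severity", "HIGH,CRITICAL", image])
```

Run it from a daily cron job or Kubernetes CronJob and alert on failures.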

This way, you gain the benefit of up-to-date vulnerability databases even for old container images of stable software that gets infrequent updates. If the scan finds an issue in dependencies, you can rebuild the image with fresher dependencies and deploy it with (hopefully) few or no other changes, since the code itself didn’t change.

Regularly Security Test Your Entire System

Your software engineers have something external threats don’t: Access to source code. If you also arm them with the time and blessing to security test your system, magic can happen.

During a past project, I fondly remember discovering that creating a certain type of resource in an automated way brought the entire system to a screeching halt. A single laptop could successfully launch a denial-of-service attack on the entire system while doing nothing obviously malicious.

Even if your engineers are not trained in security per se, the main idea is to instill a security-first mindset. And that can be the difference between smooth sailing and a security disaster.

Have a Disaster Recovery (DR) Plan and Practice It

The number of companies I talk to that think disaster recovery only means “backups” is mind-blowing. Hint: It’s really not. Backups are necessary but not sufficient.

Recovering from a disaster means the ability to stand up your entire tech stack elsewhere within a certain time frame. And while disasters are usually thought to mean the outage of entire cloud regions, I think that a security incident definitely counts as a disaster! Since you can no longer trust your deployed applications, you need to answer this question: How quickly can you destroy your entire infrastructure and get back to where it was before the incident?

Companies that still think that DR just equals “backups” will, when asked this uncomfortable question, often admit that they don’t even regularly test restoring from those backups. If information technology is at the core of what you do, please take this aspect seriously.

Use an Intrusion Detection System (IDS) and a Security Information and Event Management (SIEM) System

The Kubernetes Hardening Guidance mentions these, but it doesn’t actually tell you what to do with them or how to use them. 

An IDS records the normal behavior of applications and constantly checks activity against these baselines. If an application starts to behave in new ways, that is a possible sign it has been exploited by a bad actor. For instance, if it starts attempting to read or write files when it usually doesn’t, that’s a pretty good sign; it’s not like it started doing that on its own!

The CNCF project Falco is here to help you follow this guidance. Specifying rules is, of course, cumbersome, but it is essential to provide the guardrails your application needs. There are community-provided rules you can start from.

In combination with, for example, Elasticsearch, Falco can inspect your audit logs and thereby act as a SIEM. I recommend using it in this way, too. If you already have a different system in place, then, by all means, use that. But use something. Many regulations these days require that you inform users about data breaches, so you really want a system that helps manage security information and events. The amount of security log data is too great to process manually, especially if you are currently under attack.
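As an illustration of the plumbing, here is a sketch of a tiny receiver that takes Falco alerts and indexes them into Elasticsearch. It assumes json_output and http_output are enabled in falco.yaml and pointed at this endpoint; in production you would more likely reach for a purpose-built forwarder such as Falcosidekick.

```python
# Sketch: receive Falco alerts over HTTP and index them into Elasticsearch
# so they are searchable alongside your other (audit) logs.
from datetime import datetime, timezone

from elasticsearch import Elasticsearch
from flask import Flask, request

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")  # assumed SIEM backend

@app.route("/", methods=["POST"])
def ingest():
    alert = request.get_json(force=True)  # one Falco alert per request
    # Daily indices keep retention management simple.
    index = f"falco-{datetime.now(timezone.utc):%Y.%m.%d}"
    es.index(index=index, document=alert)
    return "", 204

if __name__ == "__main__":
    app.run(port=2801)  # arbitrary port; match falco.yaml's http_output URL
```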

Information security is not a one-time box to tick; it’s an ongoing process. The threats are constantly evolving, so the responses must evolve as well. By putting guardrails into our platforms and constantly striving to grant our applications and servers only the least privilege they need, we can reduce our attack surface. And that is especially important in the cloud.

The inherent complexities and dynamic nature of the cloud offer many places for a bad actor to both carry out attacks and to hide. It’s up to us to limit those opportunities.


To hear more about cloud-native topics, join the Cloud Native Computing Foundation and cloud-native community at KubeCon+CloudNativeCon North America 2021 – October 11-15, 2021

Lars Larsson

Lars has worked with cloud technology since 2008, both within industry and academia. He holds a PhD in computer science and is a senior cloud architect at Elastisys. He is one of the architects of Compliant Kubernetes, a CNCF-certified Kubernetes distribution, designed and developed especially to meet the high security demands of regulated industries.
