Overcoming Challenges of Air-Gapped Kubernetes

Military and intelligence agencies, hospitals and corporations all deploy air-gapped environments to protect their sensitive information from breaches and theft. An air-gapped environment aims to isolate and limit access to classified and sensitive data.

An air-gapped environment can be as simple as five PCs in a room connected only to one another, whereas air-gapped environments like the U.S. Department of Defense Secret Internet Protocol Router Network (SIPRNet) are composed of more elaborate architectures.

What air-gapped deployments all have in common is limiting access to their data, which typically means removing access to the internet. However, this type of isolation presents challenges in a Kubernetes environment that was designed to be networked.

In a connected Kubernetes environment, you obtain images and software in a number of ways, including pulling images from Docker Hub, running apt updates under sudo, and downloading files with Wget or from GitHub. These methods expect and depend on a rich support infrastructure. In an air-gapped environment, this support infrastructure disappears. This means the alternative support infrastructure for your clusters needs to be well thought out and carefully architected.

When you remove the connection to the internet, you create a host of challenges. How you approach these challenges dictates the architectural decisions involved in designing your air-gapped environment, which include:

  • Transfer process
  • Migration cluster
  • Image repository
  • Helm and code repository
  • Network boundaries and ingress
  • Data processing
  • Documentation

Transfer Process

How you transfer data into your air-gapped system is a key consideration in how you architect the system. Are you operating in a virtual private cloud (VPC) with a virtual private network (VPN) connection, or are you burning disks to insert into a PC?

The variables of the transfer process dictate the air-gapped infrastructure you will deploy. Beyond Day 1, when you get your system up and running, what is the speed and volume of data you will be bringing into your cluster over time? Images can be many hundreds of gigabytes and Kubernetes updates occur every few months. You will also need to update your applications to patch security vulnerabilities and deliver new features. All these considerations will dictate how you structure your air-gapped cluster.
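To make the speed-and-volume question concrete, here is a minimal back-of-the-envelope sketch. The function name and the sample figures are illustrative assumptions, not measurements from any real deployment, and the calculation ignores protocol overhead, scanning and approval time.

```python
def transfer_hours(volume_gb: float, throughput_mbps: float) -> float:
    """Estimate hours needed to move volume_gb gigabytes over a link
    that sustains throughput_mbps megabits per second.

    Uses decimal units (1 GB = 8,000 megabits) and ignores overhead.
    """
    if throughput_mbps <= 0:
        raise ValueError("throughput_mbps must be positive")
    megabits = volume_gb * 8 * 1000
    seconds = megabits / throughput_mbps
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: a 500 GB image bundle over a 100 Mbps link.
    print(f"{transfer_hours(500, 100):.1f} hours")
```

Running the same estimate against your real update cadence gives a quick sense of whether a given transfer path can keep up with the cluster over time.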

Organizational issues are another factor that will inform how you structure your air-gapped environment. Are you developing your applications in-house, using outside contractors or a combination of both? The workflow and integration of the people and processes must be accommodated.

Burning disks and handing them off for approval could be a two- or three-day manual process. That becomes laborious if you are moving gigabytes of data and updating applications regularly.

A more sophisticated transfer process enables you to automate and streamline your processes. For example, you can run your transfer through a file transfer protocol (FTP) server, which enables the transfer of files from one computer to another. The data can be security scanned and placed in a folder.
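One piece of such an automated pipeline can be sketched as follows: recording a checksum for every file before the transfer and verifying it on the air-gapped side. This is only an illustration under assumptions. The function names and the directory layout are hypothetical, and integrity checking complements a security scan rather than replacing it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to bundle_dir) to its digest.

    Run this on the connected side before the transfer.
    """
    return {
        p.relative_to(bundle_dir).as_posix(): sha256_of(p)
        for p in sorted(bundle_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(bundle_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose digest differs.

    Run this on the air-gapped side after the transfer; an empty list
    means every file arrived unchanged.
    """
    failures = []
    for rel_path, expected in manifest.items():
        target = bundle_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

The manifest itself would travel with the bundle, ideally signed or delivered through a separate channel so that it cannot be altered along with the files.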

The transfer process is a core component that defines the limits of an air-gapped system. It is an important issue that must be resolved upfront because it will significantly impact the life cycle of the entire operation. The speed of your transfer process will affect how quickly and efficiently work is performed and updates are made.

Migration Cluster

A way to assist the transfer process is to deploy a migration cluster, which is a separate cluster that mirrors your air-gapped cluster but is connected to the internet. This will help you work through bugs and accelerate the speed of delivery.

The migration cluster can be your first stage of development. It also can be the system on which your contractors and other workers can perform their tasks without having direct access to the air-gapped environment. The migration cluster can be used for training, upgrade testing and application delivery. It also provides an extra measure of security by not giving personnel direct access to sensitive air-gapped data.

For example, one of our customers has its environment running on Amazon Web Services (AWS), but also on fleets of ships on the ocean. At sea, the company has an internet connection, but it does not want its cluster to depend on that connection because its customers are using it. To deliver this capability, the company starts with an internet-connected migration cluster, performs all the testing and development in that cluster, then packages the result and transfers it to the air-gapped system.

Image Repository

Although you can implement an air-gapped system without an image repository, you won’t get very far. The image repository is one of the core components, because it is the source from which Kubernetes pulls container images.
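In practice, using an internal repository means every image reference in your manifests has to point at it instead of a public registry. The sketch below shows one possible rewriting rule. The host name registry.internal.example is a placeholder, and dropping the original registry host from the path is an assumption; some teams keep it as a project prefix to avoid name collisions.

```python
def to_internal_ref(image: str, internal_registry: str) -> str:
    """Rewrite a public image reference to point at an internal registry.

    Follows the usual Docker convention: the first path component is a
    registry host only if it contains "." or ":" or equals "localhost";
    a bare name such as "nginx" lives under the "library/" namespace.
    """
    first, sep, rest = image.partition("/")
    has_registry = bool(sep) and (
        "." in first or ":" in first or first == "localhost"
    )
    if has_registry:
        path = rest                  # strip the public registry host
    elif sep:
        path = image                 # e.g. "bitnami/redis:7"
    else:
        path = f"library/{image}"    # e.g. "nginx:1.25"
    return f"{internal_registry}/{path}"


if __name__ == "__main__":
    for ref in ("nginx:1.25", "quay.io/prometheus/node-exporter:v1.8.0"):
        print(to_internal_ref(ref, "registry.internal.example"))
```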

Also, by backing up the image repository, you can back up all the containers that perform work on your system, and you can deploy Kubernetes as infrastructure-as-code (IaC). For disaster recovery, it becomes your first line of defense. Once you have your images, you can do a backup of a Kubernetes cluster and all the configurations that were deployed to it. This enables you to easily reconstitute an entire cluster and operating system.

Your image repository also enables you to perform security scans of all the images you bring in, which can strengthen your supply-chain security. Image repository technologies such as Harbor, Nexus and Artifactory provide some form of image scanning. Security also is strengthened by being able to control who loads images into the cluster.

Because the image repository is a critical core component, it should be designed to ensure the highest availability. The image repository should be the baseline component against which you calibrate availability.

Helm and Code Repository

Helm is the de facto package manager for Kubernetes. Helm makes it easy to install, update and remove applications and services that are highly repeatable or used in multiple scenarios in a typical Kubernetes cluster.

Helm charts can be stored in a Git-based code repository. This enables you to store code with which you can rebuild containers on the system.

For automating and streamlining your operation, hosting a code repository is an important element for enabling high-speed DevOps. Once you put a code repository on your air-gapped system, you no longer have to perform all of your development externally on the internet and bring the data into the air-gapped system.

Having a code repository also enables you to perform development on your air-gapped system. The code repository does not have the same high availability requirements as the image repository, so you accrue a good deal of value without as much overhead.

Network Boundaries and Ingress

In an air-gapped environment, there will be network boundaries, including some form of firewalls or blocklists. The air-gapped system must have a way of bringing data in and out. Most air-gapped deployments will have a connection to other systems. Thus, you must consider the network boundaries you will set up for your cluster and how you expose your cluster to your larger IT system.
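Inside the cluster, one way to express such a boundary is a Kubernetes NetworkPolicy that limits where pods may send traffic. The sketch below builds one as a plain Python dictionary. The policy name, namespace and CIDR are illustrative assumptions, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```python
import json


def egress_allowlist_policy(namespace: str, allowed_cidrs: list[str]) -> dict:
    """Build a NetworkPolicy that restricts egress for all pods in a namespace.

    Traffic is allowed only to the listed CIDR blocks. With no CIDRs the
    egress rule list is left empty, which denies all egress; an egress
    rule with an empty "to" list would instead allow every destination.
    """
    if allowed_cidrs:
        egress = [{"to": [{"ipBlock": {"cidr": c}} for c in allowed_cidrs]}]
    else:
        egress = []
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "restrict-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},           # empty selector: every pod
            "policyTypes": ["Egress"],
            "egress": egress,
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace and internal address range.
    policy = egress_allowlist_policy("apps", ["10.20.0.0/16"])
    print(json.dumps(policy, indent=2))  # JSON can be passed to kubectl apply -f
```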

Because most air-gapped systems operate on hardware on-premises, the manner in which data is brought in and out of the system is an important consideration. There are various schemes that can be employed, including ways to mimic the load balancing and limited ingress of a cloud-native environment.

Data Processing

Once you have your data in your air-gapped system, how are you processing it? The processing power of the machines must be up to the task. Because you are disconnected from the internet and do not have the auto-scaling capabilities of a connected cloud service, your on-premises setup must be able to accommodate the workload.

Kubernetes lets you manage graphics processing units (GPUs) across multiple nodes. You need to properly load balance the utilization of the CPUs and GPUs involved in workload processing to avoid bottlenecks. For AI applications, the on-premises infrastructure must have the ability to process large AI models adequately.
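As a small illustration of how a workload asks the scheduler for GPUs, the sketch below builds a Pod manifest that requests them through the nvidia.com/gpu extended resource. It assumes NVIDIA hardware with the vendor's device plugin installed on the nodes; the pod name and image reference are hypothetical.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that requests a whole number of GPUs.

    GPUs are requested under resources.limits; the scheduler places the
    pod only on a node that has that many unallocated GPUs.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image hosted in the internal repository.
    pod = gpu_pod_manifest(
        "train-job", "registry.internal.example/ml/trainer:1.0", gpus=2
    )
    print(json.dumps(pod, indent=2))
```

Because an on-premises cluster has a fixed number of GPUs, pods that request more than any node can offer will stay pending, which is one early signal that capacity planning needs attention.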

Documentation Dilemma

Another consideration is the documentation in an air-gapped environment. Most of the documentation related to Kubernetes is online and you need a connected network to read the documents. You should plan to print or download the documentation before you start deploying clusters in an air-gapped environment.

Some organizations, like D2iQ, have addressed this problem by making air-gapped documentation downloadable as a PDF. This enables customers to refer to the documentation in secure environments that do not have access to the external network.

Putting it All Together

Kubernetes is designed for a connected network in which cluster provisioning and application deployment pull container images from publicly hosted container registries over the internet. However, while this process eases Kubernetes management under networked conditions, it proves to be the main challenge for an air-gapped environment.

In an air-gapped environment, you need to host your own container registry, secure it and make it highly available and fault tolerant. This is no easy task and adds to the operational burden. On top of that, you need to manage the cluster life cycle, including cluster creation, cluster updates and application life cycle through the container registry. Kubernetes does not make it easy because the configuration is hidden deep in the container runtime; you need to have a deep understanding of Kubernetes to be able to do that successfully.
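As one example of that runtime-level configuration, containerd can be told to pull from an internal mirror through a per-registry hosts.toml file. The sketch below only renders such a file. It assumes containerd is the runtime and that its registry config_path is set to a directory such as /etc/containerd/certs.d; the mirror URL is a placeholder, and other runtimes are configured differently.

```python
def render_hosts_toml(upstream: str, mirror: str) -> str:
    """Render a containerd hosts.toml that sends pulls to a mirror.

    upstream is the registry the image names refer to, and mirror is
    the internal registry that actually serves the content.
    """
    return (
        f'server = "{upstream}"\n'
        "\n"
        f'[host."{mirror}"]\n'
        '  capabilities = ["pull", "resolve"]\n'
    )


if __name__ == "__main__":
    # Would typically be written to <config_path>/docker.io/hosts.toml.
    print(render_hosts_toml("https://docker.io",
                            "https://registry.internal.example"))
```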

To manage an air-gapped system successfully, you have to have an overall understanding of the system, including where you have access and where you don’t, and what you can and cannot use. You then have to design the system with all these considerations in mind.

You can get around some of these problems with makeshift fixes, but eventually, you will be stuck in a vicious loop of always trying to work around the problems. These headaches can be avoided by designing your system to handle the problems from the beginning.

Deepak Goel

Deepak Goel is chief technology officer at D2iQ.
