Scaling Machine Learning Workloads Using k8s

How k8s can help organizations scale their ML and AI applications in a data-centric architecture

Big data in general, and machine learning (ML) in particular, hold great potential for harnessing creativity and insights to solve difficult problems. Yet most organizations spend little time and budget building the powerful models they need. Instead, they struggle to scale their complex infrastructures as applications change and grow. Unlike scaling web servers or distributed analytics workloads, machine learning forces users to be conscious of concerns such as GPU availability and data locality.

This article highlights the challenges involved in building applications that require using ML and artificial intelligence (AI) at scale on Kubernetes (k8s) and outlines how a change in data logistics helps overcome those challenges. Putting the focus on the data with a data-centric architecture represents a paradigm shift from the existing application-centric architecture and the different data silos it creates. This establishes a more enduring foundation that affords greater versatility and scalability for future ML-based data workloads.

If we start with k8s as a foundational piece of this data-centric architecture, end users immediately benefit: they no longer need to worry about the specific infrastructure details required to run their applications. This is step one in simplifying data logistics and getting one step closer to simplified scaling of ML.
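To make this concrete, consider scheduling a training job that needs a GPU. In this minimal sketch using the official Kubernetes Python client (the pod name, namespace and image are illustrative assumptions), the user declares only what the workload needs; k8s decides where it runs:

```python
from kubernetes import client, config

# A sketch: declare *what* the workload needs (one GPU); the
# scheduler decides *where* it runs. Names and image are assumptions.
config.load_kube_config()

container = client.V1Container(
    name="trainer",
    image="tensorflow/tensorflow:latest-gpu",
    resources=client.V1ResourceRequirements(
        # Extended resource surfaced by the NVIDIA device plugin.
        limits={"nvidia.com/gpu": "1"}
    ),
)
pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="gpu-trainer"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

No host name, GPU model or data center detail appears anywhere in the request; that is precisely the abstraction the end user benefits from.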

The Stateful Container Challenge

A major advantage of containers is that they have minimal overhead and can be started quickly and easily. A major disadvantage is that they do not contain their own data. Any data written inside a container is ephemeral; it no longer exists after the container is stopped, which effectively makes the applications stateless. In a production environment, data must outlast the applications, and therefore must be persisted somewhere outside the containers.

With k8s, persistent storage can be created and attached on demand to a container when it is deployed. Orchestrating the data layer in this fashion hides where the container will read and write its data. The Container Storage Interface (CSI) standard, which defines the interface between the container orchestration layer and the data layer, was built to create a separation between the application and the underlying infrastructure. This frees users from such concerns and lets systems administrators choose the appropriate storage facility for a given use case within a given environment. This is step two in simplifying data logistics and getting one step closer to simplified scaling of ML.
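Here is a sketch of what that looks like from the user's side: a persistent volume claim names only a storage class and a size. The class name below is a stand-in for whatever CSI driver the administrator has installed; the application never references the underlying storage system.

```python
from kubernetes import client, config

# Request persistent storage on demand. The storage class name is a
# hypothetical placeholder for the administrator's CSI driver.
config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    api_version="v1",
    kind="PersistentVolumeClaim",
    metadata=client.V1ObjectMeta(name="training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="data-platform-csi",  # hypothetical class
        resources=client.V1ResourceRequirements(
            requests={"storage": "100Gi"}
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```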

Leveraging a Data Platform for Scaling

While k8s provides the ability to scale applications up and down, it doesn't offer persistent storage; it provides the interface to plug in persistent storage, as mentioned previously. It is also important to keep in mind that k8s may start up multiple instances of a piece of software, but it does not provide the mechanisms for those instances to communicate with each other. Such a facility, for example an event streaming system, would be part of the data platform.
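As a sketch of what such instance-to-instance communication can look like, here is a producer and a consumer using the kafka-python client. The broker address and topic name are assumptions; the endpoint could be Kafka itself or any data platform that exposes the Kafka API.

```python
from kafka import KafkaProducer, KafkaConsumer

# One instance publishes an event... (broker and topic are hypothetical)
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("model-updates", b'{"model": "fraud-v2", "status": "trained"}')
producer.flush()

# ...and any other instance can react to it, with no direct connection
# between the two containers.
consumer = KafkaConsumer(
    "model-updates",
    bootstrap_servers="broker:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no messages arrive
)
for message in consumer:
    print(message.value)
```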

There are a variety of options for plugging different types of storage into k8s, but a data platform is really the ideal solution. An appropriate data platform does for data orchestration what k8s does for container orchestration. A data platform that provides all of the necessary capabilities, such as handling files, database tables and event streams, is an ideal solution for a data-centric architecture.

Those capabilities should all be provided by a single data platform, because ML has unique requirements around data management. Both static and real-time data need to be supported within the same platform so the user can benefit from features such as consistent point-in-time snapshots over all the data.
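At the k8s layer, the standard CSI snapshot API gives a flavor of how such a point-in-time copy can be requested. This sketch assumes the installed CSI driver supports snapshots, and the snapshot class name is hypothetical:

```python
from kubernetes import client, config

# Request a point-in-time snapshot of the claim created earlier via the
# standard snapshot.storage.k8s.io API (requires a snapshot-capable driver).
config.load_kube_config()

snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {"name": "training-data-snap"},
    "spec": {
        "volumeSnapshotClassName": "data-platform-snapclass",  # hypothetical
        "source": {"persistentVolumeClaimName": "training-data"},
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="snapshot.storage.k8s.io",
    version="v1",
    namespace="default",
    plural="volumesnapshots",
    body=snapshot,
)
```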

The diagram shows an example of how a data platform serves as a persistent storage layer for applications implemented in multiple containers being orchestrated by k8s. The data platform should be compute-agnostic, making it suitable for all physical and virtual server architectures. A data platform with a full complement of features makes it easier to support a broad range of containerized applications. Such a platform should be able to:

  • Provide persistent storage for both individual containers and pods of orchestrated containers, as this enables a separation of concerns for the software being orchestrated.
  • Enable containers to use the platform’s file system for all storage needs, even for applications that expect different file systems.
  • Afford a means to enforce security policies to ensure all access is authorized, including the ability to audit all data access.
  • Provide volume snapshots to enable data protection as well as to support analytics over real-time data at a given point in time.
  • Have the ability to span private, public and hybrid cloud environments, including multi-tenant and edge computing use cases.

In a typical application, volumes in the data platform are mounted by k8s or some other means for access by containers. As new containers are deployed, additional data volumes can be created and retained, even as containers are deleted. These and other capabilities make it possible for container-based applications to scale without any disruption to the existing applications or infrastructure. This is step three in simplifying data logistics and getting one step closer to simplified scaling of ML.
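Continuing the sketch with the Python client and the claim created earlier (the image and paths are illustrative assumptions), mounting looks like this: the container sees an ordinary directory, and where the bytes actually live is the data platform's concern.

```python
from kubernetes import client, config

config.load_kube_config()

# Wire the earlier claim into a pod as a plain directory at /mnt/data.
volume = client.V1Volume(
    name="data",
    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
        claim_name="training-data"
    ),
)
container = client.V1Container(
    name="app",
    image="python:3.11",  # hypothetical application image
    # The application uses ordinary file I/O; the file is hypothetical.
    command=["python", "-c", "print(open('/mnt/data/ready.txt').read())"],
    volume_mounts=[client.V1VolumeMount(name="data", mount_path="/mnt/data")],
)
pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="data-consumer"),
    spec=client.V1PodSpec(
        containers=[container], volumes=[volume], restart_policy="Never"
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Deleting this pod leaves the claim, and the data behind it, untouched; a replacement pod can mount the same volume and pick up where the old one left off.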

Tying Everything Together

A robust data platform will support a wide range of industry-standard protocols and application programming interfaces (APIs), such as POSIX, NFS, S3, JSON, OJAI, Kafka and even HDFS. Support for these APIs is important because even the simplest of ML use cases requires a variety of tools to solve the problem. With HDFS available, Apache Spark can be leveraged. With the Kafka API, event streams and real-time streaming can be delivered. Perhaps most important is standard file system access via NFS and POSIX: the great tools being created, such as MXNet, CNTK and TensorFlow, among others, all require access to standard file systems.
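That last point is easy to demonstrate. Because a platform volume is mounted at an ordinary POSIX path, standard ML tooling needs no special connector; the path and file pattern in this TensorFlow sketch are hypothetical:

```python
import tensorflow as tf

# Plain file-system access: the training data lives on the data platform,
# but TensorFlow just sees files under a mounted POSIX path.
files = tf.data.Dataset.list_files("/mnt/data/train/*.tfrecord")
dataset = tf.data.TFRecordDataset(files).batch(32)

for batch in dataset.take(1):
    print(batch.shape)  # one batch of raw serialized records
```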

The real benefit of broad API support is that it reduces the amount of data copying and cuts as much latency as possible out of the system. Moving terabytes of data between systems takes a tremendous amount of time, and it is a barrier to scaling an ML-based system.

With a broad set of APIs, access to data (orchestrated by the data platform) and access to compute (orchestrated by k8s), the data scientist likely needs only one more critical piece of infrastructure to simplify their job: something that creates repeatability.

Workflow management is a critical component of a data scientist's success, and a tool like the open source Kubeflow is a great solution. It plugs directly into k8s: a workflow can be built and run locally or in a development environment, then moved and run unchanged in any other k8s-based environment. Each environment (development, testing, production) may look radically different from the others, so this level of portability is key.
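Here is a minimal sketch using the Kubeflow Pipelines SDK (kfp); the component body and data path are placeholder assumptions. The compiled pipeline file is what travels between environments:

```python
from kfp import dsl, compiler

# A trivial pipeline component; it runs in its own container on k8s.
@dsl.component
def train_model(data_path: str) -> str:
    # Placeholder training step; a real component would read from the
    # shared data platform volume at data_path (hypothetical).
    print(f"training on {data_path}")
    return "model-v1"

@dsl.pipeline(name="ml-workflow")
def ml_workflow():
    train_model(data_path="/mnt/data/train")

# Compile once; the same definition can be submitted to any Kubeflow
# installation, from a laptop cluster to production.
compiler.Compiler().compile(ml_workflow, "ml_workflow.yaml")
```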

Machine learning workloads are becoming increasingly popular, and organizations are increasingly turning to a combination of Kubernetes, containers and a data platform to break down the barriers to scaling those workloads to meet the most difficult demands of the business.

Jim Scott

James A. Scott (prefers to go by Jim) is Director of Enterprise Architecture at MapR Technologies. Jim helped build the Apache Hadoop community in Chicago as cofounder of the Chicago Hadoop Users Group. He has implemented big data and real-time streaming solutions at a number of different companies, supporting enterprise use cases ranging from industrial manufacturing, geographical information systems and digital advertising to full data center monitoring and general data processing. Jim was formerly the SVP of Information Technology and Operations at SPINS, the leading provider of retail consumer insights, analytics reporting and consulting services for the natural and organic products industry. Additionally, Jim served as Lead Engineer/Architect for Conversant (formerly Dotomi), one of the world's largest and most diversified digital marketing companies, and also held software architecture positions at several companies including Aircell, NAVTEQ, and Dow Chemical. Jim speaks at many industry events around the world on big data technologies and enterprise architecture. When he's not solving business problems with technology, Jim enjoys cooking, watching (and quoting) movies and spending time with his wife and kids.
