Artificial intelligence (AI) is a huge, divisive topic. We often hear about it in the mainstream media as if it’s a conscious, evil provocateur, like a Skynet about to control all of humanity. However, the actual use cases, like self-driving cars and robotics, are a bit less threatening. And when it comes to our digital devices, AI is already embedded in most of the products we use today—for example, our phones and applications already use AI for things like spell checking, noise canceling and face detection. Banks now use AI to help detect fraud, and health care providers are adopting AI to make MRI scans faster and cheaper, improving patient outcomes.
Simultaneously, most companies are making the shift toward cloud-native technologies. This new software stack adopts containerization and leverages open source tools like Docker and Kubernetes to increase the agility, performance and scalability of digital services. Even highly regulated industries are going cloud-native.
While maintaining cloud-native technology may begin easily, the burden can quickly balloon as multi-cluster, multi-cloud deployments materialize. And to optimize their products and software release processes, organizations will often seek to run more complex workloads and derive insights from their data.
I recently met with Tobi Knaup, CEO & Co-Founder of D2iQ, to explore the current and future role of AI within the cloud-native stack. According to Knaup, “the companies that can figure out how to leverage AI in their products will be the leaders of tomorrow.” Below, we’ll explore two major use cases for AI in cloud-native: using cloud-native technology to host and run AI/ML computation, and using AI to improve how cloud-native architecture is managed.
Using AI and Cloud-Native Architecture
There are many advantages to running AI with cloud-native tools. One benefit of using Kubernetes is that it can have a centralizing effect—it makes sense to run related components, such as microservices, data services and AI components, within the same platform. “Kubernetes is a fantastic platform for running AI workloads,” said Knaup. “You need a smart cloud-native platform for running these AI/ML workloads—a lot of the AI problems have been solved in cloud-native.”
Another critical challenge AI/ML projects face is figuring out Day 2 operations. While companies may have many data science experts to build and train models, actively deploying and running those models is an entirely different story. This lack of understanding could be why 85% of AI projects ultimately fail to deliver on their intended business promises. Cloud-native tech like Kubernetes provides a means to actively run these models as an online service that contributes value to the mission-critical product, says Knaup.
Benefits of Running AI With Cloud-Native Components
AI/ML and cloud-native have similar deployment patterns. The AI/ML field is still relatively young. As it turns out, many of the best practices that DevOps has established around cloud-native can also apply to AI/ML. For example, CI/CD, observability and blue-green deployment map nicely onto the particular needs of AI/ML. “You can build a very similar delivery pipeline for AI/ML as you would for microservices,” said Knaup. This is another reason why running K8s for such workloads makes sense.
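To make the parallel concrete, here is a minimal sketch of the kind of promotion gate a blue-green model rollout might use, mirroring the checks a microservice CI/CD pipeline runs before shipping a new version. All function names, models and thresholds here are illustrative, not drawn from any specific tool:

```python
# A toy blue-green promotion gate for an ML model: only promote the
# candidate ("green") model if it matches or beats the live ("blue")
# model on a held-out test set. Names and data are illustrative.

def evaluate(model, test_set):
    """Stand-in for an offline evaluation step; returns accuracy in [0, 1]."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    return correct / len(test_set)

def should_promote(candidate, current, test_set, min_gain=0.0):
    """Gate the rollout on a measurable quality bar, just as a service
    pipeline gates on passing tests."""
    return evaluate(candidate, test_set) >= evaluate(current, test_set) + min_gain

# Toy example: models are plain callables, data is (input, label) pairs.
blue = lambda x: x > 0    # current model
green = lambda x: x >= 0  # candidate model
data = [(-1, False), (0, True), (1, True)]

print(should_promote(green, blue, data))  # -> True: green wins on the x == 0 case
```

In a real pipeline the evaluation step would pull a versioned test set and the promotion step would flip traffic between two deployments, but the gating logic follows the same shape.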
Cloud-native brings elasticity and resource allocation for AI. AI/ML tends to require very elastic computing—as you train a model on a new data set, the process can become quite resource-heavy and saturate GPUs. And if many data scientists are building models and competing for resources, you need a smart method to allocate resources and storage. Cloud-native schedulers can solve this issue by allocating resources intelligently. Some toolsets, like Fluid and Volcano, are explicitly designed for AI/ML scenarios.
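As a rough illustration of how that allocation works, here is a Kubernetes Pod manifest for a training job, expressed as a Python dict. The scheduler uses the declared requests and limits to place the workload on a node with free capacity; the image and names are placeholders:

```python
# A minimal sketch of a Kubernetes Pod manifest requesting a GPU for a
# training job. NVIDIA GPUs are requested via the nvidia.com/gpu extended
# resource; for GPUs, limits and requests must be equal, so only limits
# are specified here. Image name and metadata are placeholders.
import json

training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "example.com/trainer:latest",  # placeholder image
            "resources": {
                "requests": {"cpu": "4", "memory": "16Gi"},
                "limits": {"nvidia.com/gpu": "1"},
            },
        }],
    },
}

# Serialize and apply with kubectl or a client library.
print(json.dumps(training_pod["spec"]["containers"][0]["resources"], indent=2))
```

Schedulers like Volcano build on the same resource declarations, adding batch-oriented features such as gang scheduling and queueing that suit multi-user training clusters.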
You reap the agility of open source. Open source cloud-native projects tend to move very quickly when the community works together. This is similar to the activity around AI/ML open source tools like Jupyter Notebook, Torch or TensorFlow, which are cloud- and Kubernetes-native. Though there are concerns about the security of open source software, at the end of the day, the more eyes we have on open source, the better. “Since AI will be built into so many things, we’re going to need to be able to scrutinize what decisions AI makes,” explains Knaup.
Cloud-native doesn’t mean cloud-dependent. First, a machine learning model must be trained on a large data set. It’s typically far more cost-efficient to run the heavy number-crunching AI on-premises than in the cloud. But, after these models are trained, organizations will likely want to perform inferences on the edge, closer to where new data is being ingested. Kubernetes is great in this regard, as it’s flexible enough to run in these different operating environments.
“Data has gravity,” Knaup says, and the compute should follow it. Using K8s as an abstraction layer, you can architect it once and run it in any environment, whether in a security camera system, on the manufacturing floor or even aboard F-16 fighter jets.
Using AI/ML to Help Improve Cloud-Native
On the other hand, there are plenty of ways artificial intelligence could help manage and optimize cloud-native technology. “You can make an endless list,” says Knaup.
Using AI to automate root cause analysis. First, AI could help human operators diagnose problems with their cloud-native tools more efficiently. Kubernetes is quite complex and might integrate with many other components, such as a service mesh for ingress control or OPA for policy management.
When a failure occurs in such a complex distributed system, it’s often challenging to piece together the root cause of the problem. Engineers must wrangle metrics and logs from many sources to debug the issue. In doing so, they often follow a similar set of patterns in aggregating this data. Using AI to find these patterns could help human operators diagnose problems more effectively. This would speed up time to resolution, which would, in turn, improve overall availability and security.
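One of the patterns engineers follow by hand is merging error events from several telemetry sources and treating the earliest event in each burst as the likely trigger. A minimal sketch of that heuristic, with entirely illustrative event data, might look like this:

```python
# Merge error events from several sources, cluster events that occur close
# together in time, and surface the earliest event in each burst as the
# candidate root cause. All event data here is made up for illustration.

def correlate(events, window=5.0):
    """events: list of (timestamp_seconds, source, message).
    Returns one event per burst: the earliest event in each cluster of
    events separated by no more than `window` seconds."""
    roots = []
    last_ts = None
    for ts, source, msg in sorted(events):
        if last_ts is None or ts - last_ts > window:
            roots.append((ts, source, msg))  # start of a new burst
        last_ts = ts
    return roots

events = [
    (100.0, "mesh", "upstream connect error"),
    (101.5, "app",  "HTTP 503 from checkout service"),
    (99.2,  "k8s",  "pod checkout-7d9 evicted: node memory pressure"),
    (300.0, "app",  "slow query warning"),
]

for ts, source, msg in correlate(events):
    print(f"{ts}: [{source}] {msg}")
```

Here the pod eviction surfaces as the first event of the burst that also contains the mesh and application errors. An AI-assisted tool would learn far richer correlations than simple time ordering, but the goal is the same: compress many signals into a small set of candidate causes.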
Using AI to predict and prevent issues. Another prospect is using AI to detect and prevent problems entirely. In marketing, it’s common to use end-user data to inform predictive analytics. But, if we applied predictive analytics to cloud-native statistics, what valuable data could we uncover? For example, suppose a monitoring tool can predict that, based on past usage, a specific disk will be 80% full in four hours. Platform engineers could thus make the appropriate changes with ample time to avoid any service interruption. Such predictive service level indicators could become another helpful benchmark for SREs.
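The simplest version of that prediction is a linear extrapolation over recent usage samples. A production tool would use more robust forecasting, and the numbers below are made up, but the sketch shows the idea behind such a predictive indicator:

```python
# Fit a least-squares line to recent disk-usage samples and estimate how
# long until usage crosses a threshold. Sample data is illustrative.

def hours_until(samples, threshold=80.0):
    """samples: list of (hour, percent_used), assumed roughly linear.
    Returns hours from the last sample until `threshold` percent is
    reached, or None if usage is flat or falling."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = num / den  # percent per hour
    if slope <= 0:
        return None
    last_t, last_u = samples[-1]
    return (threshold - last_u) / slope

# Usage grew 5 points per hour over the last four samples, now at 60%:
usage = [(0, 45.0), (1, 50.0), (2, 55.0), (3, 60.0)]
print(hours_until(usage))  # -> 4.0 hours until the disk hits 80%
```

Exposing this kind of time-to-threshold figure, rather than just the current usage percentage, is what would turn an ordinary metric into the predictive service level indicator described above.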
Using AI for performance optimization. There’s ample room for AI to suggest performance optimizations to fine-tune how cloud-native infrastructure runs. The results could inform what knobs to tune to adjust computational efficiencies or how to best schedule machine learning workloads.
When we consider AI/ML and cloud-native, it’s a win-win. Cloud-native technology can support the goals of AI/ML in terms of elasticity, scalability and performance. Simultaneously, there are many benefits that AI can bring to optimize how cloud-native architecture is maintained.
AI is a burgeoning field, with thousands of algorithms now in the open source realm. TensorFlow Hub alone has hundreds of free, open source machine learning models for working with text, images, audio and video. This is why Knaup recommends betting on an open source strategy for AI.
However, working successfully with AI will boil down to discovering the suitable algorithm for your use case. While there are a relatively small number of algorithm classifications, finding which is best for your problem and applying it to your situation requires domain expertise, explains Knaup. “You need to understand the problem space and how to apply those best-of-breed AI algorithms,” he said.