Assessing the True Cost of Kubernetes

In this episode of The View with Vizard, Mike Vizard talks with D2iQ CTO Deepak Goel who explains how to assess the true total cost of Kubernetes. The video is below, followed by a transcript of the conversation.

Announcer: This is Digital Anarchist.

Mike Vizard: Hey, guys. Thanks for the throw again. We’re here with Deepak Goel who’s the CTO for D2iQ. Deepak, welcome to the show.

Deepak Goel: Thanks, Mike. Thanks for having me.

Vizard: You guys are specialists in managing Kubernetes clusters, and it seems like a lot of folks now have at least one or two of these clusters running, and I think they’re running into some issues, and one of those issues is cost. So is the cost of running Kubernetes starting to climb faster than people anticipated?

Goel: Yeah. That indicates that the adoption of Kubernetes is growing, so I’m not too worried about rising cost if it’s simply increasing along with the usage or adoption of Kubernetes. However, what we have also seen is that some of that cost is unexpected, and there are various reasons for that.

Vizard: Well, let’s get into some of those reasons. I mean, is it just waste, is it inefficient use of the underlying infrastructure? What are people doing? Originally I thought most people were going to have one or two large clusters, and now some of them are managing fleets of clusters. So what are the underlying management challenges?

Goel: Yeah. There are many, and some of it is also related to the mindset we have been carrying over from traditional datacenter times. If you take a step back, in infrastructure we have always been biased towards overprovisioning, because back in the datacenter days the procurement cycle was hard and very slow. So you used to procure more than you expected to need, just to keep development and innovation unblocked.

But when we moved to the cloud, some of that mindset still exists, which is to say we don’t know how many resources we will need, so let’s overprovision those resources in the cloud, even though the cloud provides the elasticity for you to scale down, something like a pay-as-you-go model. Even though we are multiple years into Kubernetes, industries are only now getting comfortable with it. And so some of the constructs that allow you to use that elasticity in Kubernetes haven’t been baked into regular workflows. That also traditionally comes from the way things have been divided.

We have seen that most organizations have a centralized IT team that’s responsible for creating the infrastructure, or at least the environment. Initially it was on prem, in datacenters: they were responsible for provisioning the machines, they were responsible for setting up the network and storage, and then they used to hand this infrastructure over to the development team for their day-to-day innovation or development. Now with cloud it’s the same structure with the same mindset, where we have a cloud team that does the provisioning in the cloud and then hands over this infrastructure to the development team.

And so the disconnect between these two teams leads to some of the cost leak, I would say. But I would categorize these costs as intentional costs, because you are intentional about overprovisioning your infrastructure. With Kubernetes, however, what we have also seen is unintentional cost, which is when you’re doing a cleanup and certain resources are left behind, and those resources quietly accumulate cost. That’s primarily because there’s a disparity in tooling currently. Tooling is much more biased towards creating an environment, because as a developer you first want to make it smooth to create an environment. However, the destruction of that environment isn’t that smooth.

So certain resources get left behind, and frankly speaking, it’s a difficult problem to tear something down, especially if you’re dealing with storage. Whether that storage needs to be deleted or not depends pretty much on the workflow or the use case of that storage. Similarly, there are other resources that can be blindly taken out at destruction time. So the cleanup of resources depends very much on an understanding of the workflow as well as of the environment: whether it was a production environment, whether it was a temporary dev environment, whether it was running a [inaudible] workload, or things like that.

So if you say the cloud team could just [inaudible] the resources, that may not always be possible, because they need to understand what the use case of those resources was. So there is some overprovisioning, as you were saying, and then there are some resources left behind that are causing an unexpected bump in the cost.
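
To make the “left behind resources” point concrete, here is a minimal sketch, assuming the official Kubernetes Python client and an existing kubeconfig, that flags PersistentVolumes stuck in the Released phase, one common kind of leftover that keeps accruing storage cost after an environment is torn down. Whether any of them can actually be deleted still depends on the workflow and environment knowledge Goel describes.

```python
# Minimal sketch (illustrative only, not D2iQ tooling): list PersistentVolumes
# that are no longer bound to any claim and may be quietly accruing storage cost.
# Assumes the `kubernetes` client library is installed and a kubeconfig is available.
from kubernetes import client, config

def find_released_volumes():
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    core = client.CoreV1Api()
    leftovers = []
    for pv in core.list_persistent_volume().items:
        # "Released" means the claim was deleted but the volume (and its data) remains.
        if pv.status.phase == "Released":
            leftovers.append((pv.metadata.name, (pv.spec.capacity or {}).get("storage")))
    return leftovers

if __name__ == "__main__":
    for name, size in find_released_volumes():
        # Whether these can actually be deleted depends on the workload and the
        # environment, exactly the judgment call described above.
        print(f"Released volume: {name} ({size})")
```
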

Vizard: So it sounds twofold. There’s not always a great appreciation for what it means to orchestrate, because you can scale up and scale down, so I don’t have to overprovision, but then when I do scale down, like most things with Kubernetes, it’s not as straightforward as we’d like to think, and that winds up with a lot of waste because there’s a lot of stuff that hasn’t been cleaned up. Is that about right?

Goel: Yeah. That’s about right. There I would make a comment that Kubernetes has enough constructs, like, as you said, we have the horizontal pod autoscaler and we have the vertical pod autoscaler, so there are enough constructs in Kubernetes that allow you to leverage the elasticity in the cloud, and so you don’t necessarily have to overprovision. However, at D2iQ we haven’t seen widespread adoption of those constructs, and so that leads to overprovisioning. And as you said, Kubernetes is also very complex, especially when it is interacting with cloud constructs or cloud services, and many times Kubernetes experts, even though they know the resources in Kubernetes in depth, don’t know what those resources could cost.

Somebody needs to also understand how the cost is being charged in a particular environment. For example, there’s a network cost, which is measured by the amount of data transferred, but it’s not as simple as just saying network cost, because sometimes if the transfer is happening within a region there’s no charge, whereas it’s different if it is going to another region. So there are many minute details of cost analysis which many times are not available to one team; it’s distributed or tribal knowledge spread across different teams.
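
For readers unfamiliar with the autoscaling constructs Goel mentions, here is a minimal sketch, assuming the official Kubernetes Python client and a hypothetical Deployment named `web`, of attaching a HorizontalPodAutoscaler so replica counts follow load instead of being overprovisioned up front.

```python
# Illustrative sketch only: attach a HorizontalPodAutoscaler (autoscaling/v1) to a
# hypothetical Deployment named "web" so replica count follows CPU utilization.
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"
        ),
        min_replicas=2,                       # floor for availability
        max_replicas=10,                      # ceiling instead of permanent overprovisioning
        target_cpu_utilization_percentage=70  # scale out when average CPU exceeds 70%
    ),
)

autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="default", body=hpa)
```
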

Vizard: Right. And of course if somebody else is doing it, it must be free, right? Let me ask you, who is managing Kubernetes clusters in these environments? You hear a lot about SREs, but as far as I know they’re kind of hard to find and even harder to retain. So are we making Kubernetes accessible to mere IT administrator mortals? Is that something they can do now, or where are we on this journey?

Goel: Yeah. That’s a good question. It’s all over the place at the moment, I would say, and there’s no fixed team structure that we have seen in our experience at D2iQ. In some cases there’s a special team that is maintaining Kubernetes. That team becomes the go-to team for any Kubernetes-related queries in the organization, but then that team doesn’t necessarily have the cloud expertise.

So you have a team that knows a given cloud environment inside and out, but they may not have Kubernetes experience. Similarly, you would have a Kubernetes team that does not necessarily have all the cloud expertise. And this structure sometimes causes challenges like this unexpected cost.

Vizard: Is this going to get more complex as we start to roll out more of the stateful applications on Kubernetes clusters? Because now it seems like there’s more data involved. Stateless apps, of course, they store data somewhere else and somebody else manages that, but now is that all converging together and what does that mean for the management challenges ahead?

Goel: Yeah, you are right. If it keeps going without the necessary tooling around these complexities, tooling which could abstract them out, then we would see a rise in unexpected cost. However, at D2iQ we are very much focused on a similar thought process: how we can make it easier for our customers, with the necessary tooling, so that these things are taken care of by themselves. If I give you an example from programming languages, it’s the same problem as garbage collection, where memory leaks used to happen. When we started with the earlier programming languages, garbage collection was a big thing.

But as programming languages matured, almost every new programming language came with its own garbage collector, and we have matured in that area. I expect there will be similar maturity in Kubernetes, where as things become more complex there’s the necessary tooling that abstracts those complexities from the end user and takes care of them.

Vizard: We hear a lot, of course, about DevOps all the time, and it feels like Kubernetes is frequently managed in the context of a DevOps workflow, but now we also hear a lot about GitOps. So where do you think GitOps fits in this whole workflow, and what is different about GitOps versus what we think about with DevOps in the context of Kubernetes?

Goel: Yeah, I would say GitOps is one of many implementations to get you to DevOps. As you would say, DevOps is all about automation and how things can be automated and repeated in any given environment, and Git, or GitOps, becomes a method or a tool to achieve that goal. Now where Kubernetes helps is in being declarative-first. If you look at all the APIs, the way all the tools interact with the API server, they are all declarative, and that’s what makes it very well aligned with GitOps and with bringing in the necessary automation. And so the more people move towards a declarative workflow, the more they will be able to utilize GitOps and the automation, so it helps fuel DevOps.
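
As a rough illustration of the declarative loop behind GitOps, here is a toy reconcile script, assuming a hypothetical local clone of a config repository and the standard git and kubectl CLIs; production setups would use a dedicated controller such as Argo CD or Flux rather than a script like this.

```python
# Toy GitOps-style reconcile loop (illustration only). Desired state lives in Git
# as declarative manifests; the loop pulls the repo and re-applies it, letting the
# Kubernetes API server converge the cluster toward that state.
import subprocess
import time

REPO_DIR = "/srv/cluster-config"        # hypothetical local clone of the config repo
MANIFEST_DIR = f"{REPO_DIR}/manifests"  # hypothetical directory of YAML manifests

def reconcile_once():
    # Fetch the latest declared state from Git.
    subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)
    # `kubectl apply` is declarative: it diffs desired vs. live state and only changes drift.
    subprocess.run(["kubectl", "apply", "-f", MANIFEST_DIR, "--recursive"], check=True)

if __name__ == "__main__":
    while True:
        reconcile_once()
        time.sleep(60)  # naive polling interval; real controllers watch for changes
```
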

Vizard: Right. So do you think that perhaps the total cost of IT can decline, even if there are inefficiencies, when I move to Kubernetes? Because I am having this more efficient, continuous delivery mindset, I’m more automated, I’ve got a consistent set of APIs in the target platform, so I might actually wind up with a more agile environment that costs less than my existing monolithic environment, which is a lot harder to manage?

Goel: Exactly. In fact, you were spot on there. That’s actually the goal, I would say, definitely for us at D2iQ but I think for the Kubernetes community in general: to get to a place where things are automated with the right tooling such that the overall cost of owning a particular infrastructure, or a cluster, or multiple fleets of clusters goes down. Traditionally, even prior to the adoption of cloud and cloud-native apps, this cost used to increase as you grew infrastructure, because every new piece of infrastructure was a heavy lift of procurement, setting up the environment, and things like that. Those costs have started to go down, and I expect they will continue to go down as we mature in these environments with Kubernetes.

Vizard: Is there a role for AI in terms of bringing down the total cost of Kubernetes and making it simpler to manage?

Goel: Definitely. The way we see it at D2iQ, we call them smart apps. These days most apps, if not all, are coming with some AI element built into them, and that’s because they are continuously receiving data and they continuously need to respond dynamically to that data. We are seeing apps in connected cars. We are seeing apps in medical equipment.

All of these apps, which we collectively categorize as smart apps, need a smart platform. And what I mean by a smart platform is that the current level of tooling kind of gives you observability into the cluster, but it still needs some sort of human interaction in terms of triaging and picking things out. But as this tooling matures, we could also get to a field you might have heard of, AIOps, where the system indicates an incident or something happening even before it happens, or, even if it does happen, it autocorrects itself.

So as those operations mature, it will also, by the way, bring down the IT cost again. So we are moving in that direction and starting to see the benefits of having AI in operations.

Vizard: Do you think that too many people today are just intimidated by Kubernetes? There are too many knobs to turn and things to track, and they don’t go down the path because they look at it and they go, it’s just too hard. So can we make it more accessible?

Goel: Yeah. That’s interesting, because one of the points that made Kubernetes so popular was a low entry barrier, and we saw exponential growth just because it was so easy to interact with the API server using a very standardized command-line tool. But I do acknowledge that as people adopt it more and more, they want to know what’s going on under the hood, and it’s good to understand it at that level for certain teams. But I think most of the tooling is aiming towards building an abstraction layer on top of Kubernetes so that most things are taken care of automatically for those applications. Now, it will take some time, because these tools have to mature through their own lifecycle, but the aim of the overall CNCF, I would say, is to hide or at least abstract out the complexity so that you don’t have to turn all those knobs.

Those are intelligent defaults. The system comes up and reacts to needs like DR. One example of that, as I was saying, is the autoscaling feature that Kubernetes has, which reacts to the scaling requirements of the application. So it already makes your application highly available, and the beauty of that is it works both ways: it expands when there’s a need for more availability of the application, say because of more traffic, but then it scales down when traffic goes down.

So that kind of tooling is aimed at taking complexity out of the system. I would say the reason why we currently feel it’s too complex is that we are in this transition phase, where there are still things that, say, an app developer has to understand before they can start using the infrastructure. My hope and expectation is that as we mature more and more, we will be interacting more with the abstraction layers and not worrying so much about how things are happening under the hood. We may still be curious about them and we can learn about them, but we wouldn’t necessarily need them to do our work or to do our development in such an environment.

Vizard: Is it also that it’s the environment and not just the cluster? There’s all kinds of stuff that I’ve got to layer in on top of Kubernetes, whether it’s Prometheus and all this other stuff, and all that’s got to be managed in context as well. So is that part of the challenge?

Goel: Yes. And that’s the challenge, what we call day-2 operations, that we at D2iQ have been focusing on for quite some time and have baked into the product. What we have seen in the past few years is that developers have adopted Kubernetes. They have become very familiar and very comfortable using it, and now enterprises are moving in and adopting Kubernetes in their production environments. We have seen in the recent CNCF survey that there was an uptick in adoption of Kubernetes in production environments.

And since enterprises are coming in, enterprises come with their own prerequisites for adoption: it should be a secure environment, it should be highly available, it should have monitoring and observability. So these things are getting more attention now and becoming more important. And I also expect that the products coming in will have these complexities resolved for the end user.

Vizard: And one last thing that is becoming a bigger issue is security. So will security be managed as part of the whole day-2 operations? Is that all going to get folded in or where do we go from here?

Goel: Yeah. I know these days everybody has security in mind, with what we have seen recently. Kubernetes already provides quite a few knobs. However, as we were saying, somebody needs to understand in the automation what those knobs are, and it’s not always easy to figure out all the knobs unless you spend some time with the product or with Kubernetes. But I expect the product itself to handle that, and this is something we take very seriously at D2iQ: we make security a first-class citizen of the product. There are best practices that are recommended by the community. However, we don’t expect our customers to know all those best practices when they start using it.

So can those best practices be built into the product by default? The answer is yes, and that’s what makes one product more secure than another: the people building these products have gone through and thought about security as a first-class citizen instead of an afterthought. And yes, as I said, Kubernetes has all the necessary knobs, I would say. It’s just that I would expect products to bake those knobs in, or have sane defaults for those knobs, when they ship their product.
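
As one way to picture “sane defaults baked into the product,” here is a small, hypothetical helper, assuming the official Kubernetes Python client, that applies community-recommended security settings to any container spec a platform generates unless the user has set them explicitly.

```python
# Illustrative helper only: apply commonly recommended pod security defaults
# (non-root, no privilege escalation, read-only root filesystem) unless the
# user has already provided their own securityContext.
from kubernetes import client

def with_secure_defaults(container: client.V1Container) -> client.V1Container:
    if container.security_context is None:
        container.security_context = client.V1SecurityContext(
            run_as_non_root=True,
            allow_privilege_escalation=False,
            read_only_root_filesystem=True,
        )
    return container

# Usage: wrap containers as a hypothetical platform might before submitting the pod spec.
nginx = with_secure_defaults(client.V1Container(name="web", image="nginx:1.25"))
```
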

Vizard: Alright. So what’s your best advice to folks other than hanging out on your website, of course? But if you were running fleets of Kubernetes clusters or even a small number of large ones, what would you be focusing on? What would you be thinking about?

Goel: So one thing that is very clear is that organizations derive their ROI by making their applications run in a secure environment. And so they need to pick and choose what they want to build in house and what they want to buy, because that totally depends on how fast they need to hit the market and basically draw revenue for their firm. I would say on the surface it appears easy to do your own Kubernetes, but as we’ve discussed there are enough knobs, so to speak, or configuration that one has to take care of, that it can get hard if you just take open source Kubernetes and do everything yourself. It’s a huge investment, and you need to decide whether that makes sense for your firm.

For a certain firm it might, but other firms might just want to buy instead of building Kubernetes themselves, because you get support for that infrastructure, and then you can concentrate as a firm on the most important aspect of your organization, which might be the application or the business logic that you want to run. And in that respect, cloud has done a good job, but there are still gaps, and the cost we’ve been talking about is one of those areas where you need a reliable partner, an organization that has thought through all these problems before you and makes sure they are already resolved or taken care of on your behalf in the product itself.

Vizard: Alright. Well I think what you just said is the most important thing is to have that Kubernetes gut check, because once you’re in it you’re in it forever. Alright.

Goel: That’s true.

Vizard: Alright. Deepak, thanks for being on the show.

Goel: Thank you for having me.

Vizard: Alright, guys. Back to you in the studio.

[End of Audio]

Mike Vizard

Mike Vizard is a seasoned IT journalist with over 25 years of experience. He also contributed to IT Business Edge, Channel Insider, Baseline and a variety of other IT titles. Previously, Vizard was the editorial director for Ziff-Davis Enterprise as well as Editor-in-Chief for CRN and InfoWorld.
