A Recipe for Enterprise DevOps With Kubernetes

April 23, 2021 | October 31, 2021 | Mayank Kumar | enterprise, Hyperforce, kubernetes, Salesforce

When we first embraced Kubernetes at Salesforce six years ago, we also accelerated our adoption of a DevOps culture. The two together help us quickly deliver value to our customers and, by following a simple recipe, can help you do the same.

DevOps answers the question, “How do software companies deliver value efficiently in this quickly changing world, at speed and with quality, while maintaining the trust of their customers?”

Some Background

Historically, we have used Puppet to deploy Docker, etcd, and the Kubernetes control plane and worker nodes on bare metal. Puppet relies on systemd to manage the daemons on each machine. We also use Puppet to keep the machines updated with new dependencies and to roll out new versions of Kubernetes. The principles discussed here also apply to our public cloud architecture, Hyperforce.

DevOps Principles

The five principles below are all closely related and will help you reach the end goal of delivering value to your customers.

1. Collaboration between developers and operations
2. Automating processes
3. Culture of continuous improvement
4. Measuring everything
5. Focusing on customer needs

Services are jointly owned end-to-end; no more throwing code over the wall for Ops to deal with after it is written. Everything that happens to deliver software after a commit is made should be automated. There should be a short feedback cycle and quick adaptation based on that feedback.

Operationalize It

We did not have a distinct operations team when we initially set up Kubernetes. We ran the on-call rotation ourselves for three years and got to taste the fruits of our own labor, sour or sweet, so we were highly attuned to operating our stack as efficiently as possible. During this time, we wrote runbooks as we learned (the hard way) what it took to operate the system we had built from scratch. This was a good lesson. Because we were on call around the clock, we had to be mindful of what we were building: Did it alert properly? Could we easily debug it at 2 a.m.? Was it easy to find the root cause of an issue?

As we gradually transitioned the runbooks and on-call duty to our operations team, we streamlined the process. When operations reported a problem, we understood and empathized with them, because we had been in their shoes and knew what it took to operate the infrastructure. We worked with the operations team to fine-tune alerts and to review and triage live site issues. Both teams are now involved early in the design of new features so that we know they will be easy to operate.

Automate All the Things

We’ve enabled a GitOps-style deployment pipeline for internal microservices at Salesforce. Our customers are other teams inside the company who want to focus on their business logic. We also use managed Kubernetes solutions like EKS, GKE and AKS in the public cloud to automate cluster upgrades. We monitor alerts while these upgrades run, and we can pause an upgrade when things go south, which matters especially because we operate a multi-tenant cluster. We also automate node patching and require our tenants to specify a PodDisruptionBudget (PDB) so that patching doesn’t bring down the services running on the cluster.
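
To make the interplay between alert-driven pausing and PDBs concrete, here is a minimal sketch in Python. It is not our actual tooling: `Budget`, `patch_nodes`, and the callback parameters are hypothetical names, and the budget check is a simplified model of a PDB that sets `minAvailable` (a real rollout would go through the Kubernetes eviction API rather than this in-memory model).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Budget:
    """Simplified, hypothetical stand-in for a PDB that sets minAvailable."""
    service: str
    min_available: int
    healthy_pods: int

    def allows_eviction(self) -> bool:
        # Evicting one pod must still leave at least min_available healthy pods.
        return self.healthy_pods - 1 >= self.min_available


def patch_nodes(
    nodes: Iterable[str],
    budgets_on_node: Callable[[str], List[Budget]],
    alerts_firing: Callable[[], bool],
    patch: Callable[[str], None],
) -> Tuple[List[str], List[str], bool]:
    """Patch nodes one at a time.

    Returns (patched, skipped, paused). The rollout pauses as soon as an
    alert fires, and a node is skipped when draining it would violate the
    budget of any tenant service running on it.
    """
    patched: List[str] = []
    skipped: List[str] = []
    for node in nodes:
        if alerts_firing():
            return patched, skipped, True  # pause; a human or a retry resumes later
        if all(b.allows_eviction() for b in budgets_on_node(node)):
            patch(node)
            patched.append(node)
        else:
            skipped.append(node)  # retry once the tenant has spare capacity
    return patched, skipped, False
```

The point of the design is that the tenant declares how much disruption it can tolerate and the platform does the checking, so patching stays fully automated without the platform team needing to know each service's capacity needs.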

Monitor and Measure All the Things

For monitoring, we use watchdogs, metrics and an alerting system based on OpenTSDB. All metrics go to an internal OpenTSDB database. Alerts can then be set up on these metrics and integrated with our on-call paging solution.

In our setup, everything worth measuring is a watchdog. We measure not only our services and infrastructure components, but also our dependencies, and we offer many dashboards in Grafana where users can pick and choose which watchdog(s) to focus on.
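
As an illustration of the watchdog idea, the sketch below runs a check and turns the result into a datapoint with the `metric`, `timestamp`, `value` and `tags` fields that OpenTSDB's HTTP put endpoint accepts. The metric naming scheme, the `run_watchdog` and `should_page` helpers, and the "three consecutive failures" rule are assumptions made for the example, not a description of our internal system.

```python
import time
from typing import Callable, Dict, Optional, Sequence


def run_watchdog(
    name: str,
    check: Callable[[], bool],
    tags: Dict[str, str],
    now: Optional[float] = None,
) -> Dict[str, object]:
    """Run one check and describe the outcome as an OpenTSDB-style datapoint."""
    try:
        ok = bool(check())
    except Exception:
        ok = False  # a crashing check counts as a failure, not as missing data
    return {
        "metric": f"watchdog.{name}.success",  # hypothetical naming scheme
        "timestamp": int(now if now is not None else time.time()),
        "value": 1 if ok else 0,
        "tags": dict(tags),  # OpenTSDB expects at least one tag per datapoint
    }


def should_page(recent_values: Sequence[int], max_failures: int = 3) -> bool:
    """Page only when the most recent max_failures datapoints all failed."""
    tail = list(recent_values)[-max_failures:]
    return len(tail) == max_failures and all(v == 0 for v in tail)
```

Requiring several consecutive failures before paging is one simple way to keep a single flaky run from waking someone at 2 a.m.; the right threshold depends on how often the watchdog runs.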

A few of our key metrics are:

Deployment frequency – how often an organization successfully releases to production
Lead time for changes – the amount of time it takes a commit to get into production
Change failure rate – the percentage of deployments causing a failure in production
Time to restore service – how long it takes an organization to recover from a failure in production
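
The four definitions above can be computed directly from deployment records. The following is a minimal sketch, assuming each deployment is recorded with its commit time, deploy time, whether it caused a production failure, and when service was restored; the `Deployment` record and the `key_metrics` function are hypothetical, and the choice of the median as the summary statistic is ours for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None


def key_metrics(deployments: List[Deployment], window_days: int) -> Dict[str, object]:
    """Summarize the four key metrics over a reporting window."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Only failures that have already been resolved contribute a restore time.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deployments) if deployments else None,
        "median_time_to_restore": median(restores) if restores else None,
    }
```

For example, four deployments in a two-day window with lead times of 2, 4, 6 and 8 hours, one of which failed and was restored an hour later, give 2 deployments per day, a median lead time of 5 hours, a change failure rate of 0.25 and a median time to restore of 1 hour.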

… To Keep Improving.

Continuous improvement and a fast feedback cycle go hand-in-hand and come about naturally when you have an efficient, low-latency CI/CD process combined with continuous measurement and efficient developer and operations collaboration. We and our customers (internal developers) test and deploy to production as often as we want (CI/CD), getting fast feedback. We measure and observe key business metrics related to the services we own and maintain and for which we are changing code (principle 4 in action!). We then make improvements to our code based on measured values and rinse and repeat the whole process. We also identify mundane or repetitive human processes and automate them (see principle 2).

DevOps is a culture shift that leads to shipping the right product, faster, and makes operating it in production easier and more reliable. The combination of Kubernetes, GitOps and cloud-native tools makes DevOps easier to adopt. But your whole organization needs to commit to DevOps; it can’t be only one team’s goal. The transition can take a long time and may initially be painful for all teams involved. DevOps requires deep empathy and collaboration between developer and operations teams.

Our adoption of Kubernetes six years ago also accelerated our adoption of DevOps practices and helped enable our recent complete rearchitecture of Salesforce to take advantage of the scale and agility of the public cloud. By following the five principles of DevOps discussed here, you can do it too!