As we close out 2021, we at Container Journal wanted to highlight the most popular articles of the year. Following is the fourteenth in our series of the Best of 2021.
According to the DevOps Institute, site reliability engineering (SRE) is a discipline and a role that incorporates aspects of software engineering and applies them to infrastructure and operations problems to create ultra-scalable and highly reliable distributed software systems. The DevOps Institute’s SRE blueprint identifies nine pillars of engineering practices: site reliability leadership and culture; work sharing; measurement; SLOs, SLIs and error budgets; toil reduction; deployments; performance management; incident management; and anti-fragility. In this blog, I explain how Kubernetes supports the nine pillars of SRE.
Leadership and Culture – This SRE pillar refers to the human aspects of striving for reliable services as they scale: shifting the “wisdom of production” mindset left from operations into development, and relentlessly pursuing continuous improvement. SRE practices foster a culture of continuously improving the services and systems that support production, so that services remain reliable as they scale. Kubernetes capabilities support this culture and give every type of stakeholder a reason to buy in.
Businesses, whose success increasingly depends on the ability to deliver digital services and software quickly, use containers with Kubernetes as essential tools for building, deploying and running applications at scale. Kubernetes helps business leaders empower the organization to reach next-level performance, deliver innovative new software and features faster, and enable multi-cloud operations for greater agility and resilience. Self-service infrastructure enabled by containers and Kubernetes allows teams to access the resources they need when they need them. The result is more productive, happier teams with fewer impediments, and dramatically faster times to deploy, collaborate, gather feedback and restore services when needed.
Work Sharing – This SRE pillar refers to the practice of working down technical debt in small increments and managing workload percentages for ops, dev and on-call work. SREs deliberately and proactively work to shift-left the “wisdom of production,” educating development teams and folding operations viewpoints and requirements into the service team’s knowledge and activities. The “wisdom of production” informs better system design and behavior. To accomplish this, SREs spend a prescribed amount of their time (say, 50%) on ops-related work, such as issue resolution, and the remainder on development tasks, such as creating new features, scaling or automation. The ability of Kubernetes to spin up multiple container instances with different scaling policies makes it a good fit for DevOps CI/CD jobs and activities. For example, a container instance can include build and test resources, in addition to applications, to create on-demand clusters on a container-capable CI worker node.
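The on-demand CI workloads described above can be expressed as Kubernetes Job objects. The sketch below is illustrative, not prescriptive: the image, command and resource names are assumptions, and the toolchain would be whatever the team’s pipeline uses.

```yaml
# A minimal sketch of a one-shot CI test run as a Kubernetes Job.
# The name, image and command are hypothetical examples.
apiVersion: batch/v1
kind: Job
metadata:
  name: ci-test-run
spec:
  backoffLimit: 2            # retry a failed run at most twice
  template:
    spec:
      restartPolicy: Never   # Jobs manage retries; don't restart in place
      containers:
        - name: test
          image: golang:1.21             # example build/test toolchain image
          command: ["go", "test", "./..."]
```

Because the Job is declared like any other workload, build and test capacity scales with the same orchestration machinery as the applications themselves.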
Measurement – This SRE pillar refers to the practices of using observability, monitoring, telemetry and instrumentation to measure the performance of applications. Modern enterprise applications comprise many microservices distributed across multiple nodes that must communicate with each other. A single point of failure can stop the entire process, but identifying the source of failure can be difficult. Kubernetes monitoring is easier when the containers make monitoring data available. Cloud-native applications are constructed with health reporting metrics to enable the platform to manage life cycle events if an instance becomes unhealthy. They produce (and make available for export) robust telemetry data to alert operators to problems and allow them to make informed decisions. Kubernetes supports liveness and readiness probes that make it easy to determine the state of a containerized application. Kubernetes integrates well with monitoring tools, like Prometheus and Grafana, for creating a Kubernetes monitoring dashboard.
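The liveness and readiness probes mentioned above are declared directly on the container spec. A minimal sketch, assuming a hypothetical app that serves `/healthz` and `/ready` endpoints on port 8080:

```yaml
# Liveness and readiness probes on a single-container Pod.
# Image name and endpoint paths are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: example/app:1.0
      ports:
        - containerPort: 8080
      livenessProbe:               # failure here restarts the container
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 15
      readinessProbe:              # failure here removes the pod from Service endpoints
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```

Separating the two probes lets Kubernetes distinguish “needs a restart” from “not yet able to serve traffic,” which is exactly the life cycle management cloud-native health metrics are meant to enable.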
SLOs and SLIs, Error Budgets – This SRE pillar refers to the practices of using service-level objectives (SLOs) and service-level indicators (SLIs) to measure availability, latency and response time, together with error budgets to manage service reliability. Application-level monitoring is favored for SLOs because it maps more directly to the customer experience than system-level monitoring does. In complex environments, with services that depend on the reliability of many containerized microservices, SLOs and SLIs depend on a clear understanding of the elements that contribute to the measurements. Kubernetes provides a consistent description of large-scale container deployments, which enables a structured architecture for injecting SLIs across all types of deployment platforms.
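As a concrete illustration, an availability SLI is often computed in Prometheus (one of the monitoring tools named above) as the ratio of successful requests to total requests. The recording rule below is a sketch; the metric name `http_requests_total` is the conventional Prometheus example and would be replaced by whatever the service actually instruments:

```yaml
# Prometheus recording rule computing a 5-minute availability SLI.
# Metric and rule names are illustrative assumptions.
groups:
  - name: sli-availability
    rules:
      - record: sli:request_availability:ratio_5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```

An SLO then becomes a target on this recorded series (say, 99.9% over 30 days), and the error budget is whatever remains below that target.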
Toil Reduction – This SRE pillar refers to the practice of reducing non-value-added work through tooling and automation. Kubernetes eliminates many of the manual provisioning and other tasks of enterprise IT operations. In addition, the unified and automated orchestration approach offered by Kubernetes simplifies multi-cloud management, enabling more services to be delivered with less work and fewer errors. The portability of application and orchestration management across private and public cloud platforms and operating system versions allows developers and DevOps teams to build applications and pipelines without worrying about the underlying infrastructure and operating systems the applications run on.
Deployments – This SRE pillar refers to the practice of gradual releases using deployment strategies such as blue/green, feature flags and canary, with scripts to help automate deployment, testing and monitoring tasks. Kubernetes offers many capabilities that allow one container to support many configuration environment contexts, which avoids the need for specialized containers for different environment configurations. The flexible deployment topologies afforded by Kubernetes support advanced deployment and test scenarios, such as chaos engineering or A/B testing on real production sites, reducing risk and improving resiliency. The declarative syntax used to define the deployment state of Kubernetes-deployed container clusters greatly simplifies the management of delivery and deployments. And because Kubernetes is used in production, the testing stages in the pipeline are more realistic and better match the configurations that applications will face in production.
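The declarative deployment state described above includes the rollout strategy itself. A minimal sketch of a gradual rolling update, with names and image tags as illustrative assumptions:

```yaml
# Deployment declaring a gradual rollout: one extra pod at a time,
# never dropping below desired capacity. Names/images are examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # add at most one new-version pod at a time
      maxUnavailable: 0      # keep full capacity during the rollout
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example/app:1.1   # the new version being rolled out
```

Changing only the image tag and re-applying the manifest triggers the gradual rollout; a failing new version can be rolled back the same declarative way.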
Performance Management – This SRE pillar refers to practices for application performance monitoring, capacity testing and auto-scaling. Kubernetes can easily scale infrastructure and worker node configurations to match variable integration and test workload demands. With Kubernetes, infrastructure resources are clustered and can be consumed and released elastically, enabling seamless scaling and higher resource utilization.
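Auto-scaling at the workload level is typically declared with a HorizontalPodAutoscaler. A sketch, assuming a Deployment named `app` and a CPU-utilization target (both illustrative):

```yaml
# HorizontalPodAutoscaler scaling a hypothetical Deployment on CPU usage.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```

The same mechanism can target custom metrics, letting capacity track the SLIs the team actually cares about rather than raw CPU alone.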
Incident Management – This SRE pillar refers to practices of emergency response, management of on-call workloads for SREs (for example, 25%) and blameless retrospectives. Kubernetes supports this SRE pillar with capabilities to create, validate and manage Kubernetes clusters from a single user terminal. Remote visibility, health checks and alerts of pods and clusters assist with on-call support. SREs can upgrade the Kubernetes stack along with different frameworks used in the setup and apply patches to clusters.
Anti-Fragility – This SRE pillar refers to practices for improving resilience using fire drills, chaos engineering tools such as Chaos Monkey, security and automation. Complex applications operating over complex distributed infrastructures can be difficult to maintain and secure. Cloud-native tools such as Kubernetes provide more insight into what is happening within an application, making it easier to identify and fix problems, including security problems. Kubernetes automatically restarts pods that become unhealthy. The enhanced orchestration controls Kubernetes provides over deployments and deployed containerized applications bring immutable consistency and improved response times. Kubernetes Secrets objects offer a secure way to store sensitive data. Finally, Kubernetes enables modular, distributed services that are better able to scale and recover from failures.
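A minimal sketch of the Secrets objects mentioned above, with illustrative names and values. Note that Secrets are base64-encoded by the API server, so encryption at rest and RBAC restrictions are still needed in practice:

```yaml
# A Secret holding hypothetical database credentials.
# stringData accepts plain text; the API server stores it base64-encoded.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  username: app_user     # illustrative values only
  password: change-me
```

Pods then consume the Secret as environment variables or a mounted volume, keeping credentials out of container images and Deployment manifests.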
What This Means
Kubernetes supports the nine pillars of SRE through its capabilities to standardize the definition, architecture and orchestration of containerized applications and infrastructures suited for ultra-scalable and highly reliable, distributed software services.