What is a Kubernetes Operator and why it matters for SRE

Originally published on Failure is Inevitable.

Kubernetes is an open-source platform that orchestrates containerized workloads and services, managing their deployment, scaling, and configuration. Google announced it in 2014 and released version 1.0 in 2015; the project is now maintained by the Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon. The majority of cloud native companies use it, cloud providers offer managed commercial versions, and there’s even an annual convention!

What has made Kubernetes so fundamental? A major factor is its automation capabilities. Kubernetes can automatically change the configuration of deployed containers, or even deploy new ones, based on metrics it tracks or requests made by engineers. Having Kubernetes handle these processes saves time, eliminates toil, and increases consistency.

If these benefits sound familiar, it might be because they overlap with the philosophies of SRE. But how do you incorporate the automation of Kubernetes into your SRE practices? In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

What the Kubernetes Operator can do

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

Kubernetes Operators complete sophisticated tasks

The Operator can complete complex tasks in order to achieve the desired changes in the application’s output. It can automatically handle such tasks as:

Deploying applications

Updating applications to new versions

Reconfiguring application settings

Scaling applications up and down depending on usage

Handling failures

Setting up monitoring infrastructure

Without Kubernetes Operators, engineers would need to complete these tasks by hand. Automating them saves time and toil, and makes the procedures and results consistent.

Kubernetes Operators control custom resources and applications

Kubernetes allows you to create and define custom resources based on specific applications. A custom resource is a data object that extends the Kubernetes API; it typically declares the desired state of the application and records its observed state. Imagine you have an application that produces new server instances based on usage. You could define your custom resource to record the RAM and disk space of each new instance. You can also define a custom resource as a target that the application is trying to match. The Kubernetes Operator can then control the application to achieve the target custom resource; if the application is spinning up servers that have insufficient RAM or disk space, the Operator can reconfigure the settings to match the desired amount.
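To make the idea concrete, here is a minimal sketch of what such a custom resource might look like, written as a plain Python dictionary rather than a real manifest. The kind `ServerPool`, the group `example.com`, and every field under `spec` and `status` are hypothetical names invented for illustration. Only the general split between a desired `spec` and an observed `status` follows common Kubernetes convention.

```python
# Hypothetical custom resource, modeled as a plain dict for illustration.
# "ServerPool", "example.com", and all field names are assumptions.
server_pool = {
    "apiVersion": "example.com/v1",
    "kind": "ServerPool",
    "metadata": {"name": "web-pool"},
    "spec": {"ramGiB": 8, "diskGiB": 100},    # desired state
    "status": {"ramGiB": 4, "diskGiB": 100},  # observed state
}


def find_drift(resource):
    """Return {field: (observed, desired)} for every field that differs."""
    spec = resource["spec"]
    status = resource.get("status", {})
    return {
        key: (status.get(key), desired)
        for key, desired in spec.items()
        if status.get(key) != desired
    }


if __name__ == "__main__":
    # Only RAM differs between observed and desired state in this example.
    print(find_drift(server_pool))
```

In this sketch, `find_drift` reports that the observed RAM (4 GiB) falls short of the desired 8 GiB, which is the kind of gap an Operator would act on.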

Kubernetes Operators make stateful decisions

The Kubernetes Operator can modify the configuration and usage of an application based on the application’s observed state. Both the desired state and the current state are expressed through the custom resources defined for that application, and together they form a control loop. The Operator observes the current state, compares it with the desired state, and then takes actions that move the application toward the desired state. After the actions are executed, the current state is reevaluated and the loop begins again.

For example, a custom resource could define the desirable state of a new server instance as some amount of load capability based on its physical resources. The Operator would then adjust the configuration until new instances reached these standards.
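The observe, compare, and act cycle can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not how any real Operator framework is implemented: `ServerConfig`, `reconcile`, and `run_loop` are hypothetical names, and the "actions" here simply mutate an in-memory object instead of calling the Kubernetes API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerConfig:
    """Hypothetical stand-in for the state an Operator manages."""
    ram_gib: int
    disk_gib: int


def reconcile(current: ServerConfig, desired: ServerConfig) -> List[str]:
    """Run one pass: compare states, apply changes, report actions taken."""
    actions = []
    if current.ram_gib != desired.ram_gib:
        actions.append(f"set ram_gib {current.ram_gib} -> {desired.ram_gib}")
        current.ram_gib = desired.ram_gib
    if current.disk_gib != desired.disk_gib:
        actions.append(f"set disk_gib {current.disk_gib} -> {desired.disk_gib}")
        current.disk_gib = desired.disk_gib
    return actions


def run_loop(current: ServerConfig, desired: ServerConfig,
             max_passes: int = 5) -> List[str]:
    """Repeat reconcile until a pass changes nothing or passes run out."""
    history = []
    for _ in range(max_passes):
        actions = reconcile(current, desired)
        if not actions:
            break  # current state already matches desired state
        history.extend(actions)
    return history


if __name__ == "__main__":
    current = ServerConfig(ram_gib=4, disk_gib=50)
    desired = ServerConfig(ram_gib=8, disk_gib=100)
    for line in run_loop(current, desired):
        print(line)
```

Note that the loop only compares states. It never asks why they diverged, which mirrors how an Operator behaves.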

Kubernetes Operators and SRE

If you’re using Kubernetes, you’ll find that building and implementing Operators aligns with your SRE goals.

Operator monitoring, SLIs, and SLOs

When developing the custom resource for your application, you need to choose which signals from the application’s output will be monitored by the resource and which targets the Operator will steer the application toward. This is similar to creating SLIs and SLOs.

The process of determining which metrics have the greatest impact is similar for Operators and SLIs. In the Kubernetes Operators book, Dobies and Wood suggest looking first at the “four golden signals” (a concept from Google’s SRE book) to determine what the Operator should monitor. These are:

Latency

Traffic

Errors

Saturation

Creating Operators for your applications will help you understand what SLIs and SLOs should be set for them. Likewise, setting SLIs and SLOs can help you understand what your Operators should monitor.
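As a rough illustration of the four golden signals, the sketch below summarizes them for a single observation window. The `RequestSample` shape, the `golden_signals` function, and the choice of median latency are assumptions made for this example; real monitoring systems usually track percentiles and collect these values continuously.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    duration_ms: float
    ok: bool


def golden_signals(samples: List[RequestSample], window_s: float,
                   used: float, capacity: float) -> dict:
    """Summarize the four golden signals for one observation window."""
    total = len(samples)
    failures = sum(1 for s in samples if not s.ok)
    return {
        # Latency: median request duration in the window.
        "latency_ms": median(s.duration_ms for s in samples) if total else 0.0,
        # Traffic: requests per second over the window.
        "traffic_rps": total / window_s,
        # Errors: fraction of requests that failed.
        "error_rate": failures / total if total else 0.0,
        # Saturation: fraction of a constrained resource in use.
        "saturation": used / capacity,
    }


if __name__ == "__main__":
    samples = [
        RequestSample(100, True),
        RequestSample(200, True),
        RequestSample(300, False),
        RequestSample(400, True),
    ]
    print(golden_signals(samples, window_s=2.0, used=3, capacity=4))
```

Any of these values could serve as an SLI, and a threshold on it could serve as the target an Operator steers toward.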

For example, you might notice that when servers are overloaded, your customers are unhappy with the application’s availability.

You can define a custom resource that tracks the disk space available. When remaining capacity falls to 5%, the Operator spins up new server instances, giving your customers better service. Your SLI measures availability, with free disk space as a supporting signal. Your SLO might dictate that you need to achieve 99.9% availability to keep your customers happy, which informs the Operator’s intervention points.
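The arithmetic behind these numbers is simple enough to sketch. The function names, the 30-day window, and the default 5% threshold below are assumptions chosen to match the example above, not values prescribed by Kubernetes or by any SRE standard.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def needs_scale_up(free_bytes: int, total_bytes: int,
                   threshold: float = 0.05) -> bool:
    """True when the remaining disk fraction is at or below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes <= threshold


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 4 GiB free out of 100 GiB is below the 5% intervention point.
    print(needs_scale_up(4 * 2**30, 100 * 2**30))
```

A 99.9% availability target over 30 days allows roughly 43 minutes of downtime, so an intervention point such as the 5% disk threshold should leave the Operator enough time to add capacity before that budget is spent.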

Automating SRE application deployment

Your SRE practice will involve applications being deployed on a regular basis for each new instance of a service. For example, you may want to deploy a monitoring application every time you implement a new area of system architecture. Kubernetes Operators can expedite and automate this process. For monitoring, the Prometheus Operator, one of the first Operators and originally developed by CoreOS, automatically deploys and manages instances of the open-source monitoring software Prometheus on the cluster where it is installed.

SRE tools represent an investment in reliability. The time spent implementing them is paid for by the time they save. Creating Operators is a similar investment. By creating Operators, you save time on each deployment. Furthermore, deployments are consistent and reliable. Your SRE practices have less overhead and can scale with your organization.

Operators and incident management

Operators can be set up to make adjustments to handle failure. If the application’s current state, as reported through its custom resource, varies from the desired state, the Operator will make changes to compensate until the desired state is achieved. The cause of the variation is irrelevant to the Operator. It only operates based on the current and desired states. You will still need to work through an incident retrospective to bubble up contributing factors.

When developing your incident response plan, the behavior of your Operators can be a valuable resource. If you know that the Operator will automatically try to correct the behavior, you can incorporate that into your expectations and procedures. For example, if you have an incident response plan for oversaturated servers, your Operator could spin up new server instances or reconfigure load balancing. Your response plan would take this into account, saving you some troubleshooting steps and allowing you to focus on the originating issue. By combining Operators and automated runbooks, you can minimize the amount of manual escalation and resolve many incidents without human intervention. Since automation is a core goal of SRE, this is one more way that Kubernetes Operators fit into your reliability strategy.
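One way to picture how automated remediation and escalation fit together is the sketch below. It is a simulation under stated assumptions: `FakePool`, `handle_incident`, and the bounded retry count are hypothetical and stand in for whatever your Operator and runbook tooling actually do.

```python
from typing import Callable, Tuple


class FakePool:
    """Hypothetical server pool used only to simulate saturation."""

    def __init__(self, load: int, servers: int, capacity_per_server: int):
        self.load = load
        self.servers = servers
        self.capacity_per_server = capacity_per_server

    def is_healthy(self) -> bool:
        # Healthy when total capacity covers the current load.
        return self.load <= self.servers * self.capacity_per_server

    def add_server(self) -> None:
        self.servers += 1


def handle_incident(is_healthy: Callable[[], bool],
                    remediate: Callable[[], None],
                    max_attempts: int = 3) -> Tuple[str, int]:
    """Remediate automatically a bounded number of times, then escalate."""
    attempts = 0
    while not is_healthy():
        if attempts >= max_attempts:
            return ("escalate", attempts)  # hand off to a human responder
        remediate()
        attempts += 1
    return ("resolved", attempts)


if __name__ == "__main__":
    pool = FakePool(load=100, servers=1, capacity_per_server=40)
    print(handle_incident(pool.is_healthy, pool.add_server))
```

Bounding the number of automated attempts is a deliberate design choice in this sketch: it keeps the automation from masking a problem that needs human attention, and it gives the response plan a clear point at which escalation begins.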

As you shift your services to a container-based model and Kubernetes becomes more fundamental to your DevOps practices, it’s important to incorporate Operators into your reliability strategy. Operators allow you to extend Kubernetes with custom resources and responses, allowing for more automation and less toil.

Blameless

Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.