What is a Kubernetes Operator and why it matters for SRE


    Originally published on Failure is Inevitable.

    Kubernetes is an open-source platform for deploying, configuring, and managing containerized workloads and services. Google open-sourced it in 2014 and version 1.0 followed in 2015; the project is now maintained by the Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon. The majority of cloud native companies use it, vendors offer commercial prebuilt versions, and there’s even an annual convention!

    What has made Kubernetes become such a fundamental service? A major factor is its automation capabilities. Kubernetes can automatically make changes to the configuration of deployed containers or even deploy new containers based on metrics it tracks or requests made by engineers. Having Kubernetes handle these processes saves time, eliminates toil, and increases consistency.

    If these benefits sound familiar, it might be because they overlap with the philosophies of SRE. But how do you incorporate the automation of Kubernetes into your SRE practices? In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

    What the Kubernetes Operator can do

    In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

    Kubernetes Operators complete sophisticated tasks

    The Operator can complete complex tasks in order to achieve the desired changes in the application’s output. It can automatically handle such tasks as:

    Deploying applications

    Updating applications to new versions

    Reconfiguring application settings

    Scaling applications up and down depending on usage

    Handling failures

    Setting up monitoring infrastructure

    Without Kubernetes Operators, engineers would need to complete these tasks. Automating them saves time and toil, and makes the procedures and results consistent.

    Kubernetes Operators control custom resources and applications

    Kubernetes allows you to create and define custom resources based on specific applications. A custom resource is a data object stored in the Kubernetes API, with a schema you define through a CustomResourceDefinition (CRD). Typically, one part of the object declares the state you want the application to reach, and another part records the state that is currently observed. Imagine you have an application that produces new server instances based on usage. Your custom resource could declare the RAM and disk space that each new instance should have. The Kubernetes Operator can then control the application to achieve that target; if the application is spinning up servers that have insufficient RAM or disk space, the Operator can reconfigure the settings to match the desired amount.
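
As a rough illustration of the server-instance example, the sketch below models a custom resource as a plain Python dictionary with a `spec` (the target) and a `status` (what is currently observed), and reports where the two disagree. The `ServerPool` kind, the `example.com/v1` API group, and the field names `ramGiB` and `diskGiB` are hypothetical, chosen only for this example; in a real cluster the object would live in the Kubernetes API rather than in a local variable.

```python
# Hypothetical custom resource for the server-instance example.
# The kind, API group, and field names are illustrative assumptions.
server_pool = {
    "apiVersion": "example.com/v1",
    "kind": "ServerPool",
    "metadata": {"name": "web-pool"},
    # spec: the target the Operator steers the application toward
    "spec": {"ramGiB": 16, "diskGiB": 200},
    # status: what was most recently observed for new instances
    "status": {"ramGiB": 8, "diskGiB": 200},
}


def find_drift(resource):
    """Return {field: (observed, desired)} for each field that misses its target."""
    spec = resource["spec"]
    status = resource.get("status", {})
    return {
        field: (status.get(field), desired)
        for field, desired in spec.items()
        if status.get(field) != desired
    }


drift = find_drift(server_pool)
print(drift)  # {'ramGiB': (8, 16)}
```

Here only the RAM differs from the target, so that is the one setting an Operator would need to reconfigure.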

    Kubernetes Operators make stateful decisions

    The Kubernetes Operator is able to modify the configuration and usage of an application based on the application’s output. This is determined by the custom resources defined for that application. The desired state declared in a custom resource and the current state observed from the application form a control loop. The Operator observes the current state and then takes actions that move the application toward the desired state. After the actions are executed, the current state is reevaluated and the loop begins again.

    For example, a custom resource could define the desirable state of a new server instance as some amount of load capability based on its physical resources. The Operator would then adjust the configuration until new instances reached these standards.
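
The observe, compare, act cycle described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how any particular Operator is implemented: `InstanceConfig`, `observe`, and `reconcile` are hypothetical names, and the "cluster" is just an in-memory object.

```python
from dataclasses import dataclass


@dataclass
class InstanceConfig:
    """Hypothetical settings the Operator can change for new server instances."""
    ram_gib: int
    disk_gib: int


def observe(config):
    """Stand-in for querying the cluster: report the current state."""
    return {"ram_gib": config.ram_gib, "disk_gib": config.disk_gib}


def reconcile(desired, config, max_rounds=10):
    """Run observe -> compare -> act until the states match or rounds run out.

    Returns the number of rounds in which a correction was made.
    """
    corrections = 0
    for _ in range(max_rounds):
        current = observe(config)
        diff = {k: v for k, v in desired.items() if current[k] != v}
        if not diff:
            break  # current state matches desired state
        # Act: fix one field per round, then loop to re-evaluate.
        field, target = next(iter(diff.items()))
        setattr(config, field, target)
        corrections += 1
    return corrections


config = InstanceConfig(ram_gib=8, disk_gib=100)
desired = {"ram_gib": 16, "disk_gib": 200}
rounds = reconcile(desired, config)
print(rounds, config)  # 2 InstanceConfig(ram_gib=16, disk_gib=200)
```

The bounded loop keeps the sketch easy to run. A real Operator is typically triggered by change events from the Kubernetes API and keeps reconciling for as long as it runs.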

    Kubernetes Operators and SRE

    If you’re using Kubernetes, you’ll find that building and implementing Operators aligns with your SRE goals.

    Operator monitoring, SLIs, and SLOs

    When developing the custom resource for your application, you need to choose which signals from the application’s output will be monitored by the resource and which targets the Operator will steer the application toward. This is similar to creating SLIs and SLOs.

    The process of determining which metrics have the greatest impact is similar for Operators and SLIs. In their book, Dobies and Wood suggest looking first at the “four golden signals” (a concept from Google’s SRE book) to determine what the Operator should monitor. These are:

    Latency

    Traffic

    Errors

    Saturation

    Creating Operators for your applications will help you understand what SLIs and SLOs should be set for them. Likewise, setting SLIs and SLOs can help you understand what your Operators should monitor.

    For example, you might notice that when servers are overloaded, your customers are unhappy with the application’s availability.

    You can define a custom resource that tracks the disk space available. At 5% remaining capacity, the Operator will spin up new server instances, giving your customers better service. Your SLI will be based on availability, with remaining disk space as the signal you monitor. Your SLO might dictate that you need to achieve 99.9% availability to keep your customers happy, which informs the Operator’s intervention points.
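
To make the numbers in this example concrete, here is a small worked sketch. It assumes a 30-day measurement window, which the example above does not specify, and the function names are hypothetical. It shows how much downtime a 99.9% availability SLO leaves room for, and how the 5% disk-space intervention point could be checked.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of unavailability an availability SLO permits in the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def availability(good_minutes, total_minutes):
    """Availability SLI: fraction of the window in which the service was up."""
    return good_minutes / total_minutes


def needs_more_capacity(free_disk_fraction, threshold=0.05):
    """True once free disk space falls to the intervention point (5% here)."""
    return free_disk_fraction <= threshold


budget = allowed_downtime_minutes(0.999)
print(round(budget, 1))           # 43.2 minutes in a 30-day window
print(needs_more_capacity(0.04))  # True: 4% free is below the 5% threshold
```

With only about 43 minutes of allowable downtime per 30 days, it makes sense for the Operator to act before the disk is full rather than after the service becomes unavailable.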

    Automating SRE application deployment

    Your SRE practice will involve deploying applications on a regular basis for each new instance of a service. For example, you may want to deploy a monitoring application every time you implement a new area of system architecture. Kubernetes Operators can expedite and automate this process. For monitoring, the Prometheus Operator, originally developed by CoreOS, is one of the first Operators ever created. It automatically deploys and manages instances of the open-source monitoring software Prometheus on any targeted clusters.
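
As a hedged sketch of what requesting a Prometheus instance can look like, the snippet below builds a manifest for the Prometheus Operator's `Prometheus` custom resource and prints it as JSON. The helper name `prometheus_manifest`, the resource name, the namespace, and the `team: frontend` label are assumptions made for illustration; check the fields against the Prometheus Operator version you run before relying on them.

```python
import json


def prometheus_manifest(name, namespace, replicas=2, team_label="frontend"):
    """Build a minimal manifest asking the Prometheus Operator for an instance.

    The name, namespace, and label values are placeholders for this example.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Prometheus",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # How many Prometheus replicas the Operator should keep running
            "replicas": replicas,
            # Which ServiceMonitor objects this instance should pick up
            "serviceMonitorSelector": {"matchLabels": {"team": team_label}},
        },
    }


manifest = prometheus_manifest("example", "monitoring")
print(json.dumps(manifest, indent=2))
```

Because the manifest only states the desired outcome, the same function can be reused for each new area of the architecture, leaving the deployment steps to the Operator.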

    SRE tools represent an investment in reliability. The time spent implementing them is paid for by the time they save. Creating Operators is a similar investment. By creating Operators, you save time on each deployment. Furthermore, deployments are consistent and reliable. Your SRE practices have less overhead and can scale with your organization.

    Operators and incident management

    Operators can be set up to make adjustments to handle failure. If the state reported for the application varies from the desired state, the Operator will make changes to compensate until the desired state is achieved. The cause of the variation is irrelevant to the Operator. It only operates based on the current and desired states. You will still need to work through an incident retrospective to bubble up contributing factors.

    When developing your incident response plan, the behavior of your Operators can be a valuable resource. If you know that the Operator will automatically try to correct the behavior, you can incorporate that into your expectations and procedures. For example, if you have an incident response plan for oversaturated servers, your Operator could spin up new server instances or reconfigure load balancing. Your response plan would take this into account, saving you some troubleshooting steps and allowing you to focus on the originating issue. By combining Operators and automated runbooks, you can minimize the amount of manual escalation and resolve many incidents without human intervention. Since automation is a core goal of SRE, this is one more way that Kubernetes Operators fit into your reliability strategy.
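
One way to picture combining automated remediation with escalation is the sketch below: try each automated step in order, re-check health after each one, and page a human only if nothing works. Everything here is hypothetical (the `handle_incident` helper, the toy `state` dictionary, and the 100-load-units-per-instance health rule); it illustrates the flow, not a specific runbook tool.

```python
def handle_incident(is_healthy, remediations, page_oncall):
    """Try each automated remediation in order; escalate if none restores health.

    Returns the name of the step that fixed the issue, or None if escalated.
    """
    for name, action in remediations:
        action()
        if is_healthy():
            return name
    page_oncall("automated remediation exhausted")
    return None


# Toy state standing in for a saturated service (all names are hypothetical).
state = {"instances": 2, "load": 300}  # healthy at <= 100 load units per instance
pages = []


def is_healthy():
    return state["load"] / state["instances"] <= 100


def add_instance():
    state["instances"] += 1


remediations = [("scale-up", add_instance), ("scale-up-again", add_instance)]
fixed_by = handle_incident(is_healthy, remediations, pages.append)
print(fixed_by, pages)  # scale-up []
```

In this run the first scale-up is enough, so nobody is paged. If the load were high enough that both steps failed, `pages` would receive the escalation message and the responders would start from the originating issue rather than from the first troubleshooting steps.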

    As you shift your services to a container-based model and Kubernetes becomes more fundamental to your DevOps practices, it’s important to incorporate Operators into your reliability strategy. Operators let you extend Kubernetes with custom resources and responses, enabling more automation and less toil.

    Blameless

    Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.
