How to Scale End-to-End Observability in AWS Environments

Getting Buy-in from C-Levels for Error Budgets and SLOs

In this blog post, we'll look at how to encourage C-Levels to adopt SRE best practices such as SLOs and error budgets by providing the correct metrics for decision making.


    Originally published on Failure is Inevitable.

    You have postmortems implemented, automated, and well-structured. You're generating reports and data automatically based on all your incidents. Two levels of management have bought into your SRE efforts. That is a huge accomplishment! If you’re here, you're gaining real traction in adopting SRE best practices, but the battle is not won yet. The hardest but most strategic effort will be proving to your C-levels why they should buy into SRE.

    The situation

    You’re moving from an incident-driven, reactive mode to a proactive mode. You're elevating reliability to help innovation and guide decisions across the software lifecycle. You’ve assigned a metric to it and are upholding it. Now you’re looking ahead to the next phase.

    This phase revolves around well-defined SLOs and SLIs that hook into the right parts of the system. You'll need your business teams to agree on the SLO, the error budget thresholds, and what will happen if a threshold is breached. To propose this, keep two key thoughts in mind.

    1. What does your error budget policy include? We define error budget policies as including SLOs, SLIs, and error budget responses.
    2. Organization-wide adoption of SRE will be a large undertaking for your C-levels. Your CEO/CTO/CIO will need company-wide support to connect engineering, product, and business units. So, your incentives need to persuade them.
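
    To make the first point concrete, here is a minimal sketch of how an SLO target turns into an error budget. The class name, field names, and the example figures (a 99.9% target over 30 days) are illustrative assumptions, not values from a real policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudgetPolicy:
    """Hypothetical container for the three parts of an error budget policy."""

    sli_name: str         # SLI: what is measured, e.g. share of successful requests
    slo_target: float     # SLO: agreed target as a fraction, e.g. 0.999
    window_days: int      # rolling window over which the SLO is evaluated
    breach_response: str  # agreed response once the budget is spent

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no budget at all.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")

    def budget_fraction(self) -> float:
        """Share of events (or time) that is allowed to fail."""
        return 1.0 - self.slo_target

    def budget_minutes(self) -> float:
        """The same budget expressed as minutes of full downtime per window."""
        return self.budget_fraction() * self.window_days * 24 * 60

    def remaining_fraction(self, good_events: int, total_events: int) -> float:
        """Share of the budget still unspent; negative means the SLO is breached."""
        if total_events == 0:
            return 1.0
        allowed_bad = self.budget_fraction() * total_events
        actual_bad = total_events - good_events
        return 1.0 - actual_bad / allowed_bad


policy = ErrorBudgetPolicy(
    sli_name="availability",
    slo_target=0.999,
    window_days=30,
    breach_response="pause feature releases until the budget recovers",
)
print(f"Budget: {policy.budget_minutes():.1f} minutes per {policy.window_days} days")
print(f"Remaining: {policy.remaining_fraction(999_500, 1_000_000):.0%}")
```

    Expressing the budget in minutes and as a percentage remaining tends to be easier for a non-engineering audience to discuss than a string of nines.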

    The incentives

    These incentives won’t be what you see day-in and day-out from SRE. Instead, these will be the ones that your C-level is most excited by.

    • Long-term competitive advantage: Protect customer experience compared to competitors and increase customer loyalty.
    • Growing complexity of tech stacks and dependency on microservices: Issues worsen if unaddressed. As we move toward a world of complex, distributed systems, the way we operate must evolve to support that. This is the chance to catch up.
    • Reliability is feature No. 1: If a user can’t access your service or has a degraded experience, then features are irrelevant. Reliability is the foundation that all other features build upon.

    Of course, you can expect resistance towards adoption, even with these high-level incentives.

    The resistance

    This will come down to company priorities. C-level executives might not see the link between business performance and reliability, because incentives are often aligned toward new product innovation. So, it may be difficult to convince them that SRE should be a company-level priority. But, by leveraging both an emotional and a logical appeal, you can succeed.

    The emotional appeal

    Here, we lean on customer impact. Everyone at the C-level cares about customer happiness. Satisfied customers cultivate pride, while dissatisfied customers create fear.

    Additionally, there is a significant financial aspect involved. Without SRE, reliability problems reach customers directly and show up as SLA losses. These can be very expensive and damaging to the brand and to customer trust. If the reliability issues are too disruptive to overlook, customers may churn. The data you collect on the cost of downtime can indicate how reliability affects your brand value.

    To avoid triggering an SLA breach, you'll need to adopt SLOs. These often act as a safety net, letting you know when you’re in danger before you need to start sounding the alarms. To prove to C-levels that SLOs are crucial, you can do two things.

    1. Quantify the cost of downtime (e.g. SLA losses) and estimate a bottom line for reliability impact.
    2. Show them your organization's Net Promoter Score (NPS) alongside a detailed customer satisfaction survey to correlate the score with reliability.
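
    For the first item, a rough estimate can be as simple as the sketch below. It assumes a flat revenue-per-minute figure and a known SLA credit per incident; the function name and every number in the example are hypothetical placeholders to be replaced with your own data.

```python
def downtime_cost(
    downtime_minutes: float,
    revenue_per_minute: float,
    sla_credits: float = 0.0,
) -> float:
    """Estimate the direct cost of one incident.

    Assumes lost revenue scales linearly with downtime, which is a
    simplification: it ignores churn, support load, and brand damage.
    """
    if downtime_minutes < 0 or revenue_per_minute < 0 or sla_credits < 0:
        raise ValueError("inputs must be non-negative")
    return downtime_minutes * revenue_per_minute + sla_credits


# Hypothetical incidents: (minutes of downtime, SLA credits paid out)
incidents = [(90, 5000.0), (30, 0.0), (12, 1500.0)]
revenue_per_minute = 200.0  # assumed figure, for illustration only

total = sum(
    downtime_cost(minutes, revenue_per_minute, credits)
    for minutes, credits in incidents
)
print(f"Estimated direct cost of downtime: ${total:,.0f}")
```

    Even a deliberately conservative number like this gives executives a bottom line to weigh against the cost of investing in reliability.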

    The logical appeal

    Need for a competitive advantage: When you offer services similar to your competitors', you look like a less viable option if a competitor can respond to, recover from, and prevent incidents better than you can. SLOs are an important lever for understanding your product and customer experience so that you stay ahead of the competition.

    Show your executives the metrics behind the SLOs and explain how they are set to optimize performance of the most important paths in the user’s journey. Consider bringing figures on the amount of data and the number of access points in the cloud, as well as the number of services the company depends on. This shows the need for a system that can adapt to the complexity that comes with moving toward cloud and microservices.

    Proactive is always better than reactive: SLOs and the use of error budgets help us move from a reactive mode (knowing that incidents will occur but not where and why), to a proactive mode of anticipating areas of risk and failure. Error budgets with negotiated terms between the business and engineering teams allow teams to respond in the right way by standardizing actions and protocols.

    To prove this, you’ll need two metrics:

    1. Automated reporting on incidents, SLOs, and error budgets that highlights risk areas before customers are impacted.
    2. A map of all areas of customer impact which could have been prevented with this knowledge.
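
    As one possible shape for the first item, the sketch below ranks services by how much of their error budget they have consumed and flags the ones approaching a breach. The service names, event counts, and the 75% warning threshold are all assumptions made for illustration.

```python
def budget_consumed(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad <= 0:
        return 0.0 if actual_bad == 0 else float("inf")
    return actual_bad / allowed_bad


def risk_report(services: list, warn_at: float = 0.75) -> list:
    """Return (name, consumed, status) tuples, riskiest service first."""
    rows = []
    for svc in services:
        consumed = budget_consumed(svc["slo_target"], svc["good"], svc["total"])
        if consumed >= 1.0:
            status = "breached"
        elif consumed >= warn_at:
            status = "at risk"
        else:
            status = "healthy"
        rows.append((svc["name"], consumed, status))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical per-service counts for the current SLO window
services = [
    {"name": "checkout", "slo_target": 0.999, "good": 99_910, "total": 100_000},
    {"name": "search", "slo_target": 0.99, "good": 99_800, "total": 100_000},
    {"name": "login", "slo_target": 0.999, "good": 99_850, "total": 100_000},
]

for name, consumed, status in risk_report(services):
    print(f"{name:<10} {consumed:>6.0%} of budget used  [{status}]")
```

    The "at risk" rows are the ones worth showing to executives: they are areas where the team can act before customers notice.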

    With these metrics and appeals to both the emotion and logic of your C-level executives, you’ll be able to convince them that investing in SRE is a strategic initiative that impacts the success of the entire company.

    Blameless

    Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.
