
Here are the Important Differences Between SLI, SLO, and SLA




    When embarking on your SRE journey, it can seem daunting to decipher all the acronyms. What are SLOs versus SLAs? What’s the difference between SLIs and SLOs? In this blog post, we’ll cover what SLI, SLO, and SLA mean and how they contribute to your reliability goals.

    What’s the difference between SLI, SLO, and SLA?

    Below are the definitions for each of these terms, as well as a brief description. Definitions are according to the Google SRE Handbook.

    SLI: “a carefully defined quantitative measure of some aspect of the level of service that is provided.”

SLIs are quantitative measures, typically provided through your APM platform. Traditionally, they cover latency (response time, including queue/wait time, measured in milliseconds) or availability (the proportion of requests served successfully). A collection of SLIs, sometimes called a composite SLI, is a group of SLIs that roll up into a larger SLO. These indicators are points on a digital user journey that contribute to customer experience and satisfaction.
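To make this concrete, here is a minimal sketch, not tied to any particular APM platform, of how an availability SLI and a latency SLI might be computed from raw request data. The request values are invented for the example.

```python
# Minimal sketch: computing an availability SLI and a p95 latency SLI from a
# window of raw request data. The request values below are invented examples.
from statistics import quantiles

# (latency_ms, succeeded) for each request in the measurement window
requests = [
    (120, True), (340, True), (95, True), (2100, False), (410, True),
    (180, True), (760, True), (90, True), (1500, True), (240, True),
]

# Availability SLI: fraction of requests that succeeded
availability = sum(1 for _, ok in requests if ok) / len(requests)

# Latency SLI: 95th-percentile response time in milliseconds
latencies = sorted(ms for ms, _ in requests)
p95_latency = quantiles(latencies, n=100)[94]

print(f"availability SLI: {availability:.2%}")   # e.g. 90.00%
print(f"p95 latency SLI:  {p95_latency:.0f} ms")
```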

When developers set up SLIs to measure their service, they typically do so in two stages:

    1. SLIs that will directly impact the customer.
2. SLIs that directly reflect the health of particular services: their availability, latency, and performance.

    Once you have SLIs set up, you move into your SLOs, which are targets against your SLI.

    SLO: “a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.”

Service level objectives become the common language that companies use to set guardrails and incentives and drive high levels of service reliability.

Today, many companies operate in a constantly reactive mode, responding to NPS scores, churn, or incidents. This is an expensive, unsustainable use of time and resources, to say nothing of the potentially irrecoverable damage to customer satisfaction and the business. SLOs give you an objective language and measure for prioritizing reliability work and keeping services healthy proactively.
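As a small sketch of the "SLI ≤ target" structure in the definition above, an SLO check can be as simple as comparing a measured SLI against its target. The numbers here are illustrative only.

```python
# Minimal sketch of evaluating SLIs against SLO targets. Values are illustrative.

def meets_slo(sli_value: float, target: float, lower_is_better: bool = False) -> bool:
    """Return True if the measured SLI satisfies its SLO target."""
    return sli_value <= target if lower_is_better else sli_value >= target

# Availability: higher is better, target 99.5%
print(meets_slo(0.997, 0.995))                            # True
# Latency: lower is better, target of 1000 ms at p95
print(meets_slo(1240.0, 1000.0, lower_is_better=True))    # False
```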


    SLAs: “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”

Service level agreements are set by the business rather than by engineers, SREs, or ops. When an SLO covered by the agreement is breached, the SLA kicks in: it defines the actions taken when the SLO is missed, which often carry financial or contractual consequences.


    How these terms help with reliability: an example case study

Imagine an organization looking to increase reliability. The company has recently begun investigating expensive SLA breaches and wants to know why its reliability is suffering. This organization breaches its SLA for availability almost every month. As it onboards more customers with SLAs, these expenses will grow if it doesn’t meet its performance guarantees.

This fictitious organization is also dealing with low NPS scores. The team is aware of the problem, but NPS scores are a lagging indicator: by the time they drop, customers have already begun to churn. The team meets to discuss what needs to be done, and the first step is breaking down the company’s SLIs.

    Identifying SLIs that matter to the user

    The team knows it needs to examine availability and set SLOs for it, so it begins looking at the user journey. The QA team has already done some documentation, so the team refers to the user journeys outlined there and augments this documentation with their own journeys.

    The team identifies critical points that receive the brunt of complaints. Team members also look into black box monitoring, a tactic that helps identify issues from a user’s perspective. With black box monitoring, the team acts as an external user of the service with no access to the internal monitoring tools. This allows team members to concentrate on a few metrics that directly correlate with user happiness.
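As a rough illustration of black box monitoring, a probe uses only what an external user can observe: whether the request succeeded and how long it took. The URL below is a placeholder, not a real endpoint from the case study.

```python
# Minimal black-box probe sketch: call the service like an external user and
# record only externally observable signals. The URL is a placeholder.
import time
import urllib.request

def probe(url: str, timeout: float = 5.0) -> tuple[bool, float]:
    """Return (success, latency_ms) for one request to the service."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except Exception:
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    return ok, latency_ms

ok, latency_ms = probe("https://example.com/expenses")
print(f"success={ok} latency={latency_ms:.0f} ms")
```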

After looking at the user journey, the team determines that each tab on the expenses feature loads quickly enough on its own, but when someone needs to skim through two or more pages, the wait becomes tedious. So the team decides to also create an SLO for response time at the load balancers.

    Establishing corresponding SLOs

    After the team determines its SLIs, it’s time to set up the SLOs. The team is looking at availability of the site (a common complaint), as well as the latency issue on the expense page. While the team plans to add more SLOs later, these two will serve as the guinea pigs.

    For the latency issue, the team sets an SLO for all pages to load in under 1 second. This faster load time means that users won’t be irritated scrolling through multiple pages. The team then moves on to the availability SLO.

Based on traffic levels, customer usage, and NPS scores, the team determines that its customers are likely to be happy with 99.5% availability. On the other hand, data from previous months suggests that customer satisfaction and usage don’t increase noticeably when uptime exceeds 99.5%. This means there’s no reason, at this point, to optimize for anything higher than 99.5% uptime.

With SLOs in place, the team needs to decide what to do when these targets are missed, which means creating an error budget policy. This policy will detail:

• The acceptable level of failure in the system over a given period of time (the error budget; see the sketch after this list)
    • Alerting and on-call procedures for the service
    • Escalation policies in the event of error budget depletion
• An agreement to halt feature development and focus on reliability when the error budget has been exceeded for a certain amount of time
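For illustration, here is a quick sketch of the first point: turning a request-based SLO into an error budget and tracking how much of it is left. The traffic and failure counts are invented figures.

```python
# Rough sketch: deriving an error budget from a 99.5% success-rate SLO.
# Traffic and failure counts are invented example figures.
slo_target = 0.995                # 99.5% of requests must succeed
monthly_requests = 10_000_000     # hypothetical traffic for the month

error_budget = (1 - slo_target) * monthly_requests   # failures allowed: 50,000
failed_so_far = 32_000                                # hypothetical failures to date
remaining = error_budget - failed_so_far

print(f"error budget: {error_budget:,.0f} failed requests")
print(f"remaining:    {remaining:,.0f} ({remaining / error_budget:.0%} of budget left)")
```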

Once everyone agrees, the SLOs are launched. The team watches carefully and iterates at the monthly error budget meeting. After a few months, the team feels confident enough to add more SLOs.

    Agreeing on SLAs

SLAs are an external commitment and therefore aren’t targeted the same way SLOs are. An SLA is a business agreement with users that dictates a certain level of service. The engineering team is aware of the SLAs but doesn’t set them. Instead, the team sets its SLOs more stringently than the SLAs, giving itself a buffer.

For example, the team’s 99.5% availability SLO means the service can only be down 3.65 hours per month. However, the SLA that the organization signs with users specifies 99% availability, which means the service can be down 7.31 hours per month. That leaves the team a buffer of roughly 3.65 hours per month. Now the team can work on new features with guardrails for reliability in place. The organization benefits from happier users, and the team has the confidence to innovate while remaining reliable.
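As a quick check of those figures, assuming an average month of 730.5 hours (365.25 days divided by 12), the allowed downtime can be computed directly from the availability percentage:

```python
# Quick check of the downtime math above, assuming an average month of
# 365.25 * 24 / 12 = 730.5 hours.
HOURS_PER_MONTH = 365.25 * 24 / 12

def allowed_downtime_hours(availability: float) -> float:
    return (1 - availability) * HOURS_PER_MONTH

slo_downtime = allowed_downtime_hours(0.995)   # 99.5% SLO -> ~3.65 h/month
sla_downtime = allowed_downtime_hours(0.990)   # 99%   SLA -> ~7.3 h/month

print(f"SLO allows {slo_downtime:.2f} h/month of downtime")
print(f"SLA allows {sla_downtime:.2f} h/month of downtime")
print(f"buffer:    {sla_downtime - slo_downtime:.2f} h/month")
```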

    When used together, SLI, SLO, and SLAs are powerful tools that allow you to provide the best for your users. While it can be tricky to get these metrics right, a culture of revision, iteration, and blamelessness will help you achieve your reliability goals.
