Determining Error Budgets and Policies that Work for Your Team
Originally published on Failure is Inevitable.
SLOs are key pillars in organizations’ reliability journeys. But, once you’ve set your SLOs, you need to know what to do with them. If they’re only metrics that you’re paged for once in a blue moon, they’ll become obsolete. To make sure your SLOs stay relevant, determine error budgets and policies for your teams. In this blog, we’ll look at the basics of error budgeting, how to set corresponding policies, and how to operationalize SLOs for the long term.
Error budgeting basics
An error budget is the percentage of wiggle room your SLO allows, and tracking it tells you how much of that room remains. Generally, you’ll measure it over a rolling window rather than a fixed historical period. This keeps the SLO fresh, monitored, and always moving forward. Error budget can be shown as the calculation below:
Error budget = 100% - SLO target
Imagine you’ve set an SLO for 99.5% uptime per month. This means your error budget is 0.5%. For an average month of 730 hours, that is 3.65 hours of downtime per month. If an incident causes a 1.22-hour outage, you’ve lost approximately one third of your error budget for this month.
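The arithmetic above can be reproduced in a few lines. This is a minimal sketch, not a standard tool: the function names are hypothetical, and the 730-hour month (365 × 24 / 12) is an assumption that happens to match the 3.65-hour figure in the example.

```python
# Average month length in hours (assumption: 365 days / 12 months).
HOURS_PER_MONTH = 365 * 24 / 12  # 730.0


def error_budget_hours(slo_percent: float,
                       window_hours: float = HOURS_PER_MONTH) -> float:
    """Allowed downtime, in hours, for an uptime SLO over the window."""
    return (100.0 - slo_percent) / 100.0 * window_hours


def budget_consumed(outage_hours: float, slo_percent: float,
                    window_hours: float = HOURS_PER_MONTH) -> float:
    """Fraction of the error budget used up by an outage."""
    return outage_hours / error_budget_hours(slo_percent, window_hours)


budget = error_budget_hours(99.5)
print(f"Budget: {budget:.2f} hours")                       # Budget: 3.65 hours
print(f"Consumed: {budget_consumed(1.22, 99.5):.0%}")      # Consumed: 33%
```

Swapping in a different window length (for example, 28 days as 672 hours) changes the budget in hours but not the 0.5% figure.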
So, what does this information mean to teams? Your error budget policy will determine this.
Error budget policies
It’s not enough to know what your error budget is. You also need to know what you’ll do in the event of error budget violations. You can do this through an error budget policy. This determines alerting thresholds and actions to take to ensure that error budget depletion is addressed. It will also denote escalation policies as well as the point at which SRE or ops should hand the pager back to the developer if reliability standards are not met.
Alerting: Alert (or pager) fatigue harms even well-seasoned teams' ability to respond to incidents. It sets in when a team receives too many alerts, either because there are too many incidents or because monitoring is picking up on insignificant issues (also known as alert noise). This can lower your team’s cognitive capacity, making incident response more difficult. It can also lead your team to ignore crucial alerts, resulting in major incidents going unresolved or unnoticed.
You’ll want to make sure that your alerting isn’t notifying you every time a small part of your error budget is eaten. After all, this will happen throughout the rolling window. Instead, make sure that alerts are meaningful to your team and indicative of actions you need to take. This is why many teams care more about being notified of the error budget burn rate over a specific time interval than about fixed depletion percentages (e.g., 25%, 50%, or 75%).
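One common way to express burn rate is the observed error rate over an interval divided by the error rate the SLO allows. The sketch below assumes a request-based SLO. The function names and the paging threshold of 2.0 are hypothetical and would need tuning for a real service.

```python
def burn_rate(bad_events: int, total_events: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget would last exactly one full window at this pace;
    2.0 means it would be gone in half the window.
    """
    if slo_percent >= 100.0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    if total_events == 0:
        return 0.0  # no traffic in the interval, so nothing was burned
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate


def should_page(bad_events: int, total_events: int, slo_percent: float,
                threshold: float = 2.0) -> bool:
    """Page only when the interval's burn rate reaches the threshold."""
    return burn_rate(bad_events, total_events, slo_percent) >= threshold


# 20 failed requests out of 1,000 in the last hour against a 99.5% SLO.
print(f"{burn_rate(20, 1000, 99.5):.1f}")
print(should_page(20, 1000, 99.5))
```

Alerting on the rate rather than on each failed request keeps small, expected budget consumption from generating noise.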
To determine if you need to take action for error budget burn, write in stipulations. Stipulations could look something like this: if the error budget % burned ≤ % of rolling window elapsed, no alerting is necessary. After all, a 90% burn for error budget isn’t concerning if you only have 3 hours left in your window and no code pushes.
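The stipulation above is a single comparison. Here is a minimal sketch of it, with hypothetical function names and an assumed 730-hour window for the worked example:

```python
def window_elapsed_pct(hours_left: float, window_hours: float) -> float:
    """Percentage of the window that has already elapsed."""
    return (window_hours - hours_left) / window_hours * 100.0


def needs_alert(budget_burned_pct: float, elapsed_pct: float) -> bool:
    """Alert only when budget is burning faster than time is elapsing."""
    return budget_burned_pct > elapsed_pct


# 90% of the budget burned with 3 hours left in a 730-hour window.
elapsed = window_elapsed_pct(hours_left=3, window_hours=730)
print(needs_alert(90.0, elapsed))  # False

# Half the budget burned only a quarter of the way through the window.
print(needs_alert(50.0, 25.0))     # True
```

A real policy would likely add further conditions, such as whether code pushes are planned for the remainder of the window.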
But, if burn is occurring faster than time is elapsing, you’ll need to know what to do. Who needs to be notified? At what point do you need to halt features to work on reliability? Who should own the product and be on-call for it at this point? Add answers to questions like these to your error budget policy. Google produced an example of what this document looks like. It contains information on:
- Service overview
- Policy goals
- Policy non-goals
- SLO miss, outage, and escalation policies
- Any necessary background information
Handing back the pager: In the example policy above, Google reminds us, “This policy is not intended to serve as a punishment for missing SLOs. Halting change is undesirable; this policy gives teams permission to focus exclusively on reliability when data indicates that reliability is more important than other product features.” If a certain level of reliability is not met and the product is unable to remain within the error budget over a determined period of time, SRE or operations can hand back the pager to the developers.
This is not a punishment. It’s a way to keep dev, SREs and ops all on the same page, and shift quality left into the software lifecycle by incentivizing developer accountability. Quality matters. Developers are held accountable for their code. If it’s not up to par, feature work will halt, reliability work will take center stage, and SRE or ops will hand the pager over to those who write the code. This helps protect SRE and ops from experiencing pager fatigue or spending all their time on reactive work. Error budget policies are an efficient way to keep everyone aligned on what matters most, which is customer happiness.
Building a long-term process for operationalizing SLOs
The process that goes into creating SLOs, especially the people aspect, is critical for consistency and for scaling the practice across your entire organization. To operationalize SLOs, you’ll need to remember a few key things:
- You're not going to get it right the first time and that's okay. You need to have an iterative mindset to get the correct SLOs, thresholds, and teams in place. Patience and persistence are important.
- Review your SLOs on a weekly or bi-weekly cadence. Many of you have internal operational review meetings where you look at your key reliability metrics such as the number of incidents, retrospective completion, and follow-up action items. In that meeting, one of the key things you’ll want to make time for reviewing is your SLO dashboard.
- Review critical upcoming initiatives collaboratively. Determine if any planned updates or pushes are likely to exceed your error budget and plan to prevent this. Are you shipping as safely as possible? Attendees in this meeting should be from product, SRE, the core service engineering teams, and other stakeholders.
Once you’ve got these basics down, you can begin to expand your SLO practices.
Advanced SLO practices
Here are some additional, more advanced SLO practices that you can start using once you’ve found success with the basics:
- Composite SLOs: Combine two or more SLOs from different services to represent an end-to-end product view of reliability. This could include an SLO containing both availability and latency thresholds.
- Treating SLO violations as incidents: How do you treat a violation as an incident, and thread that into your incident management process? When we violate our SLO, we are affecting our users and customers. Those issues must be treated as incidents. Be sure to define the right severity levels for SLO breaches.
- Giving back error budget: You may have maintenance windows or services that must be unavailable in certain time periods. That may be normal, and will consume the error budget. You can give that error budget back, but make sure you document the reason why.
- Correlating changes to SLO: SLOs are not like diamonds; they're not going to be there forever. Ask yourself, “Are these still valid?” Your organization, your teams, and your product are always evolving and changing. Why should your SLOs be static?
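For the practice of giving back error budget, one option is to keep a small ledger that refuses any credit without a documented reason. This is only a sketch under assumptions: the `ErrorBudget` class, its fields, and the 219-minute budget (3.65 hours) are hypothetical and not part of any particular SLO tool.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorBudget:
    total_minutes: float
    consumed_minutes: float = 0.0
    # Audit trail of (minutes, reason) for every credit given back.
    credits: list = field(default_factory=list)

    def consume(self, minutes: float) -> None:
        """Record downtime against the budget."""
        self.consumed_minutes += minutes

    def give_back(self, minutes: float, reason: str) -> None:
        """Return budget, but only with a documented reason."""
        if not reason.strip():
            raise ValueError("a documented reason is required")
        if minutes <= 0:
            raise ValueError("minutes must be positive")
        # Never credit more than was actually consumed.
        minutes = min(minutes, self.consumed_minutes)
        self.consumed_minutes -= minutes
        self.credits.append((minutes, reason))

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - self.consumed_minutes


ledger = ErrorBudget(total_minutes=219.0)
ledger.consume(60.0)  # planned maintenance counted as downtime
ledger.give_back(60.0, "Planned maintenance window, announced in advance")
print(ledger.remaining_minutes)  # 219.0
print(ledger.credits)
```

Keeping the reason next to each credit makes the adjustments easy to audit in the regular SLO review.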
Maybe you’ll be ready to take these advanced steps in a few months. Maybe it will take a few years. No organization’s SLO journey looks the same. The important thing to remember is that iteration, alignment, and a blameless culture are what’s core to your SRE practice. SLOs and error budgets are only components.