Little Known Ways to Better Use Your Error Budgets
Originally published on Failure is Inevitable.
One of the most versatile and foundational SRE tools is the SLO, or service level objective. The SLO is a threshold set for key reliability metrics. When incidents push the metric over the threshold, a response launches to prevent further damage. Conversely, as long as you meet your SLO, you can continue to ship new code. The space you have before you breach this threshold is the error budget. When evaluating new developments, you can judge if the error budget can accommodate the potential risk of unreliability.
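As a rough illustration of the arithmetic, the sketch below treats the SLO as a target success ratio over a window of events. The function names and the `can_ship` check are hypothetical helpers invented for this example, not part of any particular tool.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of failed events the SLO tolerates in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget(slo_target, total_events)
    return (budget - failed_events) / budget


def can_ship(slo_target: float, total_events: int, failed_events: int,
             expected_risk: float) -> bool:
    """Assumed policy: ship only if the change's estimated risk, expressed as a
    fraction of the whole budget, fits in what is left."""
    return expected_risk <= budget_remaining(slo_target, total_events, failed_events)


# A 99.9% SLO over 1,000,000 requests tolerates roughly 1,000 failures.
# With 250 failures so far, about three quarters of the budget remains.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

How the risk of a release is estimated is left open here; in practice it might come from past incidents tied to similar changes.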
We generally think of the error budget as a tool for developers. It helps them understand tradeoffs between development velocity and reliability. But error budgets can be helpful to many roles throughout the organization. In this blog post, we’ll look at how error budgets can help cross-functional teams across the organization such as QA, legal, executives, and more. We’ll also look at ways engineers can use error budgets beyond development planning.
Legal teams can use error budgets as early warnings
An unintended consequence of too much unreliability is an SLA violation. SLAs, or service level agreements, are legal contracts between the organization and its clients. They guarantee certain standards for the users of the service. In the event of an SLA breach, the organization could be liable to pay fees, or the contract could be terminated early.
SLOs safeguard SLAs because they are typically set more strictly than the SLA itself, so an SLO breach triggers a response before an SLA violation occurs. This gives engineers a chance to respond. Ideally, these responses will suffice to prevent an SLA violation. Failure is inevitable, however, and legal teams have to prepare for such possibilities.
The error budget provides an early warning that there may be a risk of an SLA violation. Legal teams can look at the rate of error budget depletion to see when a violation may be imminent. This gives them a timeline to work with, so they can prepare proactive responses and measures accordingly. Legal responses to SLA violations may also include hiring consultants or other expenses. Knowing when such investments are unnecessary helps them avoid overspending. The error budget gives legal teams a clear view of the organization’s level of risk exposure, and confidence in their own ability to prepare for it.
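One way to turn the depletion rate into a timeline is a simple linear projection. The sketch below assumes periodic samples of the remaining budget, expressed as a fraction between 0 and 1. The sample format and function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Each sample is (hours_elapsed, budget_remaining_fraction), ordered by time.
Sample = Tuple[float, float]


def burn_rate_per_hour(samples: List[Sample]) -> float:
    """Average fraction of the budget burned per hour between first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (r0 - r1) / (t1 - t0)


def hours_until_exhausted(samples: List[Sample]) -> Optional[float]:
    """Linear projection of when the budget hits zero; None if it is not burning."""
    rate = burn_rate_per_hour(samples)
    if rate <= 0:
        return None
    return samples[-1][1] / rate


# Half the budget burned over two days: at this pace, two more days remain.
history = [(0.0, 1.0), (24.0, 0.75), (48.0, 0.5)]
print(hours_until_exhausted(history))
```

A straight-line projection is the simplest possible model. Real burn tends to come in bursts around incidents, so the estimate is better read as a rough planning horizon than a deadline.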
Executives can use error budgets to take the pulse of development
Understanding the entirety of your organization’s development landscape can be a daunting task. Often many projects and sprints are happening simultaneously. Regardless, executives need to make decisions based on the overall trajectory of development. How can you consolidate many different statuses into something actionable? Error budgets provide a way.
SLOs and error budgets are built around SLIs, or service level indicators. SLIs measure the service areas that matter most to customer satisfaction. This makes them relevant to the executive team’s strategic decisions.
Executive teams can read the error budget at an organization-wide, strategic level. They can easily take the pulse of the most impactful service areas without needing to understand the details behind them. The decisions they make based on these metrics, however, aren’t disconnected from the practical work that needs to be done. That’s because SLIs are built from a collection of low-level, monitorable metrics. This provides a common language for executives and engineers. Strategic decisions can be translated into coding changes through the error budget.
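To make the link between low-level metrics and an SLI concrete, here is a minimal sketch that rolls two per-request measurements into a single ratio. The `Request` type, the choice of status code and latency as inputs, and the 300 ms threshold are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    status_code: int
    latency_ms: float


def is_good(request: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A request counts as good if it did not fail server-side and was fast enough."""
    return request.status_code < 500 and request.latency_ms <= latency_threshold_ms


def sli(requests: List[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of good requests; an empty window is treated as fully healthy."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if is_good(r, latency_threshold_ms))
    return good / len(requests)


window = [
    Request(200, 120.0),
    Request(200, 280.0),
    Request(500, 90.0),   # server error
    Request(200, 250.0),
]
print(sli(window))
```

Engineers work with the status codes and latencies, while executives see only the resulting ratio and how it compares to the SLO.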
Error budgets and SLOs elevate the role of QA
When testing new code before deployment, it can be tempting to hope QA teams will be limitlessly thorough. However, there will always be bugs that slip through. QA teams want their tests to focus on the possibilities that will be the most impactful. Error budgets can guide QA teams to find where these areas are.
Error budgets tell the entire story of the impact of a bug. You can see how many users were impacted and how much error budget was burned. You can also see how long it takes for the incident to be resolved and the burn to stop. By looking at patterns, you can see the average impact different types of bugs cause. QA teams can use this data to develop tests that focus on the bugs that burn the most error budget (meaning they have the largest customer impact).
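As a sketch of that pattern analysis, the snippet below averages the budget burned per bug type and ranks the types from most to least costly. The `Incident` record and the bug type labels are hypothetical. In practice this data would come from your incident tracker.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Incident:
    bug_type: str
    budget_burned: float  # fraction of the window's error budget consumed


def average_burn_by_type(incidents: List[Incident]) -> List[Tuple[str, float]]:
    """Average budget burned per bug type, sorted from highest to lowest."""
    burns: Dict[str, List[float]] = defaultdict(list)
    for incident in incidents:
        burns[incident.bug_type].append(incident.budget_burned)
    averages = {bug_type: sum(v) / len(v) for bug_type, v in burns.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Illustrative numbers only.
past_incidents = [
    Incident("config_error", 0.25),
    Incident("config_error", 0.75),
    Incident("ui_regression", 0.125),
]
for bug_type, avg in average_burn_by_type(past_incidents):
    print(f"{bug_type}: {avg:.1%} of budget per incident")
```

The bug types at the top of the list are candidates for the most test coverage. Weighting by frequency as well as average burn would be a reasonable refinement.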
As QA tests become more aligned with the error budget, QA expertise becomes less siloed. Development can integrate QA’s testing into their own production standards. Testing is thus refactored into the development process. This reduces the toil of manual testing and allows QA to take a more strategic role, designing tests to best meet these needs.
Error budgets provide objectivity for experimentation
An important component of reliability is experimentation. By understanding the limits of your system, you know better where to improve and how to respond when things go wrong. It can be difficult to understand the relative impact of different experiments. For example, consider these types of tests:
- A chaos engineering experiment where a major service outage is simulated
- A load test where multiple simulated requests are sent simultaneously
- An experiment where requests are routed to servers based on different rules
For each test, many variables can be changed, creating diverse results. Pinning down the importance of each change can be difficult. Error budgets provide a single, objective metric that can be compared across different tests and experimentation scenarios.
Moreover, error budgets can be used to compare completely different types of scenarios. For example, this could allow you to understand the severity of a single server failing compared to an incident causing all pages to load 10% slower. This knowledge can help you better prioritize developing for reliability. Error budgets reflect the areas of highest customer impact, so you’ll know your tests are focused on the right areas.
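A comparison like this can be sketched by expressing each scenario as a count of "bad" events and dividing by the number the SLO allows. The scenario names and event counts below are made up for illustration. For a slowdown, a bad event is assumed to be a request that exceeds a latency threshold.

```python
from typing import Dict, List, Tuple


def budget_consumed(bad_events: int, slo_target: float, total_events: int) -> float:
    """Fraction of the error budget a scenario used up."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1.0 - slo_target) * total_events
    return bad_events / allowed


def rank_scenarios(bad_events_by_scenario: Dict[str, int], slo_target: float,
                   total_events: int) -> List[Tuple[str, float]]:
    """Scenarios sorted from most to least budget consumed."""
    consumed = {
        name: budget_consumed(bad, slo_target, total_events)
        for name, bad in bad_events_by_scenario.items()
    }
    return sorted(consumed.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from two experiments against a 99.9% SLO.
results = {
    "single_server_failure": 400,  # requests that errored during failover
    "pages_10pct_slower": 650,     # requests pushed past the latency threshold
}
for name, fraction in rank_scenarios(results, 0.999, 1_000_000):
    print(f"{name}: {fraction:.0%} of budget")
```

With these invented numbers, the across-the-board slowdown costs more budget than losing a server, even though the server failure may feel more dramatic.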
Blameless can help you get the most out of your SLOs and error budgets. Our out-of-the-box dashboards and checklists keep all stakeholders on the same page, guiding you through the steps of not just setting up, but truly operationalizing SLIs and SLOs. To see how, check out our webinar on SLOs.