How SLOs Help Your Team with Service Ownership
Originally published on Failure is Inevitable.
Service ownership is becoming a best practice for teams looking to innovate while maintaining the level of reliability that customers expect. Service ownership means seeing the service through its entire lifecycle. In short, it means you build it, you run it. You’ll be responsible for the service’s security, reliability, performance, and quality.
This doesn’t mean you won’t have help from SREs to optimize or automate toil. It does mean that, as a developer, you need to build with quality in mind as you’ll be on call if your code breaks.
Service ownership benefits both the team and the organization. It is crucial because, as Instacart’s VP of Infrastructure Dustin Pearce notes, “The decoupling of the people who are creating and the people who are owning creates problems, both in the software and in the process. The quality goes down.”
But it can be difficult to know where to begin, which is where service level objectives (SLOs) can make a big difference. SLOs can help teams by:
- Abstracting complexity and creating a shared language around service metrics (as in this case study from Twitter)
- Helping teams prioritize key user journeys
- Unifying incentives across stakeholders (product, engineering, ops)
- Providing data to justify where to focus engineering efforts
Using metrics to learn about service health
Teams that are new to service ownership may feel daunted, as it comes with significant responsibilities. On-call burden can become a nightmare if teams are constantly paged for low-signal alerts or false positives. The first step, then, is to identify what a healthy system looks like, which keeps the team oriented around the signals that matter. Because engineers aren’t in direct contact with customers, it can be difficult for them to gauge the customer experience. SLOs can provide crucial context for this.
As Dustin said, “Looking for those reliability or service level indicators, and defining some objectives for how you measure reliability or availability or correctness, has long been around. It's not like we're inventing signals here. The key is how do we pull those signals together in a cohesive view.”
An SLO, in other words, is a temperature check for your customers’ happiness, to help prevent the gnarly consequences of breaching your SLA. So, you should make sure that the metrics you’re measuring matter to your users.
Take the time to walk through your user journey and determine SLIs your customers care about. Then, optimize for those pain points. Checkout page loading too slowly? Service not available during peak hours of demand? Focus on those metrics for your SLOs, and set alerts to make sure that potential error budget violations bubble up early.
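As a rough illustration of turning a user journey into SLIs, the sketch below computes an availability SLI and a latency SLI for a checkout page and compares them to SLO targets. Everything in it is an assumption made for the example: the `Request` record, the sample data, the 500 ms threshold, and the targets are hypothetical, not values from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One user request to the checkout page (hypothetical record shape)."""
    ok: bool            # did the request succeed from the user's point of view?
    latency_ms: float   # how long the user waited


def availability_sli(requests):
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests served within threshold_ms."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def breached(sli, target):
    """True when the measured SLI falls below the SLO target."""
    return sli < target


# Made-up sample traffic; a real system would read this from its metrics store.
checkout_requests = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=180),
    Request(ok=True, latency_ms=950),
    Request(ok=False, latency_ms=3000),
]

availability = availability_sli(checkout_requests)
fast_enough = latency_sli(checkout_requests, threshold_ms=500)

# Assumed targets: 99.9% of requests succeed, 95% finish within 500 ms.
if breached(availability, 0.999) or breached(fast_enough, 0.95):
    print("Checkout SLO at risk: raise an alert")
```

Both SLIs are expressed as a ratio of good events to total events, which makes it straightforward to compare them against a percentage target and to alert when they dip below it.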
Unifying incentives and getting buy-in
Another benefit of service ownership is getting all stakeholders on the same page. For this to happen, everyone needs to be on board, from developers to C-level executives. All parties need to agree on the SLOs, escalation policies, and proposed customer impact. Reaching that agreement is challenging, but it will reap outsized benefits if done right. The agreement becomes your error budget policy: a process that codifies what happens when SLOs are breached.
This can’t be a mandate from execs that engineering must obey. While successful implementation of SLOs requires cultural transformation and top-down support, the creation of SLOs should stem from grassroots efforts with leadership allowing engineers to drive adoption. In an article on Twitter’s reliability journey, SREs Brian, Zac, and JP point to leadership support as essential to making reliability a first-class priority, which helped catalyze initiatives and hiring to support reliability.
To get alignment, hold team meetings first to determine what matters most to the customer and set SLOs. Then, talk to management. Share how SLOs can help save time and avoid incidents. Next, take it to the executive level and explain the positive business impact SLOs can offer.
Once everyone agrees on metrics that can be turned into concrete goals, teams will feel more confident about service ownership.
Balancing innovation and reliability
SLOs also help teams point to data to make difficult but essential tradeoffs between innovation and reliability. In today’s fast-paced SaaS world, that balance is crucial. As Dustin said, “You've got to move. You've got to innovate. You've got to capture marketing growth fast, because if you don't, somebody's going to blow by you. It's the way it is. The machinery and the opportunity to start a company and go far with three human beings and a laptop is absurd.”
If you fail to innovate quickly enough, your customers will move to shinier products. But if you run too fast and pile on tech debt, you could face a code freeze, giving competitors the opportunity to surpass you. To date, decisions about where to invest have been largely political, or simply guesswork. SLOs make this balance easier to strike.
SLOs and their corresponding error budgets give teams guidelines for when to ship. If your error budget has lots of wiggle room at the end of the window, ship away! If you're breaching SLOs too often, it's time to slow down and shore up reliability. After all, the most important feature is knowing ‘it just works.’ This helps keep the product roadmap centered around the needs of the customer.
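The arithmetic behind that guideline can be sketched in a few lines. This is a minimal example under stated assumptions: the function names, the 25% caution threshold, and the three-way "ship / slow down / freeze" outcome are hypothetical choices for illustration, and a real error budget policy would be whatever your stakeholders agreed on.

```python
def error_budget_remaining(slo_target, total_events, bad_events):
    """Fraction of the error budget still unspent in the current window.

    The budget is the number of bad events the SLO tolerates:
    (1 - slo_target) * total_events. A negative result means the
    budget is overspent and the SLO has been breached.
    """
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad <= 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0
    return 1 - bad_events / allowed_bad


def release_guidance(remaining, caution_threshold=0.25):
    """Map remaining budget to a suggested release posture (assumed policy)."""
    if remaining <= 0:
        return "freeze"      # budget exhausted: focus on reliability work
    if remaining < caution_threshold:
        return "slow down"   # budget nearly spent: ship carefully
    return "ship"            # plenty of wiggle room


# Hypothetical window: 99.9% target, one million requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget left: {release_guidance(remaining)}")
```

With a 99.9% target over one million requests, the budget is roughly 1,000 failed requests, so 250 failures leaves about three quarters of it unspent.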
SLOs and service ownership unite
While service ownership can be a daunting task, SLOs are a good place to start.
Dustin had some inspiring words of advice to keep service ownership teams ahead of the game: “The idea that you are vigorously prioritizing and always challenging your assumptions and never sitting on your laurels is important. Most leadership teams all agree that is the kind of thing that connects us.”
With SLOs, you’ll be able to run fast without sacrificing quality, and ensure the metrics you monitor both reflect customer needs and improve the signal-to-noise ratio for your on-call team. All these things are hallmarks of exceptional service ownership.
Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.