Engineers, Stop Hoarding Your Metrics

Like The Hobbit’s dragon Smaug lying on his pile of gold, never spending and only hoarding, many of us stockpile pretty, feel-good, but useless metrics that never make a difference. Worse, they can cloud your ability to get the context and clarity you need from your metrics. In this blog post, we'll help you kick the habit and move away from Smaug-ing up all your metrics.

Originally published on Failure is Inevitable.

Metrics are the golden ticket to knowing what’s going on with your system… or so everyone thinks. But there can be too much of a good thing. Are your metrics really doing you any favors? Are they letting you see into what your customers truly want from you? If not, you might have a problem. You might be fetishizing your metrics. The good news is you’re definitely not alone.

Dragon defenders of metrics

One of the symptoms of dragon psychology is defensiveness of your hoard. Smaug, who never spends or uses any of his treasure, sits defending it against thieves. For those of us who write and operate software systems, this fault is echoed when someone challenges the pages and pages of metrics and dashboards that we tend to stockpile.

Have you been challenged to justify why you’re watching a metric? How did you respond?

The unfortunate truth is that observability of software systems is often done as an afterthought, and many operators inherit predetermined metrics that past operators deemed “useful.” Even for the enterprising Site Reliability Engineer, it can be difficult to determine which metrics really move the needle in terms of increasing business value. This can make leadership hesitant to change current metrics. After all, what dragon likes moving lairs?

Site Reliability Engineering practices have an answer to this issue. A service level indicator (SLI) synthesizes many metrics and generates a single value which is equated to customer happiness. SLIs can demonstrate user happiness by following critical user journeys and pinpointing areas where customers will expect the service to perform in a certain way.

When a user clicks on the blog page of a website, they expect many things. As Alex Hidalgo notes in his book, Implementing Service Level Objectives, here are some of the things you know you should care about:

Is the service up?
Is the service available?
Is the service responsive?
Are there enough good responses when compared to errors?
Are the responses in the correct data format?
Are the payloads of the responses the data actually being requested?

You may be reporting on these already as individual metrics. If you’re monitoring them individually and then correlating them by hand to understand whether your system is performing, you’re adding cognitive overhead to your job. SLIs, by nature, synthesize all of these individual aspects of user expectations into a single binary: this request either met expectations, or it didn’t.
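To make the "single binary" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Response` fields, the 500 ms latency threshold, and the expected content type are hypothetical stand-ins for whatever your own user journey requires.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # Hypothetical fields; use whatever your own telemetry records.
    status_code: int
    latency_ms: float
    content_type: str
    payload_matches_request: bool


def is_good(r, max_latency_ms=500.0, expected_content_type="text/html"):
    """Collapse the individual checks into one yes/no answer per request."""
    return (
        r.status_code < 500                           # no server-side error
        and r.latency_ms <= max_latency_ms            # responsive enough
        and r.content_type == expected_content_type   # correct data format
        and r.payload_matches_request                 # the data actually requested
    )


def sli(responses):
    """Fraction of requests that met user expectations."""
    if not responses:
        raise ValueError("no requests observed in this window")
    good = sum(1 for r in responses if is_good(r))
    return good / len(responses)
```

The point of the design is that nobody has to correlate four dashboards: each request is simply good or bad, and the SLI is the ratio of good requests to all requests.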

However, if these SLIs only exist in the minds of the engineers who create them, they’re as useless as a ruby necklace to a dragon. A dragon can look at it, but can’t wear it. It’s not functional. You’ll need to document your SLIs in a way that all stakeholders (e.g. product, operations, and QA) can intuitively understand, often in a publicly viewable document. The goal of defining the SLI is to identify a clear way of explaining user expectations to everyone, which can then be translated into a meaningful expression as code to be measured for your system and your customers.

What do dragons need gold for, anyway?

When you have seemingly endless riches, what can you use them for? Hoarding them for a rainy day gives no pleasure to anyone. Even if you move away from metrics, you can still be a hoarder. Changing out metrics for SLIs doesn’t itself make a difference. Instead, you need to set a target or threshold for reliability that draws the line between a happy customer base and an unhappy one.

This is powerful because having a definition that all stakeholders agree upon gives teams the opportunity to break things, experiment, and have more room to play. Not every failed request or error is a customer-impacting incident. And customer-impacting incidents are not all of the same severity or classification. Acknowledging this helps set boundaries to say that, even when your system experiences failures, you can still focus on new features (up to the user experience threshold that has been agreed upon).

That’s why service level objectives (SLOs) are so important. They’re a governance (i.e. decision-making) tool. With SLO-based alerting, you can set your alerting system to only page you for incidents that affect your customers in a meaningful way. If your customers are unlikely to even notice that the system is experiencing an issue, it’s probably not worth coming to the office on the weekend anyway.
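As a rough illustration of SLO-based alerting, the Python sketch below pages only when the measured SLI for a window falls below the agreed target. The function name, the 99.9% default target, and the request counts are all hypothetical. This is a deliberately simplified check, not a description of how any particular alerting product works.

```python
def should_page(good_requests, total_requests, slo_target=0.999):
    """Page only when the share of good requests drops below the SLO target."""
    if total_requests == 0:
        return False  # no traffic means no customers were affected
    return good_requests / total_requests < slo_target


# A handful of failures that stays within the target does not wake anyone up.
print(should_page(good_requests=9_995, total_requests=10_000))
# A failure rate that breaches the target does.
print(should_page(good_requests=9_980, total_requests=10_000))
```

The design choice here is that the alert condition is phrased in terms of customer experience (good versus total requests) rather than any single underlying metric.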

To aid with governance, SLOs have an alternative formulation known as an error budget. Instead of thinking of an SLO as a percentage, such as 82% or 99.99%, you can recalculate it as an error budget, often reported as the amount of time the service can be unreliable in aggregate over a certain window.

For example: An SLO of 82% leaves 18% of a 28-day window, or 5.04 days, as error budget. An SLO of 99.99% leaves 0.01% of the same window, or 4.032 minutes.
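The arithmetic behind these figures is simply the window length multiplied by the allowed failure fraction. A small Python sketch, assuming a time-based budget and a hypothetical helper named `error_budget`:

```python
from datetime import timedelta


def error_budget(slo_percent, window=timedelta(days=28)):
    """Time the service may be unreliable, in aggregate, over the window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("SLO must be a percentage between 0 and 100")
    allowed_failure_fraction = 1 - slo_percent / 100
    return window * allowed_failure_fraction


# Budgets for the two SLOs discussed above, in days and in minutes.
for slo in (82, 99.99):
    budget = error_budget(slo)
    days = budget.total_seconds() / 86_400
    minutes = budget.total_seconds() / 60
    print(f"SLO {slo}%: {days:.4f} days ({minutes:.3f} minutes)")
```

Changing the `window` argument shows how the same percentage translates into a very different amount of allowed downtime over, say, 7 or 90 days.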

During planning time, pages of contextless metrics don’t help you make quicker and better decisions about what deserves the team’s effort and why. By contrast, knowing that your SLOs are regularly being violated gives you the impetus to focus on reliability. With SLOs and error budgets as a conversation-starter between stakeholders, your team can better quantify what resources you’re spending, and why they matter.

“Un-hoarding” your metrics

Long story short, don’t be a dragon. Hoarding a bunch of meaningless metrics doesn’t make you safer or your team more information-rich. Instead, you wind up having to context switch, and it’s difficult to explain to anyone why what you’re doing matters. Most importantly, while those metrics probably make you and your leaders feel good, your customers won’t care.

SLIs help you gain insight into what the most important metrics are. SLOs let you know whether you’re measuring up to your customers’ expectations. And error budgets let you know how much unreliability you’re allowed before your customers begin to be frustrated. Keep it simple, clear, and Smaug-free.

Blameless

Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.