5 Best Practices on Nailing Incident Retrospectives

Reading about postmortem best practices can sometimes be quite different from seeing them in action. Postmortems are like snowflakes; no two will ever look the same. There isn’t a definitive template for success that will work in every situation, but there are some practices and procedures when writing postmortems that can help. Here are five practices that can boost the effectiveness of your postmortems, with examples of postmortems or procedures that demonstrate these methods.

Use visuals in your postmortems

As Steve McGhee says, “A ‘what happened’ narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.” Graphs give an engineer a quick yet in-depth picture of what was happening during the incident, whether they read the postmortem days, weeks, or even years later.

In Cloudflare’s postmortem of an incident that occurred on July 2, 2019, the authors use visuals to help readers understand both the background of the incident and what happened when a bad update caused an outage of its HTTP/HTTPS services. The postmortem reads, “Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.” A graph showing the CPU usage during the incident follows:

[Figure: Cloudflare CPU usage graph during the incident]

Visuals embedded in the postmortem benefit readers in two major ways. First, they allow new hires to visualize the problem and feel like they’re working through the incident with the engineers who mitigated it. Second, they allow engineers who may deal with a similar issue to quickly find the information they’re looking for and share it easily with other team members.

Be a historian

Timelines are a valuable part of a postmortem, but there’s an art to crafting them. As Steve McGhee says, “There is little utility to including the entire chat log of an incident. Instead, consider illustrating a timeline of the important inflection points (e.g. actions that turned the situation around). This may prove to be very helpful for troubleshooting future incidents.” Postmortem timelines require a careful balance of information. Too much, and the postmortem becomes cluttered and slow to sift through. Too little, and it becomes vague.

Twilio’s “Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause” strikes this balance exceptionally well, and its main strength is clarity. In this postmortem, the authors separate the timeline from the root cause analysis. In the entry for 1:35 AM July 18, the timeline note simply reads, “We experienced a loss of network connectivity between all of our billing redis-slaves and our redis-master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time.” In the root cause analysis, however, the authors expand on this timestamp with pertinent background information, explaining that the loss of network connectivity “caused all redis-slaves to reconnect and request full synchronization with the master at the same time,” and how this affected the redis-master.

Though the timeline entry is half the word count of the explanation in the analysis, it still relays the most crucial information. The benefit of this is speed. If the billing redis-slaves simultaneously disconnect again, an engineer might want to look back on this postmortem as a clue. When postmortem timelines are streamlined to include only the most important moments, while all background information is included in the root cause analysis, an engineer can clearly see what actions they should consider taking next without having to use precious time sifting through clutter.
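As a rough illustration of condensing a raw incident log into a timeline of inflection points, here is a minimal Python sketch. The `Event` class, the `inflection` flag, and the sample entries are all hypothetical and invented for this example; they are not taken from Twilio’s postmortem or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    at: datetime      # when the event happened
    note: str         # what happened, in one sentence
    inflection: bool  # True if this changed the course of the incident


def build_timeline(events):
    """Return only the inflection points, oldest first, as printable lines."""
    key_events = sorted((e for e in events if e.inflection), key=lambda e: e.at)
    return [f"{e.at:%Y-%m-%d %H:%M} {e.note}" for e in key_events]


# Hypothetical incident log; every entry is invented for illustration.
log = [
    Event(datetime(2024, 1, 5, 1, 42), "Rolled back the last deploy", True),
    Event(datetime(2024, 1, 5, 1, 35), "Replicas lost connectivity to the primary", True),
    Event(datetime(2024, 1, 5, 1, 38), "Anyone else seeing alerts?", False),
]
timeline = build_timeline(log)
```

In this sketch the chat message is dropped and the two remaining entries come out in chronological order. Deciding which events count as inflection points is still a human judgment; the code only keeps that decision separate from the raw log.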

Publish promptly

As the Google SRE book says, “A prompt postmortem tends to be more accurate because information is fresh in the contributors’ minds. The people affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!” Promptness has two main benefits. First, it lets the authors report on the incident while the details are still fresh. Second, it reassures affected customers, which reduces the risk of churn.

Google certainly practices what it preaches, as do other best-in-class companies such as Uber. These companies often publish postmortems within 48 hours, a discipline that leads to more accurate postmortems. After two months, will your team remember exactly what happened during an incident, even after looking at the logs? It’s not likely. Publishing postmortems within two days of mitigation ensures the information is fresher and more useful for teaching, onboarding, and reference in the case of similar incidents.

Furthermore, prompt postmortems are crucial to foster a culture of transparency that maintains customer trust. Customers feel upset if an incident affects them. In the case of an incident involving critical features, billing, or data breaches, customers will often be on edge waiting for an explanation. Some of your customers may even have SLAs set for the promptness of a postmortem detailing the incident. Waiting to publish only increases customer dissatisfaction. However, if the incident is promptly explained via a detailed and accurate postmortem, customers don’t have to linger in anxiety.
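If your team wants to track a publishing target like this, the arithmetic is simple enough to automate. The following is a minimal sketch that assumes a 48-hour window measured from the moment of mitigation; the names `PUBLISH_WINDOW`, `publish_deadline`, and `is_overdue` are hypothetical, and the window should be replaced with whatever your own SLAs require.

```python
from datetime import datetime, timedelta

# Assumed target, based on the 48-hour figure discussed above.
PUBLISH_WINDOW = timedelta(hours=48)


def publish_deadline(mitigated_at):
    """Return the latest time the postmortem should be published."""
    return mitigated_at + PUBLISH_WINDOW


def is_overdue(mitigated_at, now):
    """Return True if the publishing window has already closed."""
    return now > publish_deadline(mitigated_at)


# Hypothetical mitigation time, invented for illustration.
mitigated = datetime(2024, 1, 5, 3, 0)
deadline = publish_deadline(mitigated)
```

A check like `is_overdue(mitigated, datetime.now())` could feed a reminder or a dashboard, though how you surface it depends on your tooling.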

Be blameless

We commonly refer to blameless postmortems when talking about best practices. However, what does blameless culture actually look like? When writing postmortems, there are 3 important things to keep in mind to promote blamelessness.

People are not points of failure. Pinning an incident on one person or a group of people is counterproductive. It creates an environment where people are afraid to take risks, innovate, and solve problems. This leads to stagnation and avoidance.
Everyone on the team is working with good intentions. People make mistakes. It’s extremely rare for a team member to cause problems maliciously. Everyone is simply doing what makes the most sense to them at the time in order to be helpful.
Failure will happen. There’s no way around it. However, by having a good incident resolution and postmortem practice in place, failure can actually be a benefit to your team, as it uncovers areas to focus on to improve resiliency. As long as you learn from an incident, you’ve made progress.

Many teams choose to have a meeting after an incident to talk through what happened. Etsy created an introduction to this meeting that voices the 3 points above for all attendees. Etsy’s Debriefing Facilitation Guide states, “The goal for our time together today is to recreate the event, talking through what happened for each person at each stage in order to create as robust a portrait as possible of what happened, and what the circumstances in play were at each juncture (when decisions were made, and actions were taken) that made it make sense for people to do what they did in the moment. If one of you gains an insight into the complexity of another person’s role, this was an hour well spent.”

Sentry’s postmortem from a security incident that occurred on July 12, 2016 demonstrates this well. First, the postmortem uses the collective “we” to avoid naming people as problems. Additionally, it states, “It’s been a valuable experience for our product team, albeit one we wish we could have avoided.” The point here is that this was a learning experience. Failure happened and will happen again. Sure, incidents are painful, but they’re one of the best ways to learn and become better.

Tell a story

An incident is a story. To tell a story well, many components must work together.

Without sufficient background knowledge, this story loses depth and context.
Without a plan to rectify outstanding action items, the story loses a resolution.
Without a timeline dictating what happened during an incident, the story loses its plot.

Make sure that your postmortems have all the necessary parts to create a compelling and helpful narrative.
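One lightweight way to check for these parts is to scan a draft for the headings you expect. The sketch below assumes each section starts with a heading on its own line and that the headings are named "Background", "Timeline", and "Action Items"; both the heading names and the `missing_sections` helper are hypothetical, so adapt them to your own template.

```python
# Assumed section names; change these to match your own postmortem template.
REQUIRED_SECTIONS = ("background", "timeline", "action items")


def missing_sections(postmortem_text):
    """Return the required sections that have no heading line in the draft."""
    headings = {line.strip().lower() for line in postmortem_text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s not in headings]


# Hypothetical draft, invented for illustration.
draft = """Background
Build workers are created and cleaned up by separate jobs.

Timeline
01:35 Queue times begin to climb.
"""
gaps = missing_sections(draft)
```

Here the draft has background and a timeline but no action items, so the story has no resolution yet. A check like this only confirms that a section exists; whether it is compelling remains up to the authors and reviewers.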

In Travis CI’s postmortem on high queue times on OSX builds, the author begins by giving an overview of the incident itself. Next comes the background, whose relevance to the incident is made explicit: “Understanding this separation of the creation/build run and the cleanup parts of the life-cycle becomes important in understanding what contributed to this incident.”

After the background, we get into the incident itself. The author walks us step by step through the timeline, using timestamps to show how long each stage lasted. After sharing how the incident was mitigated, the author lists three main objectives the team plans to work on to strengthen its infrastructure. The story closes with an excellent, blameless summary, which includes, “We always use problems like these as an opportunity for us to improve, and this will be no exception.”

By learning from example, taking the best parts of what others do, and applying them to your organizational context, you and your team can write better postmortems for each incident. Postmortems shouldn’t be done simply as a checkbox item, but rather as a way to catalyze introspection and action to prevent further incidents. Again, there’s no one-size-fits-all approach, but your team can apply any one (or all) of the above practices starting today.

If you want more reading, check out:

This example postmortem from Google
Building Reliability Through Culture with Veteran Google SRE, Steve McGhee

Written by: Hannah Culver

Published by

Blameless

Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.