Improving Postmortems from Chores to Masterclass with Paul Osman
Originally published on Failure is Inevitable.
Paul Osman: I lead the SRE team at Under Armour. Who here knows about Under Armour as a tech company? Does anybody think about Under Armour as a tech company? Under Armour makes athletic attire, shirts and shoes. We are also the company that owns MapMyFitness, MyFitnessPal and Endomondo, which are all fitness trackers that our customers use to either keep track of nutrition goals or fitness goals. That's Under Armour, and specifically that's actually my team. We work on the reliability of those consumer applications.
I'm going to be talking today about postmortems and the postmortem process. First, I'm going to zoom out a little bit and talk about incident analysis, which is actually a much bigger field. Incident analysis is the study of how to actually analyze what happened during events that we can learn from. I'm going to try to give some lessons that we've learned doing incident analysis.
Let's talk about postmortems. The term comes from medicine and the actual definition is an examination of a dead body to determine the cause of death. That was the first thing that came up when I Googled it. That's awful. I'm really glad that's not what I do. I have a huge amount of respect for doctors. I don't have the stomach for it, and I'm really glad that nobody tends to die when our systems go down. I'm sure there are maybe some people in this room for whom that's true, but that's certainly not true for us.
The reason I'm bringing this up is, I hate the term. I hate that we've all decided that postmortem is the term for what we do. I don't think it is, but I'm probably not going to change the vernacular of our industry, so we'll go with it.
Postmortems have to do with incidents, accidents that happen in production systems usually in this context. John Allspaw, who's the former CTO of Etsy, has described incidents this way: “Incidents are unplanned investments.” There's two things that I really like about that. One is the emphasis on the fact that incidents are unplanned. They're surprises. The other thing is that they're investments. Incidents take time. They add stress to people's lives, not just customers, but internally to engineers, people who are working customer support queues, and other stakeholders.
These are all people that we're paying. If we're making that investment, then what we talk about when we discuss postmortems and the postmortem process is opportunities to try to get some return from those investments. I've always thought of incident analysis or postmortems as opportunities for trying to recoup some return on that investment of that incident. What are those returns going to look like?
Action items are one way that we can recoup some return on those investments. Increasingly, I'm becoming convinced, however, that the real return is actually in learning. If we can figure out ways to improve how learning happens internally after an incident, then we improve lives for our customers and for our engineers, and we generally get better as an organization.
In practice, this is roughly what that process might look like: you construct a timeline of events that took place during an incident, you get a big group of people together—it can vary in size from the people who are directly involved in an incident to being an open invitation to everyone in the company—and you discuss that timeline. You go over the causal chain of events and then you talk about what went well. What are the things that we should preserve? You talk about what went wrong. And, depending on the group, this can get to be a very interesting discussion. Out of that, there will be a bunch of action items and that's the real value that you get.
This is definitely how I used to think about incidents. This was my primary mental model and it served me for a while. It definitely resonated with teams I was on and there is some use to it. But what are the problems here? I think there are a few.
One of the things that I found is that I've been a part of many postmortem review meetings where attendance was really poor. That's awful. This is an opportunity to learn from things that happened and to get better, so it shouldn't be like that. That seems like we might be doing something wrong. I've also heard people discuss these as theater, and I've certainly described them as theater in the past, meaning there's a trap you can fall into where you go through this review process, you generate a bunch of action items at the end and you're done, and then those action items sit in a backlog forever. That can really detract from morale and contributes to the first problem which is people not being excited to go to these things.
The other thing is that the timeline can be just pre-agreed upon. In this model, you've already assembled the timeline. You've come up with what happened by the time you get a group of people together to discuss it. That can be limiting.
One of the problems that I started to think about is that big meetings aren't the best place to talk about that timeline. If you are getting a big group of people together and you're asking them about this incredibly stressful, sometimes traumatic event, then having a big group of people isn't going to encourage people to get back into that mindset where they were responding to an incident. Maybe it was 2:00 in the morning, maybe they were under a lot of stress. I don't know about you, but a big group is not the place where I necessarily want to go through that again.
Try this instead. This is something that we've started experimenting with, which is having one-on-one interviews as part of your review process. Actually, Amy Tobey has a great talk called One-on-One SRE. Amy talks about her experiences conducting these interviews as part of the postmortem review process. We've started doing this internally at Under Armour and it has dramatically improved the experience.
One of the things that I found in conducting these interviews is that you can do things that you can't do in big meetings, which is like establish rapport, relate experiences, get people talking about what mindframe they were in, and what context they had during the response. Funnily enough, I've had people give me feedback that when we go through the process like this where someone interviews them and then we get together as a big group and we read back what we've been told, people felt more heard than if they were actually just given the floor during a postmortem review process. That was a really interesting takeaway for me.
But more than anything else, some of the gems that have come out of these interviews have been amazing. People talking about the ergonomics of tools that they're using, talking about some dashboard that they know is not right. There are all sorts of things that can come up in that one-on-one context that might not necessarily get surfaced otherwise. This emphasizes something I wanted to bring up. I've experienced this a lot and I've heard this repeated a lot: blamelessness is not the same as psychological safety.
Psychological safety is super important, and I think it's a great thing, but it's not blamelessness. What I mean by that is you can have a group of people who are super psychologically safe or comfortable with being vulnerable in front of each other, who are really close and work as a team effectively, and they will still not come up with this info. If you're doing things in a certain way, they will still be susceptible to falling into traps.
Blamelessness is not being safe to say, "I fucked up," because that's still centering you as the person who made a bad decision, which is actually inherently blameful even though it's an example of psychological safety. This tends to tease out some of these subtle nuances in the way that we use some of the vernacular.
Another problem that I've certainly experienced is that action items never get done. You have these postmortem review processes and these action items just sit at the bottom, which really makes me question if they're the key value that comes out of these things. If action items aren't getting done, are they that important or was the process geared to generate action items and that's why you did them?
Another thing that I've noticed here is that action items get done and you didn't know about it. You had the meeting, you went through what went well, what didn't go well, and you generated a list of action items. It turns out there's a whole bunch of stuff that engineers actually already did. Well, how could they know? You hadn't had the meeting yet, you hadn't come up with a list of action items, but it's because they had the context right after the incident.
If you have that engineer who's like, "You know what? I've been meaning to fix that thing. I knew it was wrong. We just had an incident; I'm going to schedule like five hours and just fix this thing,” that's not going to come up in one of these meetings necessarily, but you still want to capture that. This can be more effectively done through one-on-one interviews.
Something I've tried is shifting this focus towards stories, not action items. Action items are valuable, but stories I think are where the real value from incidents can be captured. I say stories because it enforces two things in my mind. Humans learn through storytelling. We relate experiences through storytelling. I think that just seems natural, and it really makes me think as somebody who does incident analysis, when I'm writing up a document or an artifact about an incident, I'm going to write it to be read because this is a story I'm telling. It's a narrative I'm forming. And if I do my job well, then this is going to be something that people refer to, that people enjoy reading, that they tell other engineers, "Oh, you're new on the team, you should read about this thing that happened." It just becomes part of the lore of a team, it becomes part of your organization.
These are the things that I think can really impact teams and organizations. They're gems that we can uncover. What this really stresses for me is this shift in thinking. Engineers love technical systems because technical systems can be reasoned about. We can think about them in certain ways, but the sources of resilience in our systems are not actually the technical systems but the humans who operate them. If we focus our processes on giving value to those humans, then we can really tap into the sources of resilience. Your systems are never going to be completely functioning, they're never going to be completely up to date, but what you can do is you can empower the people who have adaptive capacity to actually respond to those systems.
There are incidents I've been a part of and done the analysis of where engineers will say things like, "Oh, we always know that that dashboard isn't right." Well, then why don't you fix it? Well, maybe there's a bunch of reasons. Maybe they don't have time, maybe they don't have enough people on their team, maybe they don't have anybody who knows about that system right now. But if they know that, if they've internalized that, then they have the ability to say, "Double check those metrics. Don't rely on those.”
These are the types of things that can give you surprising adaptive capacity. Even in that example, it could be something that's more nefarious, more subtle and nuanced. It's not as easy to fix, but if people are learning and people are repeating stories about incidents, then they're going to have the ability to respond to incidents much more effectively.
This has also led to another shift in thinking, which is uncomfortable to me as an engineer, or at least it was. It's coming to the conclusion that our goal in doing these postmortems is not actually to understand what happened. It's not to understand a clear causal chain of events that led to an incident. It's actually to understand the context that people were operating within when responding to an incident that either helped or hindered their ability to make decisions. That's something that you can only get by conducting one-on-one interviews, by focusing on storytelling, not action items. What was going through somebody's head? What kind of circumstances were they dealing with at the time? The things that helped you are your sources of resilience. The things that hindered you are the things that you can attack as an organization. Try to figure out what you can do to limit those things that hindered people's ability to make decisions during an incident.
This has been the evolution of how we've thought about incident analysis. It's gone from this very linear way of looking at things. When I first started, we were using a system called the 5 Whys. It's not without its uses. Something I always repeat when I'm talking about this stuff is all mental models are wrong, some are useful. This is wrong, but it is useful. I'm not going to completely trash it. What I am going to say is that the 5 Whys can really limit your analysis.
For anybody who's unfamiliar with the 5 Whys, the basic idea is that you start with an incident and you work your way backwards by asking, why did X happen? Well, because of Y. Why did Y happen? Because of Z. After five whys, you'll arrive at this root cause. What's interesting about that is it trains you to think about causal chains of events.
There's been a lot of work done in this area. There's a woman named Nancy Leveson who's a giant in the area of accident analysis. She's written tons about this and practiced accident analysis at scales that I can't even fathom, and she observed that different groups using the 5 Whys will arrive at different root causes. That immediately makes you suspicious of the method. What it's doing is, at every stage, you're actually limiting discussion to one causal chain of events and eliminating a whole bunch of other possibilities that are actually rich sources of information from your incident.
If you get different groups of people looking at the same accident or incident using a technique like this and they're arriving at different conclusions, that also leads to another thing which is this idea of a singular root cause. If you can get five groups of people applying the 5 Whys to the same event and they're coming up with five different root causes, is it possible that there isn't such a thing as a root cause? That there are multiple contributing factors to any particular incident? If we focus on one root cause, we're making arbitrary decisions about where we stop our analysis, which means that we are also limiting what we can learn from this incident. Instead of root cause analysis, we tend to think about contributing factors.
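One way to picture this is a small sketch in Python. The incident, the event names, and the helper functions here are all invented for illustration; none of this comes from a real Under Armour incident. It assumes each event can have several answers to "why?", and shows how two groups that each keep only one answer per step end up at different "root causes", while collecting contributing factors keeps everything.

```python
# Hypothetical incident: each event maps to the things that contributed to it.
CAUSES = {
    "checkout outage": ["db connection pool exhausted", "alert fired late"],
    "db connection pool exhausted": [
        "retry storm from mobile clients",
        "pool size never revisited",
    ],
    "alert fired late": ["dashboard known to be inaccurate"],
    "retry storm from mobile clients": [],
    "pool size never revisited": [],
    "dashboard known to be inaccurate": [],
}


def five_whys(event, pick, max_depth=5):
    """Follow ONE causal chain: at each step keep a single answer to 'why?'."""
    chain = [event]
    for _ in range(max_depth):
        answers = CAUSES.get(chain[-1], [])
        if not answers:
            break  # nothing further recorded; this is where the chain stops
        chain.append(pick(answers))
    return chain


def contributing_factors(event):
    """Collect every factor reachable from the event, not just one chain."""
    seen = []
    stack = list(CAUSES.get(event, []))
    while stack:
        factor = stack.pop()
        if factor not in seen:
            seen.append(factor)
            stack.extend(CAUSES.get(factor, []))
    return sorted(seen)


# Two groups look at the same incident but favor different answers.
group_a = five_whys("checkout outage", pick=lambda answers: answers[0])
group_b = five_whys("checkout outage", pick=lambda answers: answers[-1])

print("Group A root cause:", group_a[-1])
print("Group B root cause:", group_b[-1])
print("Contributing factors:", contributing_factors("checkout outage"))
```

Neither group is wrong in this sketch. Each one just stopped at the end of the single chain it happened to follow, and both of their "root causes" show up in the list of contributing factors.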
When I first joined Under Armour, we were doing root cause analyses, we were doing 5 Whys, and we actually did weekly root cause analysis meetings. The idea was, if we could find all the broken things and fix all the broken things, then we'd be in a better world. I think that went on for about a year and we saw zero improvement. We saw dwindling interest in these retrospectives, and we started asking, "How's this working for us?" The answer was, it wasn't.
This is what our adjusted process looks like now having gone through some of these shifts and incorporating these practices that try to bring out some pretty nuanced concepts. We now analyze data and what that means is an incident will happen and somebody is given the responsibility of shepherding the incident analysis process. They're going to go through what data we captured during that incident. It could be chat transcripts, could be video conferences, could be some recorded bridge or something like that. They're going to identify people who were playing a key role in incidents and select those people as interesting people to interview and then schedule times to interview those participants and get their perspectives and collect information.
We actually ask people if they mind if we record those for our own purposes as the incident analyst. I'm surprised at how many people are like, "Yes, please. I have no problem whatsoever." It's really useful for me when I'm going through that information afterwards and collecting notes to have a recording of that conversation. It also allows me during the actual one-on-one to focus myself completely on asking questions of that person and not sitting there with a laptop having to take notes, which can be alienating.
Out of that interview process, we go back and analyze. Sometimes we pick new people to interview. This can go back and forth a few times and eventually we write a draft analysis. This is structured to be read, not something that we want customers to get. It's something that we want people to be excited to read. We try to approach it as though we're writing a narrative. When we meet with a group, we make the invites open at Under Armour. That's really useful because it can get more people excited about learning from these things. We've actually found a surprising number of people wanting to show up to certain postmortem reviews.
During those reviews, the person who's tasked with doing this analysis actually reads back the information. Like I said earlier, I've had people say that they've really felt like their point of view was represented even more so than if we just gave them the floor. This is an opportunity to say, "Hey, did I miss anything? Is there anything misrepresented here? Is this wrong?" Some really interesting discussions can come out of this because for the first time, you're taking all these one-on-one narratives and you're combining them and you're seeing what the group thinks is important or thinks is not important.
That can go back and forth too. You can have a meeting and go back to revise your draft. You can incorporate feedback from the group of people who were involved in the incident and then produce something new and then meet with people again and say, "Hey, how does this look? Am I getting this right? Does this accurately represent how it felt to be part of this incident and does this capture a lot of the learnings that we had from this incident?" Eventually you’ll publish with revisions.
One of the things that I try to be careful to do is document the action items as things that happened along the way because, like I said, the action items don't necessarily have to fall out at the end of the incident. It's not like a function that you input an incident, you get action items out. Action items can be stuff that engineers did in the moments after an incident. They can be stuff that happened in the next sprint after an incident if that amount of time has passed. It's good to document those things. I like to keep track of those things because it gives us a certain amount of confidence. We can look back and say, "Look at all the things that people have self-organized to do in response to these incidents."
As far as publishing, this is something that we're still tackling. We don't have a perfect solution for this, but make these things accessible. I have a future vision in my head where we have some internal tool that makes these things searchable by tags where a new engineer can just come on board and just say, "Show me everything that's ever happened to this system or everything that's involved this service that I work on.”
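To make that vision a bit more concrete, here is a minimal sketch in Python of what tag-based lookup could look like. It is not a tool we have built: the `IncidentReport` and `ReportIndex` names, the report titles, and the tags are all hypothetical, and a real version would sit on top of wherever the reports are actually stored.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    """A published incident write-up (hypothetical shape)."""
    title: str
    tags: set = field(default_factory=set)


class ReportIndex:
    """In-memory index from tag to the reports that carry it."""

    def __init__(self):
        self._by_tag = defaultdict(list)

    def add(self, report):
        for tag in report.tags:
            self._by_tag[tag.lower()].append(report)

    def search(self, *tags):
        """Return titles of reports carrying ALL given tags, in insertion order."""
        if not tags:
            return []
        wanted = [t.lower() for t in tags]
        candidates = self._by_tag.get(wanted[0], [])
        return [
            r.title
            for r in candidates
            if all(t in {x.lower() for x in r.tags} for t in wanted)
        ]


# Made-up reports and service names, purely for illustration.
index = ReportIndex()
index.add(IncidentReport("Sync backlog after deploy", {"workout-sync", "deploy"}))
index.add(IncidentReport("Login latency spike", {"auth", "database"}))
index.add(IncidentReport("Sync failures during failover", {"workout-sync", "database"}))

# "Show me everything that's involved this service that I work on."
print(index.search("workout-sync"))
print(index.search("workout-sync", "database"))
```

The point of the sketch is only that tagging reports by system or service at publish time is what makes the "show me everything" question cheap to answer later.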
Some concrete takeaways that I would encourage, if any of this resonated with you: practice interviewing responders as part of your postmortem process. It's been a really interesting experience for me. I would definitely encourage it. It also helps you connect with a lot of people on different teams in ways that you weren't maybe previously able to do. Focus on stories instead of action items. Think about incidents as opportunities for storytelling and for things that you can learn and internalize in your organization.
Understand that I'm not going to try to convince you all to change your process completely, but at least understand that 5 Whys and root cause analysis can limit our investigations. They can focus on one thing instead of a plethora of opportunities. Write incident reports to be read. Practice writing as a skill. If you're involved in the SRE world, then you're communicating and it's a really important skill I think to try to improve.
Focus on humans more, software less. Humans are a huge part of your system. They are there to protect when things go bad and they're there to make sure that the systems are always improving. Give them knowledge, give them opportunities to tell their stories and to surface what they go through operating the systems that we build. This is one of the areas where I think as software people, we have a lot to learn from people in other industries about how they do this stuff.