Improving Postmortem Practices with Veteran Google SRE, Steve McGhee
The rope out of pager hell is woven with a thorough and rigorous postmortem process.
For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, simply getting out of pager hell feels worth celebrating with all your coworkers, friends, and family.
How can teams climb out of it?
How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control? The rope out of pager hell is woven with a thorough and rigorous postmortem process.
Steve McGhee is an expert in postmortems and SRE. From a decade of leading advanced SRE practice at Google to introducing SRE practices and culture at MindBody, Steve has a unique perspective and clarity on what defines realistic and mature postmortem practices. In our interview with Steve, he shares nuanced insights on how you can take your company’s postmortem practices to the next level. (Disclaimer: Please note that Steve gave this interview prior to re-joining Google, so this interview is not a statement from Google, nor does it represent Google’s view.)
Often-Overlooked Components of High Quality Postmortems
At Google, postmortems always have a consistent structure — problem, trigger, root cause, correlating problems, action items — and ideally span two to three pages. This structure is only a starting point. The five more advanced practices below all serve one common purpose: to maximize learning and continuous improvement.
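To make the structure concrete, here is a minimal sketch of that postmortem shape as a data model. The field names and example values are illustrative only, not an official Google template:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Postmortem:
    """One record per incident, mirroring the structure above."""
    problem: str                     # what users experienced
    trigger: str                     # what set the incident off
    root_cause: str                  # the underlying defect
    correlating_problems: List[str]  # related issues surfaced along the way
    action_items: List[str]          # follow-up work (detect/mitigate/prevent)

# Hypothetical example incident for illustration:
pm = Postmortem(
    problem="Elevated checkout errors for 40 minutes",
    trigger="Config push enabled a new cache path",
    root_cause="Cache client retried without backoff, overloading the DB",
    correlating_problems=["Stale runbook for cache rollback"],
    action_items=["Add alert on cache retry rate (detect)"],
)
```

Keeping the sections this explicit makes it harder to skip a step — a postmortem missing a trigger or root cause is visibly incomplete.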
1. Choose the Right Author
Postmortems are often written by someone who didn’t go through the incident themselves: a team lead, or sometimes a junior team member. Assigning a senior engineer to the postmortem is well-intentioned — they’re more likely to be aware of broader issues and deeper technical history, or simply have more practice writing postmortems — but it is ultimately self-defeating. Removing the incident’s original participants dilutes the quality of information and leaves out important situational context, and collecting details can become a time-consuming game of telephone.
The author of the postmortem document should be the individual(s) who went through the incident and contributed to its resolution, because only they can properly explain the what/when/why/hows of their actions. If an engineer is trusted with the pager, then they can be trusted to own the postmortem. The right postmortem author enables the highest quality of learning for future readers.
It’s beneficial to set the expectation that all on-call engineers track in real time what they are doing, when, and why. Ideally, during incident resolution they keep two windows open on their desktop: one with the in-progress postmortem draft, the other with the browser or command line used for resolution. Of course, mitigating the outage and restoring service comes first; documentation is secondary. Keeping both windows open simply makes documentation easier and a useful postmortem more likely.
If an engineer is trusted with the pager, then they can be trusted to own the postmortem.
2. Turn the Postmortem into a Textbook-let
You can show the progression of understanding an issue through a series of graphs directly in the postmortem. Suppose a single database is returning data too slowly. Start the postmortem with a graph of the user-facing errors. Then drill down with graphs showing the affected traffic flows and backends. Finally, show the discovered source of the errors.
When you include graphs as a storytelling aid, make sure that you include how the graph was derived (e.g. which variables are shown, filtered, aggregated, etc.) and possibly even a live link to the same viewport for others to follow.
A “what happened” narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.
3. Focus the Timeline
There is little utility in including the entire chat log of an incident. Instead, illustrate a timeline of the important inflection points (e.g. actions that turned the situation around). This can prove very helpful when troubleshooting future incidents.
Beware of the urge to redact names. Redacting names often loses valuable conversational context, and it can itself signal a blameful culture, an anti-pattern. You can preserve objectivity and context by replacing names with pseudonyms or team roles.
4. Create Action Items with Structured Discipline
To improve product reliability more systematically, structure your action items. At Google, every postmortem has at least one action item in each of the three categories below:
- Detection - How can you detect it with better fidelity/precision next time?
- Mitigation - How can you get out of an incident like this faster next time?
- Prevention - How can you prevent this class of incidents in the future?
Each action item will contain:
- A name / link (to Jira, etc.)
- A single owner / username
- Type (prevent, detect, mitigate)
Steve shared a postmortem template in his blog post on rebuilding SRE from memory.
5. Seek Feedback on Postmortems
How do you know if you’re writing better postmortems?
The only way for SREs to get better at writing postmortems is through postmortem reviews. These are weekly meetings where a board of experienced, cross-functional SREs comes together for an hour to offer feedback on three postmortems submitted for review, spending 20 minutes on each. Any engineer is welcome to join the postmortem review. During the meeting, the board of SREs offers guidance and stimulates discussion. They may ask:
- “Do you think you understand all the factors that contributed to this outage?”
- “Did you consider X, Y, or Z alternatives in your resolution process?”
- “How are your action items prioritized? Have you considered alternative prioritization for reasons A, B, or C?”
- “Did you look at postmortem ##-####? They encountered a similar issue that you might find helpful.”
This board acts as a mentor encouraging incremental improvement, rather than as a manager demanding perfection. They model a constructive, in-depth, and blameless postmortem process for junior engineers. It’s important to note that postmortem reviews are not scheduled in response to an outage; they proceed on a regular cadence (e.g. weekly) with an open invite. Anyone can add a postmortem to the agenda. Anyone can lurk in the meeting :)
Through postmortem reviews, junior engineers can grow faster and be more helpful to their organization than if they just went through one postmortem after another without reflection or feedback. It takes one organizer and a handful of senior SREs to start and maintain these reviews. Companies with smaller SRE teams can seek out external SRE advisors for feedback and learning.
Postmortems: A Great Starting Point for SRE Adoption
For companies that want to adopt SRE but don’t know where to start, initiating a postmortem practice can catalyze a virtuous cycle of reliability improvement.
Conducting blameless postmortems will enable you to see gaps in your current monitoring as well as operational processes. Armed with better monitoring, you will find it easier and faster to detect, triage, and resolve incidents. More effective incident resolution will then free up time and mental bandwidth for more in-depth learning during postmortems, leading to even better monitoring.
In other words, building a postmortem practice will eventually enable you to identify and tackle entire classes of issues, including deeply rooted technical debt. With time, you’ll be able to practice SRE in earnest, continuously and directly improving your systems.
Among the seemingly disconnected incidents, there are actually common shapes or underlying patterns.
Unfortunately, many engineers never go back and look at the postmortems they’ve written. They miss the clues a detective would follow to solve the mystery: among seemingly disconnected incidents, there are common shapes and underlying patterns, because all incidents come from the same system. If you uncover and resolve these underlying issues, you can prevent entire classes of incidents. Then, you might just see the light out of the not-so-inevitable pager hell.
This is the first article of a two-part series. Click here for part 2 of the interview with Steve McGhee.
Written by Charlie Taylor