
The Ultimate, Free Incident Retrospective Template




    Incident retrospectives (or postmortems, post-incident reports, RCAs, etc.) are the most important part of an incident. This is where you take the gift of that experience and turn it into knowledge. This knowledge then feeds back into the product, improving reliability and ensuring that no incident is a wasted learning opportunity. Every incident is an unplanned investment and teams should strive to make the most of it.

    Yet, many teams find themselves unable to complete incident retrospectives on a regular basis. One common reason for this is that day-to-day tasks such as fixing bugs, managing fire drills, and deploying new features take precedence, making it hard to invest in a process to streamline post-incident report completion. To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. Below is an example of what a comprehensive, narrative incident retrospective could look like.

    Summary

    This should contain 2-3 sentences that give the reader an overview of the incident’s contributing factors, resolution, classification, and customer impact level. The briefer, the better: this is the first thing engineers will look at when trying to resolve a similar incident.

    Example: Google Compute Engine Incident #17007

    This summary states “On Wednesday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring.”

    People involved and roles

    This section should list the participants in the incident as well as what roles they played. Common roles include:

    • Incident commander: Runs the incident. Their ultimate goal is to bring the incident to completion as fast as possible.
    • Communications lead: Owns communication with stakeholders during the incident. For smaller incidents, this role is typically subsumed by the Incident Commander.
    • Technical lead: An individual who is knowledgeable in the technical domain in question, and helps to drive the technical resolution by liaising with Subject Matter Experts.
    • Scribe: A person who may not be actively working the incident, but who transcribes key information as it unfolds.

    You may have all, some, or none of these roles, depending on how you structure incident response.

    Customer impact

    This section describes the level of customer impact. How many customers did the incident affect? Did customers lose partial or total functionality? Adding tags here can also aid future reporting, filtering, and search.

    Example: Google Cloud Networking Incident #19009

    In the section titled “DETAILED DESCRIPTION OF IMPACT,” the authors thoroughly break down which users and capabilities were affected.

    Follow-up actions

    This section is critical for ensuring forward-looking accountability for addressing an incident’s contributing factors. Follow-up actions can include upgrading your monitoring and observability, fixing bugs, or even larger initiatives like refactoring part of the code base. The best follow-up actions also specify who is responsible for each item and by when the rest of the team should expect an update.

    Example: Sentry’s Security Incident (June 12, 2016)

    While detailed action items are rarely visible to the public, Sentry did publish a list of improvements the team planned to make after this outage covering both fixes and process changes.

    Contributing factors

    As systems grow more complex, it’s harder than ever to pinpoint a single root cause for an incident. Each incident might involve multiple dependencies that affect the service, and each dependency might produce its own action items, so there is rarely a single root cause. To identify contributing factors, consider using “because/why” statements: state what happened, explain why it happened, and keep asking why for each answer.

    Example: Travis CI’s Container-based Linux Precise infrastructure emergency maintenance

    In this retrospective, the authors cover contributing factors such as a change in how the Docker backend executes build scripts, missing alerting coverage for the errors, and more.

    Narrative

    This section is one of the most important, yet one of the most rarely filled out. The narrative section is where you write out an incident like you’re telling a story. 

    Who are the characters and how did they feel and react during the incident? What were the plot points? How did the story end? This will be incomplete without everyone’s perspective. 

    Make sure the entire team involved in the incident gets a chance to write their own part of this narrative, whether through async document collaboration, templated questions, or other means.

    Timeline

    The timeline is a crucial snapshot of the incident. It details the most important moments and can contain key communications, screenshots, and logs. Assembling it by hand is often one of the most time-consuming parts of a post-incident report, which is why we recommend using tooling to aggregate the timeline automatically.
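
    As a minimal sketch of what that automation could look like, the snippet below merges timestamped events from a couple of hypothetical sources (a chat export and an alert log) into a single chronologically ordered timeline. The file names and field names are illustrative assumptions, not the format of any particular tool.

        import json
        from datetime import datetime, timezone

        def load_events(path, source):
            # Load a JSON list of {"timestamp": ISO-8601, "text": ...} records
            # and tag each event with the source it came from (assumed format).
            with open(path) as f:
                records = json.load(f)
            return [
                {
                    "time": datetime.fromisoformat(r["timestamp"]).astimezone(timezone.utc),
                    "source": source,
                    "text": r["text"],
                }
                for r in records
            ]

        def build_timeline(sources):
            # Merge events from every source and sort them chronologically.
            merged = []
            for path, name in sources:
                merged.extend(load_events(path, name))
            return sorted(merged, key=lambda e: e["time"])

        if __name__ == "__main__":
            timeline = build_timeline([
                ("incident-channel-export.json", "chat"),   # hypothetical chat export
                ("alert-log.json", "alerts"),               # hypothetical alerting log
            ])
            for event in timeline:
                print(f"{event['time'].isoformat()}  [{event['source']}]  {event['text']}")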

    Technical analysis

    Technical analyses are key to any successful retrospective. After all, the analysis serves as a record, and a possible resolution, for future incidents. Any information relevant to the incident, from architecture diagrams to related incidents to recurring bugs, should be detailed here.

    Here are some questions to answer with your team:

    • Have you seen an incident like this before?
    • Has this bug occurred previously, and if so, how often?
    • What dependencies came into play here?

    Incident management process analysis

    At the heart of every incident is a team trying to right the ship. But how does that process go? Is your team panicked, hanging by a thread and relying on heroics? Or, does your team have a codified process that keeps everyone cool? This is the time to reflect on how the team worked together. 

    Here are some questions to answer with your team:

    • What went well? 
    • What went poorly?
    • Where did you get lucky and how can you improve moving forward?
    • Did your monitoring and alerting capture this issue?

    Messaging

    Communication during an incident is a necessity. Stakeholders such as managers, the lines of business (e.g. sales, support, PR), C-levels, and customers will all want updates. But internal and external communication might look very different. Even internal communication might differ between what you would send a VP of Engineering and what you would send your sales team.

    Here, document the messaging that was disseminated to different categories of stakeholders. This way, you can build templates for the future to continue streamlining communication.

    Example: Google Compute Engine Incident #15056

    In this incident, Google ensures that all major updates are regularly communicated. The team also lets users know when they can next expect to be updated. “We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 19:00 US/Pacific with current details.”
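
    As a small sketch of how such reusable messaging might be built, the example below uses Python’s string.Template to render per-audience updates. The audiences and placeholder fields are assumptions for illustration, not a prescribed format.

        from string import Template

        # Hypothetical per-audience templates; adjust the wording and fields to your organization.
        TEMPLATES = {
            "customers": Template(
                "We are investigating elevated error rates affecting $service. "
                "Next update by $next_update."
            ),
            "internal": Template(
                "[$severity] $service incident: $impact. "
                "Incident commander: $commander. Next update by $next_update."
            ),
        }

        def render_update(audience, **fields):
            # Fill in the chosen template with the current incident details.
            return TEMPLATES[audience].substitute(**fields)

        print(render_update(
            "internal",
            severity="SEV-2",
            service="Checkout API",
            impact="~25% of requests failing",
            commander="on-call lead",
            next_update="19:00 US/Pacific",
        ))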

    Other Best Practices to Keep in Mind

    • Do the report within 48 hours
    • Ensure reports are housed such that they can be dynamically surfaced during incidents
    • Add graphics and charts to help readers visualize the incident
    • Be blameless. Remember that everyone is doing their best and failure is an opportunity to learn

    Parting Thoughts

    Failure is the most powerful learning tool, and deserves time and attention. Each retrospective you complete pushes you closer to optimal reliability. While they do take time and effort, the result is an artifact that is useful long after the incident is resolved. 

    By using this template, your team is on the way to taking full advantage of every incident.


