Are you Great at Incident Response?
With remote work and distributed teams as the norm, incident response is trickier. Years ago, everyone would gather in a war room and sort through the issue together, boots on the ground. Now, things have shifted. Teams need to adapt to resolve incidents, even if team members are a thousand miles away. But how can we make great incident response a reality?
There are three components to being exceptional during an incident. These components are crucial whether you’re in the office or at home working from your couch.
Ability to recognize how bad the situation is, and prioritize it
So, there’s a new incident. It’s only natural to be a little nervous, but keeping a level head during this tough time is key. To be exceptional at handling incidents, it’s important to know what you’re dealing with and react accordingly.
Is your incident a Sev 0, or a Sev 3? Acknowledging the difference can change the entire tone of the incident. Do you need to call the entire team on the weekend, or can one on-call member handle it until Monday morning? If too many people spend weekends responding to incidents, they'll burn out. This is especially true if a large percentage of the incidents could have waited until the work week.
According to Amy Tobey, one way to tell what sort of incident you’re having is to consider the customer impact. If there is no customer impact, the incident can take a lower priority. But if customer impact is high, it’s time to call or Slack for backup.
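The customer-impact rule above can be sketched as a small triage helper. This is only an illustration: the function name `triage`, the `TriageDecision` type, and the percentage cutoffs are all assumptions made for this example, not thresholds from Amy Tobey or any standard. Tune them to your own service.

```python
from dataclasses import dataclass


@dataclass
class TriageDecision:
    severity: int          # 0 is most severe, 3 is least severe
    page_whole_team: bool  # True: call for backup now; False: on-call handles it


def triage(customer_impact_pct: float) -> TriageDecision:
    """Map the share of affected customers to a severity and a paging decision.

    The cutoffs below are hypothetical and exist only to show the idea.
    """
    if not 0 <= customer_impact_pct <= 100:
        raise ValueError("customer_impact_pct must be between 0 and 100")
    if customer_impact_pct == 0:
        # No customer impact: lowest priority, can wait for the work week.
        return TriageDecision(severity=3, page_whole_team=False)
    if customer_impact_pct < 5:
        return TriageDecision(severity=2, page_whole_team=False)
    if customer_impact_pct < 25:
        return TriageDecision(severity=1, page_whole_team=True)
    # Widespread customer impact: treat as a Sev 0 and bring in backup.
    return TriageDecision(severity=0, page_whole_team=True)
```

Writing the rule down, even this roughly, gives the on-call engineer a shared reference point for deciding whether a weekend page is justified.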
No matter how severe the incident is, it’s important to keep calm and carry on. Emergency room doctor Dan Dworkis, MD PhD, wrote a piece on how to respond productively when things go wrong. He states, “The first step is to acknowledge that what happened was, in fact, bad.”
Of course, we want fewer incidents, as we want to minimize customer impact. But we need to know how to go about resolving them without losing our minds. Dan suggests addressing this by using the phrase “Well, this is suboptimal.”
Dan gives an example of a car accident for how this mantra can come in handy. Imagine you’re involved in a car accident that damages your tire. You can’t continue to drive on it as normal, as the car won’t function. Something bad happened, and you need to address it. But, this doesn’t warrant you stepping out of your car in the middle of a busy street screaming and crying. It’s an issue with a tire, nobody died. The middle ground is to say “Well, this is suboptimal” and begin to resolve the issue.
Having this level-headed mindset during an incident can be a massive boon to your team, especially when you’re working together to decide what level of response is necessary. Situational awareness is key.
Effective communication skills
This one should come as no surprise, especially in the context of remote work. Communicating during an incident is a necessity. With distributed teams, it can be especially challenging to know who is doing what. Great incident response means communicating with teammates and, as needed, with superiors and customers. This ensures that everyone is on the same page.
Great incident response is built on procedures. A very important part of communicating is telling your team what step of the procedure you’re working on. To begin, let your team know that you’re listening, active, and responding to the incident by checking in. Checking in, either on Slack or in your incident management platform, lets your team know that you’re on board. That simple gesture can create a lot of solidarity.
Another important part of incident response is streamlining communication with affected parties and internal stakeholders. Managers will want to be looped in on developments during the resolution process. If the incident is large enough, executives and customers will want to know the status of the service as well. There are two components to making sure this happens:
- Designating someone to take the reins for communicating to stakeholders. Managers don’t want repeat messages, or worse, mixed messages from different people. And customers won’t want tons of emails about your outage if nothing has improved. Selecting a single person to communicate developments minimizes the chances of wasteful overlaps.
- Communicating developments with your team and remembering to tag or @ the communications lead to ensure they see what’s going on. It's important to update your communication lead on your progress. This gives your stakeholders accurate visibility into the incident.
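The two points above, a single voice for stakeholders and no repeat messages when nothing has changed, can be sketched in a few lines. The `StatusBroadcaster` class and its method names are hypothetical and not part of Slack or any incident management platform. Treat this as a rough model of the idea rather than a real integration.

```python
from typing import Optional


class StatusBroadcaster:
    """Tracks the last status sent so stakeholders only hear about changes."""

    def __init__(self, comms_lead: str) -> None:
        self.comms_lead = comms_lead
        self._last_status: Optional[str] = None

    def publish(self, author: str, status: str) -> Optional[str]:
        """Return the message to send, or None if nothing should go out."""
        if author != self.comms_lead:
            # Only the designated lead talks to stakeholders.
            return None
        if status == self._last_status:
            # Nothing has changed, so skip the repeat message.
            return None
        self._last_status = status
        return f"Update from {author}: {status}"
```

Everyone else on the team still posts progress internally. The communications lead is the only one whose updates pass through to managers and customers.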
Incidents are tricky, and bad communication will only make them harder. Instead, focus on working together as a team and talking through the whole process. With remote work, this is especially important, as you may be in different cities or even countries.
Compassionate responses to mistakes and a learning mindset
Every engineer makes mistakes; it’s how we learn. When an incident happens, it’s easy to place blame on the last person who pushed code. But, people are never the root cause of an incident; processes are. To be great at incident response, you will need to be compassionate in the face of these mistakes and learn from them.
Issues won’t only cause incidents; they’ll pop up during incidents. Sometimes a fix can cause more damage to a service than it repairs. You’ll need to learn to have compassion during these moments, too. Instead of getting angry with a team member, remember that they are trying to help. Everyone is making the decisions they feel are best at that moment. Support one another. The occasional emoji or GIF here and there can help create a sense of camaraderie. It also helps communicate that you know all mistakes were made with good intentions.
And once the incident is all said and done, it’s important that you take a closer look at it to learn. Great incident management comes from treating each incident as a learning opportunity. This will help you be more successful at resolving future incidents, and can even prevent some from happening.
Process is important here, too. Just because you and your team learned something doesn’t mean everyone else has. In fact, often only the people involved in the incident learn from it. The rest of the information is buried in files or forgotten. This problem is only exacerbated for distributed teams.
To make sure you capture what you learned, write a comprehensive incident retrospective. Aggregate all the key components (such as graphs and timelines) to form a narrative of what happened. With more data at hand, a clearer story begins to form and teams gain context without placing blame.
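As a minimal sketch of that aggregation step, the snippet below merges events from several sources into one chronological timeline. The `TimelineEvent` type, the source labels, and both function names are assumptions for illustration. A real retrospective would pull this data from your chat and monitoring tools.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    timestamp: datetime
    source: str   # hypothetical labels such as "alert", "chat", or "deploy"
    summary: str  # describe what happened, not who is to blame


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from every source into one chronological list."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda event: event.timestamp)


def render_timeline(events: list[TimelineEvent]) -> str:
    """Format the timeline as plain text lines for the retrospective."""
    return "\n".join(
        f"{e.timestamp:%H:%M} [{e.source}] {e.summary}" for e in events
    )
```

Keeping each summary focused on events rather than individuals helps the finished narrative stay blameless.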
Great incident response is within your grasp
An important thing to note about all three of these components is that they are teachable. With experience, you too can become a great incident commander. You can learn about incidents through participation and reading retrospectives. You can practice networking and inspiring people while keeping them focused. You can be intentional about language, look at things from different perspectives, and focus on improving processes without blaming people.
Blameless
Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.