SRE for Business Continuity in the Face of Uncertainty

No, it won’t be possible to continue operating business-as-usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. But with SRE best practices, your team can embrace resilience and adapt.

Embrace resilience with incident management procedures

One way to think of these difficult circumstances is to envision them as an incident. Incidents are a form of unplanned work, and crises fall under that category. When you’re under pressure during an incident, you likely rely on runbooks, organization charts, and playbooks. These key components of incident response also apply when dealing with uncertain circumstances.

You’ll still need accurate organization charts to know who works in what department, how to contact them, and how to escalate issues when necessary. You still need playbooks to execute on day-to-day issues. But you’ll need to adjust these to better reflect the current reality.

Your new runbooks will need to revolve around working from home. Key information will include meeting protocols, agreed-upon hours or results required per day or week, and how to communicate with your team. This will require flexibility. Some people will prefer standard email, while others will want messages via Slack. You’ll also need to determine which discussions merit calls, and who needs to be notified in the event of impromptu meetings.

Besides creating new runbooks, you’ll also need to review how you deal with illness and family emergencies. Capacity will need to increase, not only for servers and storage but also for headcount. It will be important to plan for how your team will function when members fall ill or need to care for family and friends.

Revise your on-call schedules

On-call will need revisions as well. With the increased strain on your infrastructure, incidents could spike. If some engineers are on call during the times of highest usage, they could be the ones responding to the brunt of the issues. This mental strain could lead to burnout and to immune systems weakened by stress. Instead of tracking only the time a person spends on call, begin a qualitative analysis. If one engineer spends a day on call and is paged only once, it might seem like they could help load balance for a second engineer who was on call for an entire weekend and was paged three times. But if the three outages lasted only one hour apiece and the single outage lasted 16 hours, the first engineer will need more rest.
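As a rough illustration of that comparison, the sketch below totals the hours each engineer actually spent responding rather than counting pages. The engineer names, the `Shift` structure, and the 24-hour and 48-hour shift lengths are assumptions made for the example, not part of any particular on-call tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shift:
    engineer: str                # hypothetical identifier
    hours_on_call: float         # assumed shift length in hours
    incident_hours: List[float]  # duration of each paged incident, in hours


def load_summary(shift: Shift) -> dict:
    """Summarize a shift by pages received and hours spent responding."""
    return {
        "engineer": shift.engineer,
        "pages": len(shift.incident_hours),
        "hours_responding": sum(shift.incident_hours),
        "longest_incident": max(shift.incident_hours, default=0.0),
    }


# The two shifts described above: one day with a single 16-hour outage,
# and one weekend with three one-hour outages.
shifts = [
    Shift("engineer_a", 24, [16.0]),
    Shift("engineer_b", 48, [1.0, 1.0, 1.0]),
]

summaries = [load_summary(s) for s in shifts]

# Rank by time spent responding, not by page count or shift length.
most_loaded = max(summaries, key=lambda s: s["hours_responding"])
print(most_loaded["engineer"], most_loaded["hours_responding"])
```

Ranking by page count or shift length would point at the weekend shift. Ranking by hours spent responding points at the engineer who handled the 16-hour outage. Hours alone are still only a proxy, so treat the numbers as a prompt for a conversation with the people involved.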

Planning for times of uncertainty and being flexible can improve your business continuity. If you need a little help getting organized, HubSpot has a business continuity template.

Write incident retrospectives to understand recurring issues

Outages during this time are unprecedented as companies struggle under increased demand. With the switch to WFH, even Microsoft Teams had an outage lasting two hours. Gaming platforms, stock trading sites, and corporate VPNs are of particular concern given the influx of daily users. Incidents are cropping up at an alarming rate.

In fact, all services (from internet providers to grocery stores and health facilities) are being stretched to capacity, and can’t afford to keep making the same mistakes. With the increased volume of incidents, it would be easy to skip over the retrospective. But this is one of the worst pitfalls of firefighting. By skipping the retrospective, you lose the opportunity to learn from incidents and prevent them from occurring again. Crises don’t have a set end date. If you don’t begin working down same-class issues soon, you will eventually be overwhelmed.

By writing retrospectives and working your way through a root cause analysis, you’ll be able to identify two ways to speed up your processes:

Identify bottlenecks. Is there a recurring sticking point that keeps services from being improved or incidents from being resolved quickly? Bottlenecks can be people or processes, and it’s important to know which one you’re dealing with. For example, in “The Phoenix Project” by Gene Kim, Kevin Behr, and George Spafford, Brent was a huge bottleneck. As a gifted engineer who dabbled in all aspects of the service, Brent was a constant go-to for any issue. This meant he spent most of his time on unplanned work and undocumented requests. This overloaded him and slowed down system-wide improvements. In situations like these, it’s important to make sure engineers feel empowered to say no, focus on project work, and get some quality heads-down time.

If the bottleneck is a process, you’ll need to review your workflows for that particular process. While this sort of work is less visible, it’s important for efficiency and innovation. Without bottlenecks, you’ll be able to improve your service and resolve incidents faster. It’s well worth calling a meeting to work through the problem. And you’ll need retrospectives to make these informed decisions.

Automate toil. Writing retrospectives can also help you understand where you’re losing time to toil. For example, if 5 minutes of a 15-minute outage are spent getting participants filled in on the issue, about 33% of your time to resolution is toil. You could automate the incident resolution process to generate a communication hub for your incident that fills others in on the details. Additionally, how much time do you spend writing postmortems? Do you spend hours searching for disparate information to include in your timeline? This is toil as well. By using a tool to aggregate key data for you, you and your teammates are free to do the important part: learning.
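The toil arithmetic above can be sketched in a few lines. This is a minimal example that assumes you record the minutes spent in each phase of an incident timeline. The phase names and the `toil_share` helper are hypothetical and chosen only for illustration.

```python
def toil_share(phase_minutes: dict, toil_phases: set) -> float:
    """Return the fraction of total incident time spent in phases marked as toil."""
    total = sum(phase_minutes.values())
    if total == 0:
        return 0.0
    toil = sum(
        minutes
        for phase, minutes in phase_minutes.items()
        if phase in toil_phases
    )
    return toil / total


# The 15-minute outage from the example: 5 minutes briefing participants,
# with the remaining 10 minutes assumed to be diagnosis and repair.
outage = {"briefing_participants": 5, "diagnosis_and_fix": 10}

share = toil_share(outage, {"briefing_participants"})
print(f"{share:.0%} of this incident was toil")
```

Running the same calculation across several retrospectives shows which phases consume the most time overall, which is one way to decide what to automate first.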

Learn continuously to adapt

Embracing resilience also requires flexibility in how you think and learn. If you let key learning opportunities pass by, you’ll miss the chance to build that flexibility. This adaptation is crucial during times of crisis and uncertainty. Business can’t proceed as it used to. We need to iterate on our processes, behaviors, and mindsets to thrive.

The first step in flexibility is a mindset change. You’ll need to learn to be patient with others. Many of your coworkers are now working from home. This means there are pets, partners, and children to deal with. This isn’t an ideal working situation for most. It’s high-stress and distracting. Your team member who used to reply to your Slack messages in two minutes now might need 15-20. And that’s okay. Meetings might be a little tougher with busy households, and that’s okay. Productivity might dip while people learn how to operate in this new normal, and that’s okay. We must be patient with each other and ourselves while we adapt.

You’ll also need to consider how to be flexible in creating new team dynamics. In the office, you know how your teammates take their coffee and what they did last weekend because you have a break room that allows for this level of connection. Without that, how will you keep your team talking? Fun Slack groups, team water coolers via Zoom, and virtual game nights all become important here, not only because they keep you feeling like part of the same team, but because camaraderie matters so much while social distancing. Human connection keeps us motivated. Knowing that someone else counts on us can keep us working even when we’re overwhelmed.

Lastly, you’ll also need to be flexible in your learning resources. Cancelled conferences, a lack of internal continuing education, and classes either postponed or moved online mean you might be suffering from a knowledge drought. It’s more important than ever to find safe, healthy ways to learn and interact with the community. This could be attending virtual conferences, weighing in on live panels, or reading industry news. Some of our favorite resources are:

Blameless

Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.