Minimizing SPOFs During Summer Slowdown
Originally published on Failure is Inevitable.
Between COVID-19 and the typical summer slowdown, offices are emptier than they’ve ever been. With team members taking some much-needed time off, it’s important to know how your team will be affected. Here are some tips to help your teams function during this time of flux.
Minimizing SPOFs (single points of failure)
It’s important to know who you can count on in a crisis, but what happens if that person gets sick or needs to tend to their children or loved ones? It’s critical during times of upheaval that you can function in the event of team members needing to take PTO or sick leave. This means you’ll need to determine who your SPOFs (single points of failure) are. If this person needs to take a day, or perhaps even a week off, what context will you miss?
At Google, teams practice an exercise called the “Wheel of Staycation” to ensure that SPOFs are discovered prior to a crisis situation. In previous talks, Dave Rensin has also spoken on how to implement this exercise in your organization. In a nutshell, once a week, a single person from your team is selected to get a ‘staycation’ where they are unable to communicate with the rest of the team.
- For the person in question, this time can be used for deep project work.
- For everyone else, it’s a test. Are there questions you can’t get answered, or blockers in your workflow? If this is the case, you have a SPOF to eliminate.
Once the exercise is complete, the next step is to eliminate the SPOFs it exposed. To do this, you’ll need to track the asks that your staycation team member received in one day. You could do this by setting up a staycation Slack channel where all questions you would have asked your teammate are listed. Or you could simply tell your team to ask those questions via email or Slack to your staycation teammate as usual and have that person create a list when they come back from their staycation. Then, the staycation teammate should be required to make sure all the context and key information the team needed is baked into process docs or Confluence pages, so the SPOFs are eliminated.
This method should be practiced as a preventative initiative. According to Dave, “There are other things you can do, but the only way you can discover things like expertise SPOFs or information SPOFs is to regularly and routinely exercise them before the emergency shows up.”
If you wait too long, you’ll only spot SPOFs when it’s a true emergency, at which time it’ll be too late.
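The tracking step described above can be sketched in a few lines of Python. This is only an illustration, assuming the asks are logged as simple records; the field layout, names, and topics are hypothetical and not taken from the Google exercise.

```python
from collections import Counter

# Hypothetical log of questions the team would have asked the
# staycation teammate in one day: (asker, topic, already_documented).
asks = [
    ("ana", "deploy pipeline", False),
    ("ben", "deploy pipeline", False),
    ("ana", "billing alerts", True),
    ("cho", "database failover", False),
]


def undocumented_topics(asks):
    """Count asks per topic that had no existing doc, most frequent first."""
    counts = Counter(topic for _, topic, documented in asks if not documented)
    return counts.most_common()


backlog = undocumented_topics(asks)
for topic, count in backlog:
    print(f"{topic}: {count} ask(s) with no doc to point to")
```

Topics at the top of this list are the ones where the team depended most on a single person, so they are reasonable candidates to document first.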
Preparing for staffing reductions and changes to continuity plans
Knowing and eliminating SPOFs is key during any crisis, but it is especially important when staffing reductions come into play. These reductions can be for business or financial reasons, or due to decreased cognitive capacity as people deal with personal matters (anxiety, family needs, health challenges, etc.). Either way, you will need to be able to adapt to these reductions and make adjustments to your continuity plans. Here are three crucial steps to working through this challenge:
- Revise your on-call schedule: Perhaps your organization ran a weekly on-call schedule where you would take turns carrying the pager for a week. Now, with fewer people on board, your rotation might have gone from once a month to once every two weeks. With the additional burden of keeping services running, the challenges of WFH, and the strain that the current crisis adds to everyone, this might become overwhelming. Being on-call for a full week might be too much of a burden. It’s time to talk to people about how they’re feeling, and to take a look at the incident metrics and the time spent per engineer on incidents. If these numbers are higher than normal and team members are reporting higher levels of stress, you’ll need to adjust to make the situation easier to bear. Knowing when your team members are overwhelmed and adjusting accordingly is crucial to mitigate burnout and enable long-term success.
- Know the difference between doing more with less and overworking: Everyone is talking about doing more with less. Many organizations are focusing on getting down to the ‘essentials,’ with respect to tooling, work perks, and more. However, as capacity is reduced, it’s important to acknowledge that productivity will take a hit. It's unrealistic to assume that teams can work at the same clip as they did before a black swan event. However, this does not mean sacrificing quality. Strategic prioritization is more important than ever. In times of uncertainty, focus on quality over quantity and speak with leadership about how to adjust goals and metrics for performance during this time.
- Become comfortable with being uncomfortable: During this crisis, all of our previous continuity plans have proven to be insufficient. As unknown unknowns are nearly impossible to account for, no amount of planning will be able to completely prevent impact, and even the most thorough continuity plans will require adaptability. In short, you’ll need to become comfortable with being uncomfortable. As Liz Fong-Jones has shared, “You can't enumerate every single possible thing that's going to go wrong. The playbook strategy is not necessarily going to work super well because you cannot anticipate what the next black swan is going to be. So we have to focus on making our organizations of people more resilient.”
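For the on-call revision in the first step above, a rough Python sketch shows how shrinking a team changes the rotation and how to flag engineers whose incident time is above normal. The function names, the sample hours, and the 1.5x threshold are assumptions for illustration, not recommended values.

```python
def weeks_between_shifts(engineers, shift_weeks=1):
    """With equal turns, each engineer starts a shift every engineers * shift_weeks weeks."""
    if engineers < 1:
        raise ValueError("need at least one engineer in the rotation")
    return engineers * shift_weeks


def overloaded(incident_hours, baseline_hours, threshold=1.5):
    """Names of engineers whose incident hours exceed threshold * baseline (assumed cutoff)."""
    limit = threshold * baseline_hours
    return sorted(name for name, hours in incident_hours.items() if hours > limit)


# Four engineers on weekly shifts: on call roughly once a month.
print(weeks_between_shifts(4))
# Two engineers on weekly shifts: on call every other week.
print(weeks_between_shifts(2))

# Hypothetical hours spent on incidents this week, against a 6-hour normal week.
flagged = overloaded({"ana": 12, "ben": 4}, baseline_hours=6)
print(flagged)
```

Numbers like these only supplement the conversation with the team. Reported stress still matters even when the hours look normal.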
While all of these things are easier said than done, they are a good starting point for pivoting in this new reality.
Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.