Using Automation and SLOs to Create Margin in your Systems

Originally published on [Failure is Inevitable](https://www.blameless.com/blog).

With the challenges we’re all facing right now, it can be hard to keep up with growing demand for our services. You need every tool in your toolbelt to conserve your team’s cognitive resources. Two ways to do this are automating toil out of your processes and prioritizing with SLOs (service level objectives).

Creating margin in the system to allow for adaptive capacity

Flexibility is crucial right now, but it’s difficult to create during a crisis if it didn’t exist beforehand. Organizations with less toil built into their processes are better set up to succeed than those with toil-intensive processes. This means it’s more important than ever to build some “margin” into our processes in order to remain flexible.

Brain space is at a premium during a crisis. With stress levels mounting, cognitive capacity is diminished. While teams may be too busy putting out fires to focus on automation, it’s actually more important than ever to decrease the cognitive load they face. Automation can also build a buffer between the productivity teams lose during this crisis and the need to perform at increased capacity. It makes it more likely that you can hold to a 50/50 split between engineering work and toil, giving you more room for innovation despite the constraints on resources.

Your team will also function better with decreased strain and toil. Richard Cook from Adaptive Capacity Labs notes that during this crisis, “Social spaces will become more tightly coupled. The effects of events and strains at work will transfer to home and vice versa. The influence of work on home (and home on work!) is usually moderated via social conventions. As stress saps energy it becomes more difficult to maintain boundaries.”

When toil becomes overwhelming, teams will lose energy and productivity. Automation helps build margin for your teams to recharge, take time with their families, and deal with this difficult time in a healthy way.

One way to bake in automation is with runbooks, which ease incident response. Here are some key steps to consider when creating automated runbooks:

Understand and map your system architecture: To create runbooks that automatically use a variety of services, you’ll need to understand how each service functions and how they connect. Map these connections and include information on how automation tools can control each service to lay a solid foundation for future runbooks.
Identify the right service owners: Once you’ve mapped out your architecture, you’ll need a repository of the owners of each service. This will help future runbook authors contact the right people for collaboration, advice, and sign-offs. Complex automated runbooks will work through many service areas, so involving the owners and experts of each space is a must.
Lay out key procedures and checklist tasks: Common tasks often have common steps. Subtask procedures like auditing, version control, and deployment are likely to overlap. Identify these key steps and clearly define their processes, then compile them into a list. Future runbook authors should use steps from this list when possible for consistency.
Identify methods to bake into automation: Now that you have a list of key procedures that recur in many tasks, you also have a great starting point for finding automation opportunities. Look for things that can be scripted, and ways to have scripts trigger subsequent scripts. Make your automated steps modular so they can be baked into a variety of runbooks.
Continue refining, learning, and improving: Resources like the architecture map, service owner repository, and list of common tasks aren’t to be created once and left untouched. Include updating these resources as a checklist task on procedures that would modify them, and also have regular checks to ensure they’re up to date. When you revisit them, take the opportunity to learn from them again, looking for new opportunities to automate and optimize.
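
As a rough illustration of the modular steps described above, here is a minimal Python sketch. Everything in it is hypothetical: the `Step` and `Runbook` classes, the step functions, and the team names are assumptions made for the example, not part of any particular runbook tool. It shows shared steps kept in one list, each tagged with a service owner, and a runbook assembled from them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """One reusable runbook step; `owner` is who to contact about it."""
    name: str
    owner: str
    action: Callable[[Dict[str, str]], str]


@dataclass
class Runbook:
    """An ordered collection of steps drawn from the shared list."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def run(self, context: Dict[str, str]) -> List[str]:
        """Run steps in order and stop at the first failure."""
        log: List[str] = []
        for step in self.steps:
            try:
                log.append(f"OK {step.name}: {step.action(context)}")
            except Exception as exc:
                # Point the responder at the step's owner, then stop.
                log.append(f"FAILED {step.name}: {exc} (contact {step.owner})")
                break
        return log


# Hypothetical step implementations; real ones would call your own tooling.
def check_version(ctx: Dict[str, str]) -> str:
    return f"{ctx['service']} is at version {ctx['version']}"


def record_audit(ctx: Dict[str, str]) -> str:
    return f"audit entry written for {ctx['service']}"


def restart_service(ctx: Dict[str, str]) -> str:
    if ctx.get("dry_run") == "true":
        return f"dry run, would restart {ctx['service']}"
    raise RuntimeError("restart is not implemented in this sketch")


# The shared list of common steps (team names are placeholders).
COMMON_STEPS = {
    "check_version": Step("check_version", "release-team", check_version),
    "record_audit": Step("record_audit", "security-team", record_audit),
    "restart_service": Step("restart_service", "platform-team", restart_service),
}

# A runbook is just a selection of shared steps in a chosen order.
restart_runbook = Runbook(
    "Restart a service",
    [
        COMMON_STEPS["check_version"],
        COMMON_STEPS["record_audit"],
        COMMON_STEPS["restart_service"],
    ],
)

for line in restart_runbook.run(
    {"service": "checkout", "version": "1.4.2", "dry_run": "true"}
):
    print(line)
```

Because each step carries its owner, a failed run tells the responder who to contact, and the same steps can be reused in other runbooks without copying their logic.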

In addition to automated runbooks, you can also use SLOs to help create margin through compassionate prioritization.

Using SLOs (compassionately) to drive prioritization

Margin can be built into your processes in other ways, too. One useful method is through error budgets and SLOs. SLOs are powerful tools for aligning teams on how to balance engineering work between new features and reliability needs. This shared agreement is more important now than ever. Richard Cook from Adaptive Capacity Labs predicts that during this crisis, “Tribalism will increase. Past success in producing a ‘no blame’ and ‘learning’ environment will come under severe pressure as the strain accumulates. Groups that previously worked in harmony may be at odds. Willingness to share productivity across groups will be sapped by the loss of resources and decreased performance.”

As teams experience unprecedented strain and are hit simultaneously with increases in unplanned work as well as reduced capacity, a game of tug of war could erupt. This means that even policies and metrics of success must change during this time. As such, SLOs and error budgets should be established with the team’s context in mind. As Alex said, “The best way to use the concept of an error budget isn’t that you have to actually have measurements, but rather that the concepts behind it give you a different way of thinking about things. And to have good discussions with people with that data and to help you make decisions based upon that.”

He also stressed the importance of revisiting a target whenever necessary, whether that’s due to an incident, a change in the code base, or a massive black swan event. Relaxing your error budget and compassionately setting flexible SLOs can help facilitate your team’s adaptive capacity, while improving shared prioritization of the work that matters most.
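
To make the effect of relaxing a target concrete, here is a minimal Python sketch of error budget arithmetic. It assumes a request-based SLO, where the error budget is the share of requests allowed to fail in a window. The function names and the traffic numbers are hypothetical and chosen only for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the window under the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return (1.0 - slo_target) * total_requests


def budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget left; a negative value means overspent."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - failed_requests / budget


# Hypothetical window: one million requests, 1,500 of which failed.
total, failed = 1_000_000, 1_500

# Compare a stricter target with a relaxed one for the same traffic.
for target in (0.999, 0.995):
    remaining = budget_remaining(target, total, failed)
    status = "overspent" if remaining < 0 else "within budget"
    print(f"target {target:.1%}: {status}, remaining {remaining:.0%}")
```

With the same traffic and the same failures, the stricter target shows the budget as overspent while the relaxed target still has room. That is the kind of difference a team can weigh when deciding whether a target still fits its current context.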

Blameless

Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.