Promoting Continuous Learning with SRE

With the extreme changes we’ve all been through these last several months, it should come as no surprise that our jobs have changed drastically, too. We’re working remotely. We’re dealing with increased resource constraints. Our services are receiving more traffic than usual, and we’re tasked with keeping things up and running. Our work-as-done may not match what we did at the beginning of 2020. However, by prioritizing continuous improvement and learning, we can work through these issues and build more resilient socio-technical systems.

Enabling continuous improvement through work-as-done vs. work-as-imagined

Once you’ve figured out the logistics of working remotely, the actual work comes into focus. Prior to moving to remote work, you might have had your toil ratio at 50% or less and spent the majority of your time innovating. Perhaps you were shipping at an exceptional velocity. But now, you may be spending a lot of time on toil-heavy work, which reduces your capacity to keep up development velocity. Do your manager and your team know this? And what are the reasons behind it?

In a recent panel, Sr. Staff Site Reliability Engineer at LinkedIn Kurt Andersen explained it this way: “Work-as-done versus work-as-imagined is a concept that comes out of postmortems or incident retrospectives, and a learning-from-incidents mindset where one undertakes an exploratory journey to understand, as best you can, what actually happened as opposed to what you think happened.”

An example he raised is the vast amounts of time spent reviewing architecture documents that are essentially pre-implementation designs. He says, “One of the big problems is that those don't ever get updated to reflect the as-built system.”

When teams not only have to dig through documentation manually but also find that it is out of date, toil increases dramatically.

As Kurt Andersen said, “If you, as a senior director/VP of SRE, have this theoretical concept of SRE as 50% engineering and no more than 50% toil, that's a great construct to have in mind. But if that is your imagination and it doesn't align with the way that work is actually being done by the teams, you are setting your teams up for burnout and frustration and yourself for frustration when you can't align with what is actually hurting the teams and costing them time, effort, and brain space.”
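One way to check whether the imagined 50% split matches work-as-done is to tally where the hours actually went. The snippet below is a minimal sketch, not a prescribed method: the `WorkEntry` type, the `toil_ratio` helper, the `TOIL_BUDGET` constant, and the sample week are all hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class WorkEntry:
    """One logged block of work (hypothetical structure for illustration)."""
    description: str
    hours: float
    is_toil: bool


def toil_ratio(entries):
    """Return the fraction of logged hours spent on toil (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


# Assumed budget: no more than 50% of time spent on toil.
TOIL_BUDGET = 0.5

# Made-up sample week; real data would come from the team's own records.
week = [
    WorkEntry("Manual certificate rotation", 6.0, True),
    WorkEntry("Searching outdated architecture docs", 8.0, True),
    WorkEntry("Paging and ticket triage", 10.0, True),
    WorkEntry("Automating deploy rollback", 9.0, False),
    WorkEntry("Capacity planning design", 7.0, False),
]

ratio = toil_ratio(week)
over_budget = ratio > TOIL_BUDGET
print(f"Toil ratio: {ratio:.0%} (budget {TOIL_BUDGET:.0%}), over budget: {over_budget}")
```

In this invented week, 24 of 40 logged hours are toil, so the measured ratio is 60% against an imagined 50%. The number alone does not explain the gap, but it gives managers and teams a concrete starting point for the conversation.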

Managers and team leads need to stay alert to changes in operations and toil levels, and respond compassionately. As Kurt said, “If you want to implement something to make your SRE teams’ lives better, it's really important to understand what those lives actually consist of before you try to make them better, not just what you hypothetically want them to be.”

Communication here is key. Managers and leads need to be ready to listen, even more so when teams are remote. Strong communication is crucial to surface insights that make your teams and organization more resilient.

Learning for the long term

The organizations that are able to learn, pivot, and make improvements are the ones that succeed despite extreme circumstances. Organizations that struggle to embrace learning are less likely to recover from failure. In an SRE panel, Sr. Director of Engineering at Google Dave Rensin gave his thoughts on how this will be reflected given the current circumstances: “The ones who don't learn will self-select for extinction.”

This means making adjustments to culture and processes. Dave also notes, “The most useful thing we can do from here is to ask what principles, what practices, what cultural norms do we want to drive into our companies, so that the next generations don't have to remember the specifics of this incident in order to get the value of what we learned from it.” But how do we bake continuous learning into our processes?

Craig Sebenik, SRE at Aurora, has an example. He spoke with us about how he has learned to vet and onboard new SaaS vendors through a constantly evolving process, iterating on best practices from larger organizations. According to Craig, “One of the biggest things you can do is try to take lessons from the big companies, not because they're necessarily doing things right, but because they've seen all kinds of weird things. But don't take them wholesale. Take what they say, figure out what applies to you. Take those pieces and continue to evolve them. Don't remain static.”

Iteration is key, especially after unexpected events like incidents. Treating crises as learning opportunities can help organizations adapt and prepare for the future of work.

Conclusion

Richard Cook from Adaptive Capacity Labs predicts that “The next 6 to 12 months will be a period of intensive learning. The organizational buffeting that is in store for us will reveal much about how well and how poorly our systems can function. It is essential for survival to learn quickly what does and does not work, what new vulnerabilities are cropping up, and how to reshape the tech to meet these demands.”

In times of uncertainty, weaknesses in the system are brought to the fore. The first instinct is to regard them as failures, but they should also be acknowledged as opportunities. What remains constant is the need for growth, learning, compassion, and a focus on the well-being of humans. If we can anchor on these values, we will set up our teams and organizations for success, regardless of what unexpected circumstances arise in the future.

Blameless

Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.