This Is the Most Underappreciated Skill for SREs
Originally published on Failure is Inevitable.
Delivering great software and sustainable systems is a team sport. Without the support of all stakeholders, adoption initiatives often fail. In successful initiatives, SREs are responsible for bringing together all resources and team members to help resolve reliability-related issues.
But bringing these resources together takes much more effort than people think. SREs engage in lots of glue work to ensure these collaborative efforts happen. Glue work refers to tasks that are essential to a project’s success, even if they don't contribute to the codebase.
Unfortunately, glue work often goes unrecognized as it can be more difficult to measure. It's important to learn what goes into glue work and focus on ways to appreciate those who do it. In this blog post, we’ll highlight some examples of glue work SREs perform: building a common language, forging connections, and establishing culture.
SREs align stakeholders’ goals with common language
For a project to succeed, people must be aligned around a shared goal. During complex projects where many stakeholders contribute, this can be challenging. Consider an organization-wide goal of boosting customer retention by 25%. Let’s look at how that goal might manifest in different teams:
Although each team’s work shares an ultimate goal, the tasks they work on end up being very different. Teams can become demotivated if their work isn’t reflected in the overall success of the organization. Additionally, teams can lose sight of the main goal: increased customer happiness.
SREs can help glue these projects together by establishing a common language. In episode 5 of the Resilience in Action podcast, Eric Roberts, Sr. Manager of SRE at Under Armour, discusses how important this was to his team's success. “We really needed a framework to get alignment across these teams, because they don't all work the same way. We also needed alignment on how we measure success for ourselves.”
Eric recommends SLOs and user journeys as a unifying language between teams. These provide metrics that illustrate customer satisfaction and represent organization-wide goals. At the same time, they also reflect changes made on specific technical projects. This helps the organization move together towards one goal. This achievement is so powerful that Eric describes it as his most gratifying. “For me, [the most gratifying thing] is establishing an idea or a goal and convincing everybody that this is the right thing to do. You see the momentum turn the corner and then everybody's talking about it.”
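To make the idea of SLOs as a shared language concrete, here is a minimal sketch of how one user journey's SLO might be tracked and turned into a sentence any stakeholder can read. It is an illustration only, not how Eric's team or any particular tool does it: the `JourneySLO` class, the journey names, the targets, and the event counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JourneySLO:
    """SLO for one user journey. All names and numbers are hypothetical."""
    journey: str       # e.g. "checkout"
    target: float      # promised fraction of good events, assumed 0 < target < 1
    good_events: int   # events that met the journey's success criteria
    total_events: int  # all events observed in the measurement window

    def __post_init__(self):
        if not 0 < self.target < 1:
            raise ValueError("target must be strictly between 0 and 1")

    def sli(self) -> float:
        """Service level indicator: fraction of events that were good."""
        if self.total_events == 0:
            return 1.0  # no traffic in the window, so nothing failed
        return self.good_events / self.total_events

    def error_budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        if self.total_events == 0:
            return 1.0  # nothing observed, so nothing spent
        allowed_bad = (1.0 - self.target) * self.total_events
        actual_bad = self.total_events - self.good_events
        return 1.0 - actual_bad / allowed_bad

    def is_met(self) -> bool:
        return self.sli() >= self.target


def summarize(slos):
    """Return one plain-language line per journey for a shared report."""
    return [
        f"{s.journey}: {s.sli():.2%} good (target {s.target:.2%}), "
        f"{'meeting' if s.is_met() else 'missing'} SLO"
        for s in slos
    ]


# Hypothetical data for two journeys owned by different teams.
checkout = JourneySLO("checkout", target=0.999, good_events=99_950, total_events=100_000)
login = JourneySLO("login", target=0.995, good_events=9_900, total_events=10_000)

for line in summarize([checkout, login]):
    print(line)
```

The point of the sketch is that each team can keep its own technical metrics while reporting against the same vocabulary: a journey, a target, and how much error budget is left.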
SREs build this common language through glue work practices such as:
- Developing shared classifications and policies for projects and incidents.
- Codifying knowledge by writing runbooks that are usable by many teams.
- Consolidating monitoring data into something meaningful to all stakeholders.
- Creating reports, stories, and presentations that make findings more accessible.
- Shoring up documentation by building standards and filling in gaps.
SREs bring people together in inspiring ways
One of the clearest examples of SRE glue work is bringing people together who might otherwise not meet. On the Resilience in Action podcast, Principal SRE at Gremlin Tammy Bryant talks about the game days her team runs at Gremlin. “We were running these really awesome game days where we would invite the entire company to come along and see... That actually worked really well for a long time.”
As Gremlin grew, the team needed to scale game days. The team worked together to create a new plan. “We use the Donut bot to match people into mini game days with three people running a game day together. The engineers are running the game day and we're there to take their feedback always. Everyone gets to say what they think should be done to improve. That's a really big thing I think is important. You've always got to listen to everyone. Because if people don't like it or don't want to do it, then you're going to hear about it. That's the same for every single SRE practice.”
This is a key mentality: give everyone a chance to share their feelings, as well as a chance to listen. Rather than run from it, embrace any criticism that emerges. These discussions encourage people to air grievances from their unique perspectives. They contextualize their challenges in a way that the other participants can understand. This allows team members to build an empathetic bridge and dig deeper into an incident's contributing factors or a spirited debate of ideas.
Though people can be wary of adding more meetings to the calendar, this is a critical opportunity to connect. Pitch these gatherings in a way that stimulates creativity. Eric recommends rebranding them as brainstorming sessions.
SRE practices lend themselves well to generating these opportunities to meet. Documentation, such as incident retrospectives, is built and reviewed collaboratively. Chaos engineering and other experiments require planning and review meetings. Inviting people from outside these teams can forge fruitful bonds and give internal stakeholders greater insight into the hard work the engineering team is doing.
SREs grow an empathetic, trusting culture
In the above examples, a common element was a cultural shift motivated by a practical change. This culture building is the most valuable part of SRE glue work, but also the most challenging. On the Resilience in Action podcast, Equinix Staff SRE Amy Tobey explains: “It always seems that the hardest part of doing SRE work isn't the technical stuff... These implementation processes are almost more of a cultural change than a technical implementation.”
It was a lesson learned out of necessity, as she describes the process of “hitting heads against the wall” in trying to improve reliability with technology alone. Finally, she had an epiphany. “If I'm going to fix this, I've got to do people work.” The glue work of SREs may not always seem connected to the bottom line. But in making these connections, a much more successful culture can emerge.
In her presentation Being Glue, Tanya Reilly explores the culture that glue work can create, and the challenges of taking on glue work. Glue work often falls on people who volunteer to complete it. People don’t volunteer equally—for example, a study showed that women volunteer 48% more often than men for work that is “non-promotable”. As glue work is often overlooked when assessing eligibility for promotion, it is susceptible to such biased distribution.
Recognizing and appreciating glue work needs to be foundational to your organization’s culture. People ending up responsible for work that will receive no recognition is a surefire route to burnout. Creating systems to fairly divide glue work is itself glue work. Those who take on glue work are also often responsible for anticipating and managing burnout. It isn’t a system that can correct itself, but one that requires the entire team to behave with empathy.
SREs can also help people practice empathy and trust in many other circumstances:
- When things go wrong, rather than point fingers, address the issue blamelessly.
- When goals seem misaligned, seek common ground in user satisfaction.
- Hear others’ unique perspectives and connect their challenges with yours.
When you consider the costs of a lack of trust, such as attrition, the value of this empathetic culture becomes obvious. SRE glue work builds practices that encourage these modes of collaboration. Tools like SLOs and incident retrospectives help people align their goals. With their goals aligned, engineering teams generate conversation, sparking new ideas and digging into issues. The cultural foundation of your organization is based on this glue work, so don’t overlook it.
Blameless provides the tools to make glue work for SREs easier, as well as more recognizable. Want to see how you can build a connected and empathetic culture while boosting reliability? Check out our customer stories such as Vital ER’s process and culture transformation. And if you’re ready to give Blameless a spin, try our free sandbox.