Getting SRE Buy-in from a Manager or Lead for Incident Response
Adopting SRE best practices can be difficult, especially when you need approval from managers, VPs, CTOs, and more. In this blog post, we'll walk you through crafting a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.
As one of the first steps towards SRE adoption, incident management is key. You want to implement an effective incident management system within your team. Now it’s time to convince your lead/manager. How will you accomplish this?
First, we need to recognize that your manager will need a lot of support from engineering and DevOps teams for this transition. These teams will need training in this incident management system to use it each time an incident occurs.
Second, you need to define what you mean by incident management. We'll define incident management as the process of assembling responders, investigating, resolving, and learning from incidents. This includes incident response playbooks, measuring time to detection, monitoring systems, and ticketing workflow.
Once you have a handle on the basic proposal, it’s time to think about what the team (manager included) will gain from an incident management system.
There are four incentives that will motivate your team to adopt incident management best practices:
- Incident management best practices restore your systems as fast as possible when an incident occurs.
- A playbook gives everyone a sense of control amidst the chaos. It defines a set of repeatable practices to drive consistency while helping everyone to be thorough with their problem-solving.
- Measuring time to resolution (TTR) and time to detection (TTD) allows the manager to quantify the team’s improvement over time.
- Integration with alerting and ticketing systems reduces context switching between different apps. This lowers the stress from mentally keeping track of many systems.
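To make the TTD and TTR incentive concrete, here is a minimal sketch of how the two metrics could be computed from incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical, and the sketch assumes TTD runs from the start of impact to detection and TTR from detection to resolution; your team may define these boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative, not from any specific tool.
    started_at: datetime   # when the impact began
    detected_at: datetime  # when an alert fired or a person noticed
    resolved_at: datetime  # when service was restored

    @property
    def ttd(self) -> timedelta:
        # Assumption: time to detection runs from start of impact to detection.
        return self.detected_at - self.started_at

    @property
    def ttr(self) -> timedelta:
        # Assumption: time to resolution runs from detection to resolution.
        return self.resolved_at - self.detected_at


def median_minutes(durations: list[timedelta]) -> float:
    """Median of a list of durations, expressed in minutes."""
    return median(d.total_seconds() / 60 for d in durations)


# Made-up sample incidents, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 12), datetime(2024, 1, 5, 3, 30)),
    Incident(datetime(2024, 1, 19, 14, 0), datetime(2024, 1, 19, 14, 4), datetime(2024, 1, 19, 14, 50)),
    Incident(datetime(2024, 2, 2, 9, 0), datetime(2024, 2, 2, 9, 8), datetime(2024, 2, 2, 10, 0)),
]

median_ttd = median_minutes([i.ttd for i in incidents])
median_ttr = median_minutes([i.ttr for i in incidents])
print(f"median TTD: {median_ttd:.0f} min, median TTR: {median_ttr:.0f} min")
```

The median is used here rather than the mean so that a single unusually long incident does not dominate the number you show your manager.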
Yet, explaining these incentives to your manager and hoping for immediate support will not guarantee buy-in. You need to anticipate the resistance your manager will have towards this big change.
Your manager might say, “Our current process is manual but good enough.” OSAGE syndrome, short for “Our Systems Are Good Enough,” can be difficult to overcome. It’ll be up to you to change your manager’s mind and convince them that it’s time for something better than “just okay.”
To make this argument, you’ll need to rely on both a factual, logical appeal and an emotional one. There is no single right answer, since every organization, team, and manager is different, but there are some topics your manager might connect with better than others.
Here, you’ll have to empathize and put yourself in your manager’s shoes. What would motivate you?
The emotional appeal
If you were responsible for a whole team and a major incident occurred, what would your first emotion be? Most likely, you would be afraid. While a culture of fear is not what you want when adopting SRE, it can help spur the adoption of important best practices. After all, if new processes can help reduce your manager’s fear by establishing safeguards and preparedness, that would appeal to them.
One of the major sources of fear is loss of control. When an incident occurs, current manual processes fail. With the move to microservices, it can be hard to understand where the incident originated and how to mitigate it. Rollbacks are an option, but they don’t solve the underlying problem. Your manager is accountable for returning the service to normal and for explaining why the incident happened in the first place.
This responsibility is a considerable challenge. With a better incident management system, your service can be back up and running sooner. And with automated runbooks, resolving the incident can involve far less chaos. Faster and more consistent incident resolution can help your manager regain some control.
Another source of fear is losing your team. If your teammates are waking up at 2:00 AM with no end in sight, morale will be low. Additionally, manual processes are toilsome and stressful. The team wants to see the process getting less stressful over time, not worse as the number of services increases. Operational complexity is inevitable, but if it results in more incidents and unplanned work, it will lead to burnout and an unhealthy team culture.
People will begin searching for other employment options if these issues are not resolved. When headcount drops and turnover rates soar, your manager will need to keep the ship sailing while drowning in the labor-intensive process of backfilling, hiring, and onboarding new engineers. This cycle is not sustainable, and is enough to keep your manager up at night.
The logical appeal
This is where you’ll need to tackle OSAGE syndrome. When your manager says, “the current process is manual but good enough,” ask them if all the process’s consequences are intended. Are the repetitive 2 AM calls purposeful? If the answer is no, then your system is not good enough.
It’s important not to blame your manager for these struggles. After all, some of these issues are beyond their control. Systems have become more complex, and the bar is higher than ever. Instead of pointing fingers, it’s time to lay on some more logic. For this, you’ll need to provide your manager with two important pieces of evidence to promote adoption:
- A service catalog listing the services and microservices you have, along with their dependencies. Show how these have grown and will continue to grow.
- Trends in TTD and TTR, tracked during the proof-of-concept phase of the new incident management system. If the results are positive, you can justify rolling out the system and process changes to more teams.
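As a rough illustration of how you might summarize these two pieces of evidence, the sketch below counts services and dependency links in two catalog snapshots, then compares median TTD and TTR before and during the proof of concept. All names and numbers are hypothetical; real figures would come from your own catalog and incident records.

```python
from statistics import median

# Hypothetical service catalog snapshots: service name -> direct dependencies.
catalog_last_year = {
    "checkout": ["payments"],
    "payments": [],
}
catalog_today = {
    "checkout": ["payments", "inventory", "auth"],
    "payments": ["auth"],
    "inventory": [],
    "auth": [],
}


def catalog_size(catalog):
    """Return (number of services, number of dependency links)."""
    services = len(catalog)
    links = sum(len(deps) for deps in catalog.values())
    return services, links


def percent_change(before, after):
    """Relative change from before to after; negative means a reduction."""
    return (after - before) / before * 100


# Hypothetical TTD/TTR samples in minutes, before and during the proof of concept.
baseline = {"ttd": [15, 12, 20, 18], "ttr": [90, 120, 75, 110]}
pilot = {"ttd": [6, 9, 5, 8], "ttr": [45, 60, 50, 40]}

ttd_change = percent_change(median(baseline["ttd"]), median(pilot["ttd"]))
ttr_change = percent_change(median(baseline["ttr"]), median(pilot["ttr"]))

print("services, links last year:", catalog_size(catalog_last_year))
print("services, links today:", catalog_size(catalog_today))
print(f"median TTD change: {ttd_change:.1f}%")
print(f"median TTR change: {ttr_change:.1f}%")
```

With only a handful of incidents in a pilot, treat these percentages as a directional signal rather than proof, and say so when you present them.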
Armed with emotional and logical appeals, you can approach your team lead and discuss improving your incident management system. This is a great first step towards SRE adoption, but you can’t stop here, or you’ll reach a local maximum that falls short in the long term. You’ll need to think about how to frame SRE adoption for the next level of leadership to gain the buy-in you need.
Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.