
Getting SRE Buy-in from a Manager or Lead for Incident Response




    Adopting SRE best practices can be difficult, especially when you need approval from managers, VPs, CTOs, and more. In this blog post, we'll walk you through crafting a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.

    The situation

    As one of the first steps towards SRE adoption, incident management is key. You want to implement an effective incident management system within your team. Now it’s time to convince your lead/manager. How will you accomplish this?

    First, we need to recognize that your manager will need a lot of support from engineering and DevOps teams for this transition. These teams will need training in this incident management system to use it each time an incident occurs.

    Second, you need to define what you mean by incident management. We'll define incident management as the process of assembling the response team, investigating the incident, resolving it, and learning from it. This includes incident response playbooks, time-to-detection measurement, monitoring systems, and ticketing workflows.

    Once you have a handle on the basic proposal, it’s time to think about what the team (manager included) will gain from an incident management system.

    The incentives

    There are four incentives that will motivate your team to adopt incident management best practices:

    • Incident management best practices restore your systems as fast as possible when an incident occurs.
    • A playbook gives everyone a sense of control amidst the chaos. It defines a set of repeatable practices to drive consistency while helping everyone to be thorough with their problem-solving.
    • Measuring time to detection (TTD) and time to resolution (TTR) gives the manager a baseline and a way to quantify the team’s improvement over time.
    • Integration with alerting and ticketing systems reduces context switching between different apps. This lowers the stress from mentally keeping track of many systems.
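
    To make the TTD and TTR incentive concrete, here is a minimal Python sketch of how the two metrics could be computed from incident timestamps. The `Incident` record, its field names, and the sample data are all hypothetical, made up for illustration. The sketch also assumes TTD runs from the start of impact to detection and TTR runs from detection to restoration; teams define these differently, so adjust to match your own definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions for illustration.
    started_at: datetime   # when impact began
    detected_at: datetime  # when an alert fired or someone noticed
    resolved_at: datetime  # when service was restored


def time_to_detection(incident: Incident) -> timedelta:
    # Assumption: TTD is measured from start of impact to detection.
    return incident.detected_at - incident.started_at


def time_to_resolution(incident: Incident) -> timedelta:
    # Assumption: TTR is measured from detection to restoration.
    return incident.resolved_at - incident.detected_at


def median_minutes(durations: list[timedelta]) -> float:
    # Median is less sensitive than the mean to one unusually long incident.
    return median(d.total_seconds() / 60 for d in durations)


# Made-up sample incidents, not real data.
incidents = [
    Incident(datetime(2023, 1, 5, 2, 0), datetime(2023, 1, 5, 2, 20), datetime(2023, 1, 5, 3, 50)),
    Incident(datetime(2023, 1, 19, 14, 0), datetime(2023, 1, 19, 14, 10), datetime(2023, 1, 19, 15, 0)),
    Incident(datetime(2023, 2, 2, 9, 30), datetime(2023, 2, 2, 9, 36), datetime(2023, 2, 2, 10, 6)),
]

median_ttd = median_minutes([time_to_detection(i) for i in incidents])
median_ttr = median_minutes([time_to_resolution(i) for i in incidents])
print(f"median TTD: {median_ttd:.0f} min, median TTR: {median_ttr:.0f} min")
```

    Running the same calculation before and after a process change gives your manager a like-for-like comparison rather than an impression.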

    Yet, explaining these incentives to your manager and hoping for immediate support will not guarantee buy-in. You need to anticipate the resistance your manager will have towards this big change.

    The resistance

    Your manager might say, “Our current process is manual but good enough.” OSAGE syndrome, or “Our Systems Are Good Enough,” can be difficult to overcome. It’ll be up to you to change your manager’s mind and convince them that it’s time for something better than “just okay.”

    To make this argument, you’ll need to rely on both a factual, logical appeal, as well as an emotional one. While there is no one right answer to solve this problem, as every organization, team, and manager is different, there are some topics your manager might connect with better than others.

    Here, you’ll have to empathize and put yourself in your manager’s shoes. What would motivate you?

    The emotional appeal

    If you were responsible for a whole team and a major incident occurred, what would your first emotion be? Most likely, you would be afraid. While a culture of fear is not what you want when adopting SRE, it can help spur the adoption of important best practices. After all, if new processes can help reduce your manager’s fear by establishing safeguards and preparedness, that would appeal to them.

    One of the major sources of fear is loss of control. When an incident occurs, manual processes tend to break down. With the move to microservices, it can be hard to understand where the incident originated and how to mitigate it. Rollbacks are an option, but they don’t solve the underlying problem. Your manager is accountable both for returning the service to normal and for explaining why the incident happened in the first place.

    This responsibility is a considerable challenge. With a better incident management system, your service can be restored more quickly. And with automated runbooks, resolving the incident involves far less chaos. Faster and more consistent incident resolution can help your manager regain some control.

    Another source of fear is losing your team. If your teammates are waking up at 2:00 AM with no end in sight, morale will be low. Additionally, manual processes are toilsome and stressful. The team wants to see the process get less stressful over time, not worse as the number of services increases. Operational complexity is inevitable, but if it results in more incidents and unplanned work, it will lead to burnout and an unhealthy team culture.

    People will begin searching for other employment options if these issues are not resolved. When headcount drops and turnover rates soar, your manager will need to keep the ship sailing while drowning in the labor-intensive process of backfilling, hiring, and onboarding new engineers. This cycle is not sustainable, and is enough to keep your manager up at night.

    The logical appeal

    This is where you’ll need to tackle OSAGE syndrome. When your manager says, “The current process is manual but good enough,” ask them whether all of the process’s consequences are intended. Are the repeated 2:00 AM calls intentional? If the answer is no, then your system is not good enough.

    It’s important not to blame your manager for these struggles. After all, some of these issues are beyond their control. Systems have become more complex, and the bar is higher than ever. Instead of pointing fingers, it’s time to lay on some more logic. For this, you’ll need to provide your manager with two important pieces of evidence to promote adoption:

    • A service catalog listing the services and microservices you have and their dependencies. Show how these have grown and will continue to grow.
    • Trends in TTD and TTR, tracked during the proof-of-concept phase of the new incident management system. If the results are positive, you can justify rolling out the system and process changes to more teams.
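
    For the service catalog, even a small script can turn a list of services and dependencies into numbers a manager can follow. The sketch below is illustrative only: the `catalog` contents and service names are invented, and the `transitive_dependencies` helper is a hypothetical name, not part of any tool. It counts how many other services each service relies on, directly or indirectly.

```python
# Hypothetical catalog: service name -> services it calls directly.
# The names and structure are invented for illustration.
catalog = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud-check", "ledger"],
    "inventory": ["ledger"],
    "fraud-check": [],
    "ledger": [],
}


def transitive_dependencies(service: str, catalog: dict[str, list[str]]) -> set[str]:
    """Return every service reachable from `service`, directly or indirectly."""
    seen: set[str] = set()
    stack = list(catalog.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # skip services already visited (also guards against cycles)
            seen.add(dep)
            stack.extend(catalog.get(dep, []))
    return seen


for name in sorted(catalog):
    deps = transitive_dependencies(name, catalog)
    print(f"{name}: {len(deps)} downstream dependencies")
```

    Recording these counts at regular intervals is one way to show how the dependency footprint has grown.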

    Armed with emotional and logical appeals, you can approach your team lead and discuss improving your incident management system. This is a great first step towards SRE adoption, but you can’t stop here, or you’ll reach a local maximum that falls short in the long term. You’ll need to think about how to frame SRE adoption for the next level of leadership to gain the buy-in you need.

    Blameless

    Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.
