
How to Build Your SRE Team




    As you implement SRE practices and culture at your organization, you’ll realize everyone has a part to play. From engineers setting SLOs, to management upholding the virtue of blamelessness, to marketing teams conducting retrospectives on email campaigns, there’s no part of an organization that doesn’t benefit from the SRE mentality.

    However, while it isn’t necessary to have people with the title of ‘SRE’ in order to successfully adopt the best practices of SRE, having people who are dedicated to the stewardship of SRE practices is important for achieving reliability excellence. In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

    Common pathways to becoming an SRE

    When staffing your SRE team, looking only for self-described SREs may be too narrow. People from a wide variety of backgrounds can learn the tenets of SRE while bringing their own expertise to the role. Here are some examples of career paths that could make people a great fit for SRE:

    • Software developers understand the value of reliability metrics and are accustomed to solving optimization problems based around them
    • System administrators take a holistic perspective to entire system architecture and proactively address reliability issues, such as potential downtime
    • System design engineers create efficient procedures that cut through complex systems, which helps when writing runbooks and other incident response plans
    • Quality assurance engineers have a test-oriented mindset, ensuring that systems stay reliable in the most adverse conditions
    • Database administrators are accustomed to optimizing the storage and reliability of huge data systems, making them well suited for the challenges of reliably scaling

    Although SREs commonly emerge from other tech disciplines, the mindset of SRE can be appreciated and embraced by anyone. SREs can emerge from backgrounds as diverse as communications, business studies, and the arts. By looking at their own challenges of reliability through the lens of SRE, they can contribute unique insights.

    SREs as engineers of reliability

    SRE is a holistic discipline that involves many skills outside of writing code. Nevertheless, a major role your SREs will play is that of a software engineer, building systems and software to improve reliability. Ensure that your prospective SREs understand the languages and architecture of your systems. Even if they aren’t writing much code, they need to understand how development decisions impact systems.

    SREs can work with development teams to “develop for reliability.” This involves considering how development will impact key reliability metrics, measured by SLOs and error budgets. These metrics reflect the most fundamental levels of coding and architecture configuration, so to work with them, SREs will have to understand how potential development directions will impact the entire stack, from top to bottom.
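
    To make the relationship between an SLO and its error budget concrete, here is a minimal sketch. It assumes a request-based SLO measured over a fixed window; the class name, field names, and the numbers in the usage note are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative names)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever share of traffic the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / allowed


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

    With a 99.9% target over one million requests, about 1,000 failures are allowed, so 250 failures leave roughly three quarters of the budget. A team might use a figure like this to decide whether a risky release can go ahead or whether reliability work should take priority.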

    Because of this “big picture” approach to development, looking for SREs with a strong systems engineering background can be helpful. In “Hiring Site Reliability Engineers” for ;login:, the USENIX magazine, Google employees Chris Jones, Todd Underwood, and Shylaja Nukala detail their technical hiring process for SREs. They break down how SREs with the ability to form connections throughout complex systems can make up for missing expertise in specific software systems. Through a combination of holistic systems analysis and detailed examination of ramifications in the lowest level of code, SREs can fully understand the relationship between development and reliability.


    SREs as stewards of reliability

    At the heart of decision-making is data. Without complete and accurate data about how your system operates, it’s impossible to know where to prioritize development efforts. Another key role of SREs is collecting, refining, and analyzing this data. There are many monitoring tools available that can help extract and visualize data from your system. SREs can transform this data into something actionable.

    A key example of this transformation is creating SLIs. SLIs combine low-level monitoring data into a single metric that reflects business impact, which is then used to set SLOs. SLIs and SLOs should be determined and reviewed by large teams of people, but SREs can be a bridge of knowledge for those teams. SREs can connect different domains of technical and business expertise to find the most impactful indicators and guardrails of reliability.
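
    As a rough illustration of turning low-level monitoring data into an SLI, the sketch below counts a request as “good” when it neither fails on the server side nor exceeds a latency threshold. The sample type, the 300 ms threshold, and the function names are assumptions made for this example; your own definition of a good event should come out of the cross-team review described above.

```python
from typing import Iterable, NamedTuple


class RequestSample(NamedTuple):
    """One request as recorded by a monitoring tool (illustrative shape)."""

    status_code: int
    latency_ms: float


def request_sli(samples: Iterable[RequestSample],
                latency_threshold_ms: float = 300.0) -> float:
    """Return the fraction of requests that were both successful and fast."""
    samples = list(samples)
    if not samples:
        raise ValueError("cannot compute an SLI from zero samples")
    good = sum(
        1
        for s in samples
        if s.status_code < 500 and s.latency_ms <= latency_threshold_ms
    )
    return good / len(samples)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is simply a target value that the SLI is compared against."""
    return sli >= slo_target


if __name__ == "__main__":
    window = [
        RequestSample(200, 120.0),
        RequestSample(200, 450.0),   # too slow, counts against the SLI
        RequestSample(503, 80.0),    # server error, counts against the SLI
        RequestSample(404, 95.0),    # client error, still a "good" response here
    ]
    sli = request_sli(window)
    print(f"SLI: {sli:.2%}, meets 99% SLO: {meets_slo(sli, 0.99)}")
```

    Whether a 404 or a slow-but-successful request should count as good is exactly the kind of question where SREs can connect technical and business perspectives.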

    Monitoring data and the indicators and objectives built from it, like SLIs and SLOs, should be readily accessible and comprehensible to your entire team, but SREs will have a special relationship with them. Acting as the stewards of reliability, they advocate for maintaining SLOs and other best practices that shift quality left so teams can scale sustainably. When hiring SREs for this role, don’t just look for expertise in your particular monitoring or other tools, as your tech stack tomorrow will look different from the one you have today. Instead, look for people who understand the importance of putting data in the right context, and how to persuade others to adopt best practices.

    SREs as leaders who align reliability with business needs

    As you develop your SRE solution, you’ll find yourself building up a framework of policies and practices: review and revision cycles, ownership maps, incident classifications and response procedures, and more. These should be understood, agreed upon, and adopted by the entire team, but SREs can serve as leaders in fine-tuning this framework and keeping it running. SREs can help develop and implement procedures in many areas of SRE, including:

    • SLO and error budget review
    • Incident classification review
    • Runbook creation and review
    • Incident retrospective practices
    • On-call scheduling policies
    • Security audits
    • Chaos engineering test procedures

    For each of these categories, all stakeholders should be consulted. SREs serve as a bridge between each stakeholder’s domain expertise and the wider impact on reliability metrics for the entire business.

    Once policies are in place, SREs can take the lead on ensuring they’re followed. As “reliability educators,” SREs can conduct internal audits to make sure that incident retrospectives contain the necessary data, that follow-up tasks are being completed, and that runbooks receive their scheduled updates. Of course, these audits would be conducted blamelessly. In socio-technical systems, if certain procedures aren’t being upheld, the fault most likely lies in how the procedures themselves have been set, not in the individuals who didn’t follow them.
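
    An internal audit like this can be partly automated. The sketch below checks a retrospective for empty sections and overdue follow-up tasks. The required section names and the data shapes are hypothetical; substitute whatever your own retrospective template contains. Note that the findings describe gaps in the document and the process, not the people involved, in keeping with a blameless approach.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date

# Hypothetical template: the sections every retrospective is expected to fill in.
REQUIRED_SECTIONS = ("summary", "timeline", "contributing_factors")


@dataclass
class FollowUp:
    description: str
    due: date
    done: bool = False


@dataclass
class Retrospective:
    incident_id: str
    sections: dict[str, str]
    follow_ups: list[FollowUp] = field(default_factory=list)


def audit(retro: Retrospective, today: date) -> list[str]:
    """Return blameless findings: what is missing, not who missed it."""
    findings = []
    for name in REQUIRED_SECTIONS:
        # Treat a section as missing if it is absent or only whitespace.
        if not retro.sections.get(name, "").strip():
            findings.append(f"{retro.incident_id}: section '{name}' is missing or empty")
    for task in retro.follow_ups:
        if not task.done and task.due < today:
            findings.append(
                f"{retro.incident_id}: follow-up '{task.description}' was due {task.due.isoformat()}"
            )
    return findings


if __name__ == "__main__":
    retro = Retrospective(
        incident_id="INC-1",
        sections={"summary": "Checkout latency spike", "timeline": ""},
        follow_ups=[FollowUp("Add alert on queue depth", due=date(2023, 1, 10))],
    )
    for finding in audit(retro, today=date(2023, 2, 1)):
        print(finding)
```

    If the same gap shows up across many retrospectives, that is a signal to revisit the template or the process rather than to chase individual authors.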

    SREs in this role don’t need to be experts on every category listed above, although some familiarity is necessary. Each team’s adoption of best practices will be unique, and teams should embrace context over control so team members are empowered to make the best decisions they can in dynamic situations. Most importantly, SREs need a collaborative attitude and a willingness to consider the concerns of others. Ask prospective SREs how they’d handle a disagreement over how a policy should be developed, or an incident where people were found to be negligent in following policy. Understanding their attitude in such situations can be just as useful as assessing their technical expertise.

    SREs as ambassadors of reliability culture

    SREs should embody the cultural lessons of SRE in every role they play. Decisions they make around policy or development should always reflect these values—not just implicitly, but as a stated element of their decision-making process. That means fostering an environment of empathy, ownership, and trust.

    It can be difficult to determine how a prospective hire will align to these cultural values. These beliefs are unlikely to show up on someone’s resume or transcript. To start diving deeper into their attitudes, here are some questions to consider when evaluating a prospective SRE:

    • How do they approach reliability goal setting?
    • What value do they place on failure?
    • How do they work to attribute error without blame?
    • What about incident retrospectives do they find most valuable?
    • How do they approach situations where others may come to them with concerns about teams, reliability, or other subjects?

    Try asking about hypothetical situations where these beliefs would be tested. Ask them to explain the values behind their decisions, then probe even further, asking why these values are beneficial to the team. Experienced SREs should be able to connect policies with cultural and business outcomes, and advocate for healthy reliability practices. For example, they’ll be able to connect the dots between investments in things like SLOs, documentation, and toil automation, and how they ultimately lead to shared context and improved morale. 

    Common team structures

    When you start an SRE team, where it sits within the rest of your engineering organization will depend on your organization's operational maturity, culture, and needs. Here are a few of the common structures:

    SRE model with dedicated engineers focused on infrastructure and/or tooling (shared services, observability, etc.)

    This configuration has the SRE team reinforcing efforts across the organization. They maintain services used by many different development teams without focusing on any specific project. With this configuration, the productivity and reliability of many projects can be improved at once. However, there may not be the resources to address specific reliability needs.

    Embedded SRE model where full time SREs are assigned to a product/service

    In this configuration, each product or service team is assigned some number of SREs to address their specific reliability requirements. This allows greater flexibility in allocating SRE resources, because you can focus on areas with the biggest business impact. Embedded SREs should still communicate with one another to keep their practices consistent across teams.

    Distributed SRE model of SREs as consultants or stewards of reliability standards

    This configuration has your SREs serving as consultants for reliability issues across the organization. SREs can still be centers of knowledge for particular products or services, but aren’t embedded into development teams. Instead, they work to keep services to agreed-upon reliability standards, and consult with engineers to achieve them.

    No matter how large your team grows, you’ll find that good tools will empower your SREs to respond to incidents effectively and develop for reliability proactively. To see how Blameless can help level up your SRE solution, check out our demo!

    Blameless

    Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.
