History, Principles, and Implementation of SRE


Site Reliability Engineering (SRE) is a set of practices that applies the approach used in software engineering to operations. Implementing SRE in a company fast-tracks growth by enabling seamless operations between the various teams in the organization.

It is often done by introducing automation or structure that streamlines the effort and focus of teams towards a common goal. To fully understand the functionality of these practices, let’s look at why and how they came about.

A brief dive into the history of SRE practices.

Site Reliability Engineering was introduced by Google executive Ben Treynor Sloss. He was asked to oversee a production team that would ensure the reliability, availability, and ease of use of Google’s services. The team had seven people, who took on operations as part of their tasks to better understand how the software works.

The team’s working method focused on reducing the points of friction it frequently experienced with the operations team. It did this by creating a connecting point between both teams.

Both teams had different goals that would often clash during implementation. Introducing this mode of operation helped define the teams’ goals whenever new software was to be released.

Since its introduction, SRE has morphed from simply balancing operations and production into creating systems that all teams within the company can align with.

Google began this practice in 2003 and later shared it with the rest of the world, most notably in its 2016 book Site Reliability Engineering. A survey conducted in 2021 by the DevOps Institute found that 22% of the 2,000 companies surveyed had since adopted it.

Understanding SRE principles.

The seven main principles that guide SRE are explained in Google’s SRE handbook, and we’ll look into them individually.

Consistency

The goal of SRE is to ensure that every practice is followed across the different teams with the same focus. Ensuring consistency involves the following points:

Choosing targets
Avoiding absolutes
Putting control measures in place
Having few service level objectives
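
As a rough illustration of these points, here is a minimal Python sketch that assumes a request-based availability objective. The names SLO, sli, and meets are hypothetical and do not come from Google's handbook. Rejecting a 100% target is one way to express "avoiding absolutes" in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A service level objective: a target fraction of good events."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        # "Avoiding absolutes": a 100% target leaves no room for change,
        # so this sketch refuses it outright (an assumption, not a rule
        # taken from the handbook).
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def meets(slo: SLO, good_events: int, total_events: int) -> bool:
    """Return True when the measured indicator is at or above the target."""
    return sli(good_events, total_events) >= slo.target


if __name__ == "__main__":
    availability = SLO(name="availability", target=0.999)
    print(meets(availability, good_events=9995, total_events=10000))
```

Keeping the number of such objectives small makes it easier for every team to agree on what "reliable enough" means.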

Design of systems

Self-sufficiency is taken into consideration when designing SRE systems. Systems are built to scale when necessary and provide high velocity while not compromising on policies and procedures.

Monitoring

The handbook defines monitoring as “collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.”

The principles guiding monitoring aim for a monitoring system that is easy to interpret and does not trigger unnecessary alerts that interrupt people.

The four signals that should be monitored are latency, traffic, errors, and saturation. These signals are observed across different sections of the infrastructure.
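
To make the four signals concrete, the sketch below summarizes them for one time window in Python. It is only an illustration under stated assumptions: the Request record, the golden_signals function, the choice of the 95th percentile for latency, counting HTTP 5xx responses as errors, and using CPU as the saturated resource are all hypothetical choices, not something the handbook prescribes.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code returned


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used: float, cpu_capacity: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")

    # Latency: nearest-rank 95th percentile of request durations.
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(latencies)))
    p95 = latencies[rank - 1]

    # Errors: share of requests that failed on the server side (5xx).
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_s,  # requests per second
        "error_rate": errors / len(requests),
        "saturation": cpu_used / cpu_capacity,    # fraction of capacity in use
    }


if __name__ == "__main__":
    sample = [Request(100.0, 200)] * 9 + [Request(900.0, 500)]
    print(golden_signals(sample, window_s=10.0, cpu_used=3.0, cpu_capacity=4.0))
```

In practice a monitoring system would compute these continuously and alert only when a signal threatens an objective, which keeps unnecessary interruptions down.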

Automation

Automation in SRE is done thoughtfully and must be justified by the area in which it is applied. Good automation meets the following criteria:

It should be able to provide value that improves consistency
It should provide a platform that can be applied to multiple systems
It must be easy to repair
It should save time

Error budgets

When building a reliable flow, the SRE and development teams have different perspectives on work input and output. This often leads to friction, so the teams need to come together and define a flexible boundary around it. The terms defining this boundary include:

The right amount of tolerance for errors and unexpected events: not too little, not too much
Balance in the required amount of testing
Reducing the frequency of pushes
How often new releases should be tested on small sections of work

Defining these parameters provides a common goal for the teams involved, and provides an adequate balance between innovation and reliability.
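
One common way to make this boundary concrete is to express it as an error budget: the downtime that an availability target still permits over a fixed window. The Python sketch below is an assumption-laden illustration, not Google's implementation. The function names, the 30-day window, and the rule of pausing releases once the budget is spent are all hypothetical choices.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def can_release(slo_target: float, downtime_minutes: float,
                window_days: int = 30) -> bool:
    """Hypothetical policy: allow new releases only while budget remains."""
    return budget_remaining(slo_target, downtime_minutes, window_days) > 0.0


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    print(can_release(0.999, downtime_minutes=10.0))
```

With a shared number like this, developers can spend the budget on faster releases while the SRE team can call for a slowdown once it runs out.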

Blameless post-mortem

A blameless post-mortem is a document written after an incident has been handled. It is an integral part of the SRE principles, and the team uses it to fix root problems when incidents keep repeating themselves.

Simplicity

In this chapter of SRE principles, the writer highlights software simplicity as a prerequisite for automation. Simplicity involves building software systems that make production faster without restricting the speed of innovation.

When employing automation in SRE, it helps if the software is stable, agile, and reliable. To achieve this, it is important to write simple code in which bugs are easy to find and fix when they surface.

DevOps and SRE

The major difference between the two concepts is how they relate to the different silos or teams within the organization. DevOps is practiced by breaking down the silos and having the whole team function as one, while SRE focuses on building tools or systems that help the different teams function together.

The SRE principles are designed to reduce the risk of failure within the team and reduce the possibility of accidents and incidents with products. They encourage automation of processes just like DevOps and focus on minimizing the occurrence of manual work.

SRE measures productivity, availability, and outages in relation to the amount of toil put in, rather than measuring every aspect of production as DevOps does.

SRE practices are more focused on internal processes within the organization, and they require contributions from all sectors within the organization, including management.

Reducing workload is an integral part of SRE functions. SRE teams find ways to remove repetitive work that workers do not like, e.g., submitting expenses. These repetitive tasks increase as the company expands and become cumbersome for developers and engineers, hence the need for a different team to manage them.

How to organize your SRE team.

When organizing an SRE team in your organization, first consider the types of SRE team implementations defined in this Google article and find the one best suited to your team.

Let’s take a look at them.

Kitchen sink SRE

In this type of SRE team, the area of operations in which SRE principles are to be applied is not specified. During the earliest days of SRE, this type was a more common practice. It is best suited for teams who need SRE practices but are too small to have multiple SRE teams in different silos.

Organizations that practice this type of SRE find it easy to identify patterns and similarities within the different sections of the organization. Communication flows easily, and creating solutions across teams is easier.

However, this type of SRE model is hard to maintain as the company grows. Larger organizations often mean deeper problems, and relying heavily on one team leads to shallow solutions, so it is not advised.

Infrastructure teams

SRE infrastructure teams are focused on making other teams work at their best. They maintain and improve shared infrastructure, such as the Kubernetes clusters used by developers, security, and operations teams. They also maintain work pipelines, such as CI/CD, that are built on public clouds.

They are best suited for organizations with multiple development teams. They define common standards and focus solely on improving reliability by providing the company with the best reliability practices.

They improve production standards and keep other teams up to date on what has been done. They simplify production and delivery processes. They, however, might not offer solutions that improve customer experience as they are not in direct contact with the customer. This type of practice is often augmented with a more invasive form of SRE like Product or Embedded SRE.

SRE tools teams

An SRE tools team is very similar in function to an infrastructure team, but it is a tools-only team. It focuses on building software that helps other teams and on creating systems that support reliability planning in an organization. It receives more direct feedback than an infrastructure team, and hence functions better.

They are more focused on automating processes; hence they receive accurate metrics that measure progress and functionality.

The organizations that benefit the most from tooling SRE teams are large ones that require a lot of automation that isn’t currently available as a service. These teams provide the sort of structure that benefits fast-growing teams. However, this tends to lead to team members being overworked.

Product/Application SRE teams

This type of SRE team is focused on improving the output of an organization. They are structured to provide services that improve the organization’s business.

They provide a clear focus for all the team’s efforts. They direct all of the team’s energy towards making decisions that directly profit the organization. They are often used as a secondary SRE team by companies who already have any of the SRE teams discussed above.

Embedded SRE teams

SRE teams that function with this structure usually have at least one member in the organization’s major production or development department. Most times, they are on the team for the project’s duration and are very involved in steering activities toward a particular direction. This mode of SRE is very effective and helps the teams solve specific issues very quickly.

Companies looking to introduce SRE implementation into their workflow would benefit the most from this type of SRE implementation. Also, organizations that do not need an around-the-clock SRE team would benefit from this type of implementation. When combined with other SRE teams, they help improve the adoption of reliability processes.

Consulting SRE implementation teams

This SRE team functions similarly to an embedded team; however, they do not modify the production processes directly. They function as part of the other teams by building tools that assist production. They are a hybrid of the embedded team and tooling team.

They increase the adoption of SRE practices in the organization, but they might not have a sufficient impact on the company’s production flow. They are sometimes regarded as being hands-off and having no direct impact. They are beneficial to large organizations whose SRE implementation teams need closer support to drive implementation in other departments.

Irrespective of which SRE team you wish to build, it is advised that the depth of engagement is specified. In organizations with multiple development and production teams, having multiple SRE teams with different levels of involvement would be best.

Wrapping everything up

A Site Reliability Engineer is usually multidisciplinary. SRE teams are consequently cross-functional teams with expertise in development, operations, observability, and security. The main role of SRE teams is to make the enterprise platform reliable.

As noted, reliability involves different disciplines. However, technical skill alone is not enough, as cultural aspects are just as important. A culture of collaboration, a philosophy of substituting automation for human labor, and a reliability mindset are all important things to consider when hiring SREs.

It is worth noting that forming teams with the required culture and skills is a complex task. This is why many companies focus their efforts on building their product and outsource reliability work to cloud computing providers. With the rise of managed services, no-code, and no-ops tools, many SRE tasks are outsourced to platforms that a development team can use.

Wildcard is one of these platforms. It is a no-code platform that helps organizations and developers, even those without DevOps and SRE experience or coding knowledge, implement reliability and stability practices and build, deploy, and manage applications without writing a single line of code.

Start for free by signing in using GitHub or GitLab.