How to Scale End-to-End Observability in AWS Environments

LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations

    Kurt Andersen is an engineer who is fascinated by how entire systems interrelate. Through his work at NASA, IBM, HP, and now LinkedIn, Kurt distills insights on how to make hundreds of constantly moving parts work together. Blameless interviewed Kurt to shine light on the blind spots that companies often have when implementing SRE.

    Besides his role as a senior staff site reliability engineer at LinkedIn, Kurt also sits on the board of USENIX, an organization that hosts many conferences that bring together top professionals in the computing world, including SREcon.

    Here are the key nuggets of SRE wisdom from Kurt in the interview.

    SRE = available + secure

    Availability has the main spotlight whenever people explain the purpose of Site Reliability Engineering. However, LinkedIn shares the spotlight with an additional emphasis: security. The SRE team at LinkedIn works to keep the site available and secure. Data privacy and integrity are top priorities to LinkedIn’s SRE team.

    Differentiating DevOps vs. SRE

    Many DevOps engineers are still convincing their organizations of the value of continuous integration (CI) and continuous delivery (CD). CI and CD are designed to do things faster, but that does not always mean doing the right things.

    SRE teams focus on business success. Organizations with SRE teams tend to already have CI/CD as a staple, rather than a source of resistance. SRE builds on top of CI/CD and ensures that whatever moves fast contributes to business success. (See chapter 22 in the book Seeking SRE for detailed explanations from Kurt.)

    Key Success Factor to SRE

    Culture. A blameless culture is one that encourages learning and continuous improvement.

    Feature Developers’ Blindspot: Retirement of their Services

    Most feature developers don’t plan for the retirement of features. Microservices give you the illusion that you can yank and replace, but that’s not really the case. It’s tough to turn off a microservice without losing an arm or a leg. That’s why it’s important for SREs to have a full life cycle engagement, providing input starting from the design phase, so we can avoid the high cost of fixing bugs (and retiring features) later. When SREs contribute throughout the entire life cycle of products, we can ensure that products are being built for observability, reliability, and resilience from day one.

    Terminology Confusion: SLO or SLA?

    At companies that do not suffer financial penalties for violating Service Level Agreements (SLAs), the internal engineering team tends to use SLO and SLA interchangeably. An SLO, or service level objective, is really an internal metric for services that depend on another service. Distinguishing the two will make communication clearer when an SLA does become important (or tied to dollar-amount penalties).

    Coming Up with Meaningful SLOs - a Missing Protocol

    How would you come up with the best and most reasonable SLO for availability, latency (site speed), error rate, performance relative to traffic load, or how a service performs under stress conditions?

    You can’t, not at the beginning. It’s hard to get the team’s buy-in for an arbitrary goal unless there’s a clear mechanism for revising the goal. For example, at Home Depot, SLOs are reviewed every 6 months. Teams can revise to have tighter or looser SLOs (e.g., going from 99% availability to 99.5% or 98%). Each team at an organization can review its SLOs at a tempo that works for them. The key is to have a regular means to adjust rather than signing a lifelong commitment. (See chapter 3 in The Site Reliability Workbook for more details.)
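    To make the effect of tightening or loosening an availability SLO concrete, here is a minimal sketch that converts an SLO percentage into the amount of full downtime it permits. The 30-day window and the function name are assumptions made for illustration; they do not come from the interview.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of complete downtime a service can absorb in the window
    while still meeting the availability SLO.

    The 30-day default window is an illustrative assumption.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Compare the three availability targets mentioned in the text.
    for slo in (98.0, 99.0, 99.5):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% availability allows {minutes:.0f} minutes of downtime per 30 days")
```

    Under these assumptions, moving from 99% to 99.5% halves the permitted downtime, while loosening to 98% doubles it, which is the kind of trade-off a periodic SLO review would weigh.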

    SLO Challenge: Measuring the Business Impact of Grey Failures

    A grey failure refers to a partial failure of a system, for example, if a specific feature of LinkedIn were to stop responding only in Canada. Calculating the impact of a grey failure is difficult. The estimates are rough, the process is manual, and it’s difficult to take into account any bounce-back effect. When Amazon Prime went down on Prime Day, it’s possible that more customers came back the next day to buy more. However, it’s also possible that what customers wanted to buy had already sold out. Because it’s difficult to quantify the business impact, we currently bucket impact into 3 categories (minor, major, and critical) and prioritize accordingly.
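    The three-bucket approach can be sketched as a rough classifier. Everything below is hypothetical: the scoring formula (share of users affected multiplied by duration), the thresholds, and the names are invented for illustration and are not LinkedIn's actual method.

```python
from enum import Enum


class Impact(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# Hypothetical thresholds on "fraction of users affected x minutes".
MAJOR_THRESHOLD = 5.0
CRITICAL_THRESHOLD = 30.0


def bucket_impact(affected_fraction: float, duration_minutes: float) -> Impact:
    """Place a partial (grey) failure into one of three coarse buckets.

    affected_fraction is a rough estimate in [0, 1] of users who hit the failure.
    """
    if not 0 <= affected_fraction <= 1:
        raise ValueError("affected_fraction must be in [0, 1]")
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    score = affected_fraction * duration_minutes
    if score >= CRITICAL_THRESHOLD:
        return Impact.CRITICAL
    if score >= MAJOR_THRESHOLD:
        return Impact.MAJOR
    return Impact.MINOR


if __name__ == "__main__":
    # Hypothetical incidents: (name, estimated affected fraction, duration in minutes).
    incidents = [
        ("feature down in one region", 0.04, 60),
        ("slow search for some users", 0.10, 60),
        ("login errors for half of users", 0.50, 90),
    ]
    # Prioritize by bucket, most severe first.
    ranked = sorted(incidents, key=lambda i: bucket_impact(i[1], i[2]).value, reverse=True)
    for name, fraction, minutes in ranked:
        print(f"{bucket_impact(fraction, minutes).name}: {name}")
```

    A coarse bucket like this deliberately ignores effects that are hard to estimate, such as bounce-back, and only serves to order incidents for attention.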

    Vision for SRE

    SRE brings an ongoing emphasis and a continual drumbeat on the importance of reliability, much like what QA engineers do for unit testing. In an ideal world, every engineer will take reliability into account in everything they do.

    Written by Charlie Taylor


    Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.
