
The Essential List of Top SRE Resources


    Originally published on Failure is Inevitable.

    Are you looking to get up to speed on SRE fundamentals with the best SRE books and best DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered in our list of essential SRE resources!

    The big books

    These comprehensive tomes of SRE expertise are a great place to start.

    Google’s SRE Book

    Google provides an overview of SRE implementation, covering the guiding principles that led to the organization-wide adoption of SRE, and detailing practices ranging from upper-level management to the nuances of load balancing.

    The Essential Guide to SRE Best Practices

    Offered by Blameless, this eBook guides you through implementing your own SRE solution and is centered around three key principles: creating a mindset of resiliency, reducing engineering problems and innovation blockers, and approaching systems from a human perspective. If you’re looking to see how SRE will work within your organization, this eBook provides solutions that are not one-size-fits-all and that you can begin implementing today.

    Site Reliability Engineering

    This O’Reilly textbook offers the most comprehensive dive into the inner workings of an SRE solution, covering everything from the fundamental theories of SRE to a breakdown of work-as-done. A companion book, The Site Reliability Handbook, provides illustrative case studies.

    If you’re more pressed for time, Principal Developer Advocate for Honeycomb Liz Fong-Jones offers a playlist of essential O’Reilly SRE resources.

    Site Reliability Engineering Tools

    A variety of tools have been developed to help you on your SRE journey. These guides will help you decide what best fits your needs.

    Blameless Buyers’ Guide for Reliability

    Offered by Blameless, this guide looks at the goals of a successful SRE solution, and discusses what features a tool should have to accomplish them. It also breaks down the pros and cons of building tooling yourself, purchasing a tool, or adapting an open-source tool.

    Awesome Site Reliability Tools

    Curated by SREs, this list of tools is sorted by function to help you find vendors whose services range from project management tools to infrastructure and container orchestration.

    The Best SRE Tools

    This article looks at a complete cycle of development and operations and breaks down how SRE tooling could help DevOps teams at each stage.

    Choosing the Right Tools when Building Your SRE Toolchain

    This talk by engineers at VictorOps, Grafana, and Influxdata outlines what an SRE toolchain could look like and how to experiment with options to build a solution.

    Hiring Site Reliability Engineers: Why You Need an SRE

    Thinking about staffing an SRE team? Having dedicated engineers working on the long view of reliability problems is a worthy investment in your reliability. But how can you find good SREs, and what should they be doing? These articles and talks will answer these questions and more.

    SRE Hiring

    This SREcon talk given by Andrew Fong breaks down how Dropbox hired its SRE team, covering everything from sourcing talent to interviewing rubrics.

    Hiring Your First SRE

    This guide explains the importance of investing in reliability staff and outlines how to find the perfect candidate for your first SRE role.

    From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams

    Andrew Widdowson outlines Google’s recommendations for training your new SRE hires. Techniques such as learning opportunities, systems thinking, and imparting the philosophy of SRE help you get your team up and running.

    Becoming a Certified SRE

    Are you looking to step into the exciting role of SRE? These links will help you find site reliability engineering certifications and other learning opportunities.

    Kubedex - How do I become a SRE?

    This guide provides a concise spreadsheet of online courses in SRE topics. It builds up the SRE role from fundamental skills in Linux system administration and software development, making it the perfect guide for someone starting their career.

    Site Reliability Engineering: Measuring and Managing Reliability on Coursera

    Created by the Google Cloud team, this course covers the Google SRE book in an engaging guided format. Quizzes and short assignments reinforce your learning, with an optional paid certification for completion.

    Site Reliability Engineering Philosophy and Culture

    SRE isn’t just a set of practices and tools. The underlying philosophies of SRE motivating these practices are fundamental to making your organization truly resilient. These articles and blogs will help you embrace failure as inevitable, put aside blame, develop for resiliency, and more.

    The Many Shapes of Site Reliability Engineering

    This article looks at the different ways SRE can be implemented and the benefits of each on both practical and cultural levels.

    SRE vs. DevOps

    What exactly is the difference between DevOps and SRE? How do you incorporate the practices of each? This presentation by Google will answer these questions and more.

    Convincing Management to Invest in Reliability

    This talk by Blameless co-founder Lyon Wong provides strategies for getting SRE buy-in at the level of management, VPs, and CTOs. You can also read a series of blog posts covering the topic here: management, VP level, CTO level.

    SRE Weekly

    This weekly newsletter curated by Lex Neva, SRE at Fastly, brings you the latest in case studies, think pieces, and SRE news.

    Everything Else

    Many links in this list were sourced from the Awesome Site Reliability Resources page. Check it out if you’d like further resources on any of these topics, or if there are other areas of SRE you’d like to explore.

    If you’d like to learn more about SRE and how to begin employing best practices in your organization, feel free to reach out to us for a demo or try us out for free.


    Blameless

    Blameless is the industry's first end-to-end SRE platform, empowering teams to optimize the reliability of their systems without sacrificing innovation velocity.
