The Essential List of Top SRE Resources
Originally published on Failure is Inevitable.
Are you looking to get up to speed on SRE fundamentals with the best SRE books and best DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered in our list of essential SRE resources!
The big books
These comprehensive tomes of SRE expertise are a great place to start.
Google provides an overview of SRE implementation, covering the guiding principles that led to the organization-wide adoption of SRE, and detailing practices ranging from upper-level management to the nuances of load balancing.
The Essential Guide to SRE Best Practices
Offered by Blameless, this eBook guides you through implementing your own SRE solution and is centered around three key principles: creating a mindset of resiliency, reducing engineering problems and innovation blockers, and approaching systems from a human perspective. If you’re looking to see how SRE will work within your organization, this eBook provides solutions that are tailored rather than one-size-fits-all, and that you can begin implementing today.
This O’Reilly textbook offers the most comprehensive dive into the inner workings of an SRE solution, covering everything from the fundamental theories of SRE to a breakdown of work-as-done. A companion book, The Site Reliability Workbook, provides illustrative case studies.
If you’re more pressed for time, Principal Developer Advocate for Honeycomb Liz Fong-Jones offers a playlist of essential O’Reilly SRE resources.
Site Reliability Engineering Tools
A variety of tools have been developed to help you on your SRE journey. These guides will help you decide what best fits your needs.
Blameless Buyers’ Guide for Reliability
Offered by Blameless, this guide looks at the goals of a successful SRE solution, and discusses what features a tool should have to accomplish them. It also breaks down the pros and cons of building tooling yourself, purchasing a tool, or adapting an open-source tool.
Awesome Site Reliability Tools
Curated by SREs, this list of tools is sorted by function to help you find vendors who provide services ranging from project management tools to infrastructure and container orchestration.
This article looks at a complete cycle of development and operations and breaks down how SRE tooling could help DevOps teams at each stage.
Choosing the Right Tools when Building Your SRE Toolchain
This talk by engineers at VictorOps, Grafana, and InfluxData outlines what an SRE toolchain could look like and how to experiment with options to build a solution.
Hiring Site Reliability Engineers: Why You Need an SRE
Thinking about staffing an SRE team? Having dedicated engineers working on the long view of reliability problems is a worthy investment in your reliability. But how can you find good SREs, and what should they be doing? These articles and talks will answer these questions and more.
This SREcon talk given by Andrew Fong breaks down how Dropbox hired its SRE team, covering everything from sourcing talent to interviewing rubrics.
This guide explains the importance of investing in reliability staff and outlines how to find the perfect candidate for your first SRE role.
From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
Andrew Widdowson outlines Google’s recommendations for training your new SRE hires. Techniques such as learning opportunities, systems thinking, and imparting the philosophy of SRE help you get your team up and running.
Becoming a Certified SRE
Are you looking to step into the exciting role of SRE? These links will help you find site reliability engineering certifications and other learning opportunities.
Kubedex - How do I become a SRE?
This guide provides a concise spreadsheet of online courses in SRE topics. It builds up the SRE role from fundamental skills in Linux system administration and software development, making it the perfect guide for someone starting their career.
Site Reliability Engineering: Measuring and Managing Reliability on Coursera
Created by the Google Cloud team, this course covers the Google SRE book in an engaging guided format. Quizzes and short assignments reinforce your learning, with an optional paid certification for completion.
Site Reliability Engineering Philosophy and Culture
SRE isn’t just a set of practices and tools. The underlying philosophies of SRE motivating these practices are fundamental to making your organization truly resilient. These articles and blogs will help you embrace failure as inevitable, put aside blame, develop for resiliency, and more.
The Many Shapes of Site Reliability Engineering
This article looks at the different ways SRE can be implemented and the benefits of each on both practical and cultural levels.
What exactly is the difference between DevOps and SRE? How do you incorporate the practices of each? This presentation by Google will answer these questions and more.
Convincing Management to Invest in Reliability
This talk by Blameless co-founder Lyon Wong provides strategies for getting SRE buy-in at the level of management, VPs, and CTOs. You can also read a series of blog posts covering the topic here: management, VP level, CTO level.
This weekly newsletter curated by Lex Neva, SRE at Fastly, brings you the latest in case studies, think pieces, and SRE news.
Many links in this list were sourced from the Awesome Site Reliability Resources page. Check it out if you’d like further resources for any of these topics, or if there are other areas of SRE you’d like to explore.
If you’d like to learn more about SRE and how to begin employing best practices in your organization, feel free to reach out to us for a demo or try us out for free.