The Essential List of Top SRE Resources
Originally published on Failure is Inevitable.
Are you looking to get up to speed on SRE fundamentals with the best SRE books and best DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered in our list of essential SRE resources!
The big books
These comprehensive tomes of SRE expertise are a great place to start.
Google provides an overview of SRE implementation, covering the guiding principles behind the organization-wide adoption of SRE, and detailing practices ranging from upper-level management to the nuances of load balancing.
Offered by Blameless, this eBook guides you through implementing your own SRE solution and is centered around three key principles: creating a mindset of resiliency, reducing engineering problems and innovation blockers, and approaching systems from a human perspective. If you’re looking to see how SRE will work within your organization, this eBook provides solutions that are not one-size-fits-all and that you can begin implementing today.
This O’Reilly textbook offers the most comprehensive dive into the inner workings of an SRE solution, covering everything from the fundamental theories of SRE to a breakdown of work-as-done. A companion book, The Site Reliability Workbook, provides illustrative case studies.
Site Reliability Engineering Tools
A variety of tools have been developed to help you on your SRE journey. These guides will help you decide what best fits your needs.
Offered by Blameless, this guide looks at the goals of a successful SRE solution, and discusses what features a tool should have to accomplish them. It also breaks down the pros and cons of building tooling yourself, purchasing a tool, or adapting an open-source tool.
Curated by SREs, this list of tools is sorted by function to help you find vendors who provide services ranging from project management tools to infrastructure and container orchestration.
This article looks at a complete cycle of development and operations and breaks down how SRE tooling could help DevOps teams at each stage.
This talk by engineers at VictorOps, Grafana, and Influxdata outlines what an SRE toolchain could look like and how to experiment with options to build a solution.
Hiring Site Reliability Engineers: Why You Need an SRE
Thinking about staffing an SRE team? Having dedicated engineers working on the long view of reliability problems is a worthy investment in your reliability. But how can you find good SREs, and what should they be doing? These articles and talks will answer these questions and more.
This SREcon talk given by Andrew Fong breaks down how Dropbox hired its SRE team, covering everything from sourcing talent to interviewing rubrics.
This guide explains the importance of investing in reliability staff and outlines how to find the perfect candidate for your first SRE role.
Andrew Widdowson outlines Google’s recommendations for training your new SRE hires. Techniques such as learning opportunities, systems thinking, and imparting the philosophy of SRE help you get your team up and running.
Becoming a Certified SRE
Are you looking to step into the exciting role of SRE? These links will help you find site reliability engineering certifications and other learning opportunities.
This guide provides a concise spreadsheet of online courses in SRE topics. It builds up the SRE role from fundamental skills in Linux system administration and software development, making it the perfect guide for someone starting their career.
Created by the Google Cloud team, this course covers the Google SRE book in an engaging guided format. Quizzes and short assignments reinforce your learning, with an optional paid certification for completion.
Site Reliability Engineering Philosophy and Culture
SRE isn’t just a set of practices and tools. The underlying philosophies of SRE motivating these practices are fundamental to making your organization truly resilient. These articles and blogs will help you embrace failure as inevitable, put aside blame, develop for resiliency, and more.
This article looks at the different ways SRE can be implemented and the benefits of each on both practical and cultural levels.
What exactly is the difference between DevOps and SRE? How do you incorporate the practices of each? This presentation by Google will answer these questions and more.
This talk by Blameless co-founder Lyon Wong provides strategies for getting SRE buy-in at the level of management, VPs, and CTOs. You can also read a series of blog posts covering the topic here: management, VP level, CTO level.
This weekly newsletter curated by Lex Neva, SRE at Fastly, brings you the latest in case studies, think pieces, and SRE news.
Many links in this list were sourced from the Awesome Site Reliability Resources page. Check it out if you’d like further resources for any of these topics, or if there are other areas of SRE you’d like to explore.