So You Want an SRE Tool. Do You Build, Buy, or Open Source?
Will you buy an out-of-the-box tool, build one in-house, or work with an open source project? We’ll help you decide which solution is your best fit by breaking down the pros and cons.
By: Emily Arnott, Failure is Inevitable
As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project?
This is a big decision. Switching methods halfway through adoption is costly and can cause thrash. You’ll want to determine which method is the best fit before taking action. Each choice requires a different type of investment and offers different benefits. We’ll help you decide which solution is your best fit by breaking down the pros and cons. In this blog post, we’ll cover:
- Types of SRE tools available
- Why SRE tools can be helpful
- Pros and cons of buying, building, and open sourcing
Types of SRE Tools
There is a wealth of SRE tools available to you for different areas of SRE best practices. Let’s look at some of the most common categories:
- Monitoring and observability tools: collect and present information about your services
- SLOs and error budgeting tools: monitor your service level objectives for a holistic view of your reliability
- Alerting tools: inform designated people when monitoring detects an issue with your service
- Runbook tools: document, execute, and automate step-by-step guides
- Incident management tools: automate toil and improve communication through chatbots, checklists, and data aggregation
- Incident retrospective tools: create documents that record what you learned in an incident as well as follow-up tasks
- Chaos engineering tools: create and execute chaos experiments that simulate failure in your system
While all of these tools help operationalize some facet of SRE best practices, an end-to-end solution incorporates many of these tools and helps them talk to one another. A very important consideration when choosing between building, buying, or open sourcing an SRE tool is how these moving parts connect. You want to ensure that all the pieces in your ecosystem are working together.
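To make the SLO and error budget category above a little more concrete, here is a minimal sketch of the arithmetic such a tool performs. The function name and parameters are hypothetical, chosen only for illustration; real tools typically compute this over rolling time windows and from live monitoring data.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the desired success ratio, e.g. 0.999 for "three nines".
    The result is 1.0 when no budget is spent and goes negative once the
    budget is exhausted. All names here are illustrative assumptions.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # No traffic yet, so none of the budget has been spent.
        return 1.0
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget.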
Why SRE tools can be helpful
Some SRE tools allow you to automate parts of your reliability process. The level of automation will be affected by the tool itself. Alerting tools automate notifying incident participants. Incident management tools automate coordination during incident response, as well as data aggregation. Runbook tools automate repetitive routine processes. End-to-end solutions automate many of these processes, and more. Automation reduces toil, freeing up time and energy for more nuanced tasks.
Learning new SRE processes can be daunting. SRE tools guide you through tasks like setting SLOs, building runbooks, and creating incident retrospectives. Following these guidelines reduces cognitive toil and improves process consistency.
Tooling can help make data about your services more actionable. The reports you generate are consistent and codified, helping you identify patterns. Monitoring tools highlight and consolidate the most important information.
An ideal SRE tool would also help you make the most of your other tools. By connecting monitoring and alerting to your incident management, or syncing your retrospectives to your ticketing system, you save time and have a more holistic view of all the individual parts of your system. This makes it easier for teams to make decisions, resolve incidents, and prioritize development. In short, an ideal SRE tool would help you throughout your entire software development life cycle rather than just distinct portions of it.
Of course, SRE tools can make your service more reliable, too. By using these tools to codify SRE best practices, establish processes and guardrails, and automate repetitive tasks, you set your teams up for success. These best practices bolster reliability and encourage a blameless culture. Embracing, responding to, and learning from failure makes teams and systems stronger.
Buying, building, or open-sourcing
Now that you’ve seen what tools are available to you, you’ll need to decide how to adopt them. Your major choices are buying an out-of-the-box solution, building your own solution, or adapting an open source solution. Each has pros and cons.
Pros of buying your SRE tool
Reliability: Your vendor will maintain your solution. You’ll have an agreement guaranteeing standards of availability and security. Your vendor will have support processes to ensure your tool works as expected. This will save you from having to devote resources to keeping the tool functional.
Expanding functionality: The vendor will continue to develop the functionality of the tool. Without needing to invest your own resources, the tool will grow more useful and efficient. You can also provide your own input with feature requests to drive the product in the direction you’d like.
Built-in integrations: The vendor will likely have a variety of integrations available to source information from or communicate with. By not having to create these integrations on your own, you save time and money. Additionally, this ensures that the other tools you use can all speak with one another rather than operating in silos.
Cons of buying your SRE tool
Costs: Tooling can be costly up front, and it can be difficult to get budget for. Buying a solution is still often less costly than building your own or maintaining an open source solution. Yet the costs of building or maintaining a tool may be less visible than a purchase price, so they may never be surfaced when the options are compared.
Pros of building your SRE tool
Customizability: Because you’ll be building it to meet your specific needs, this product will fit your needs like a glove. If your needs change, you’ll have full access to adapt your tool.
Cons of building your SRE tool
Opportunity cost: Time spent working on the tool is time not spent working on other projects. You may have to devote people full-time to maintaining and upgrading the tool.
Responsibility: If the tool breaks, it’s up to you to fix it. If other internal services rely on your tool, this could mean establishing internal SLAs. Also, tribal knowledge can become a problem. If documentation becomes out of date or team members change organizations, crucial knowledge can be lost.
Building complex integrations: You will need to spend time making sure that your SRE tool is able to ingest information from your alerting and monitoring tools and send information to ticketing systems and more. These capabilities can be difficult to build out, especially for teams with other development priorities.
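As a rough illustration of what even a simple integration involves, the sketch below turns an alert payload into a ticket. Everything in it is an assumption made for the example: the alert field names (`service`, `severity`, `summary`), the priority labels, and the `Ticket` shape do not correspond to any particular monitoring or ticketing product.

```python
from dataclasses import dataclass

# Hypothetical mapping from alert severity to ticket priority.
SEVERITY_TO_PRIORITY = {"critical": "P1", "error": "P2", "warning": "P3"}


@dataclass
class Ticket:
    title: str
    priority: str
    body: str


def alert_to_ticket(alert: dict) -> Ticket:
    """Normalize an alert payload (assumed shape) into a ticket."""
    # Real payloads vary by tool, so every field gets a safe default.
    severity = str(alert.get("severity", "")).lower()
    priority = SEVERITY_TO_PRIORITY.get(severity, "P4")
    service = alert.get("service", "unknown-service")
    summary = alert.get("summary", "No summary provided")
    return Ticket(
        title=f"[{priority}] {service}: {summary}",
        priority=priority,
        body=f"Alert severity: {severity or 'unspecified'}",
    )


if __name__ == "__main__":
    ticket = alert_to_ticket(
        {"service": "checkout", "severity": "CRITICAL", "summary": "High error rate"}
    )
    print(ticket.title)
```

Even this toy version has to handle missing fields and inconsistent casing. A production integration would also need authentication, retries, deduplication, and upkeep as each upstream tool changes its payload format, which is where much of the build cost comes from.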
Pros of open sourcing your SRE tool
Cost efficiency: Open source tools are often free to use. There will be some opportunity cost to implementing the tool in your specific environment, but it can be minimal.
Adaptability: Because you have access to the source code of your tool, you’ll have the flexibility to add the features you need. However, this could have a large opportunity cost.
Cons of open sourcing your SRE tool
Security concerns: As everyone will have access to the source code of the tool, security issues could emerge. The community behind the tool will make efforts to secure the tool and fix issues, but the responsibility will be yours.
Maintenance and improvements: Updates and fixes for the tool will come from the development community. Because community members often work on these projects outside of their jobs, there could be a longer wait for improvements or additional integrations.
Here’s a chart summarizing the pros and cons of each option:
For more analysis on how to choose and acquire SRE tools, check out our buyers’ guide.
If you’ve decided to purchase an SRE tool, check out what Blameless has to offer. Sign up for a demo!