A Journey Through Blameless from Incident to Success
Written by: Dyllen Owens
Here at Blameless, every aspect of our product has SLOs (Service Level Objectives) and error budgets to help us understand and improve customer experience. Sometimes these error budgets are at risk, which triggers an incident. While incidents are often painful, we treat them as unplanned investments and strive to learn as much as we can from them. We empower all of our engineers to handle an on-call rotation, no matter how difficult the issue. I’d like to share how SLOs helped our team resolve one particular incident, which resulted in a huge product improvement.
The incident
One day, an unlucky individual was bombarded with error budget exhaustion alerts for the settings area of our application. After investigating the immediate incident, we determined that the settings page needed further engineering triage. A huge amount of our error budget was being burned in settings by the worst part of the internet: latency. The initial load of our pages was 10x slower than our target. This was entirely unacceptable.
Since our services communicate with more than 13 third-party services, we suspected that the problem lay in how we communicated with these services while populating our settings with customization options.
When we originally built settings, the page was light and minimal. The number of integrations we supported was nowhere near the number we support now. As time went on and our product matured, the number of things we wanted to empower our customers to do grew. Because of this expansion, we bolted on more code without considering whether it was the right approach. This meant doing things like sending payloads that grew well beyond what people would have considered standard, hence the incident with 10-second wait times on page loads. We immediately set out to find areas within settings that we could improve.
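To make the idea of "burning" an error budget concrete, here is a minimal sketch of how a latency error budget might be calculated. The SLO target (99%), the 1-second threshold, and the sample numbers are all assumptions for illustration; they are not our actual SLO definitions, and `LatencySlo` and `error_budget_remaining` are hypothetical names rather than part of the Blameless product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LatencySlo:
    """A latency SLO: `target` fraction of requests must finish within `threshold_s`."""
    target: float       # e.g. 0.99 means 99% of page loads must be "good"
    threshold_s: float  # a page load at or under this many seconds counts as good


def error_budget_remaining(slo: LatencySlo, latencies_s: list[float]) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not latencies_s:
        return 1.0  # no traffic, so no budget has been burned
    allowed_bad = (1.0 - slo.target) * len(latencies_s)
    actual_bad = sum(1 for t in latencies_s if t > slo.threshold_s)
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical traffic: 1,000 page loads, 30 of them far slower than the threshold.
slo = LatencySlo(target=0.99, threshold_s=1.0)
latencies = [0.4] * 970 + [10.0] * 30
remaining = error_budget_remaining(slo, latencies)
print(f"error budget remaining: {remaining:.0%}")
```

With a 99% target, only 10 of those 1,000 loads are allowed to be slow, so 30 slow loads spend the budget three times over. A result at or below zero is the kind of condition that would fire an exhaustion alert.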
The process
We created an incident within our platform for the high-latency error budget burn. Then we identified the key areas we wanted to address and used our follow-up feature to start creating Jira tickets. Here were some of the initial steps we took:
- The first area was loading data in parallel.
- The second was reducing responses to provide the front end with only the most necessary data.
- Then we focused on deferring secondary requests until after the page rendered.
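The three steps above can be sketched roughly as follows. This is a simplified illustration, not our production code: the integration names, the field names, and every function here are hypothetical, and `asyncio.sleep` stands in for real network calls to third-party services.

```python
from __future__ import annotations

import asyncio

# Hypothetical integration names and fields, used only for illustration.
INTEGRATIONS = ["chat", "ticketing", "paging", "video", "alerting"]
FIELDS_NEEDED_FOR_RENDER = ("name", "enabled")


async def fetch_integration_settings(name: str) -> dict:
    """Stand-in for a network call to one third-party service."""
    await asyncio.sleep(0.05)  # simulated network latency
    return {"name": name, "enabled": True, "raw_config": {"blob": "x" * 1000}}


def trim(payload: dict) -> dict:
    """Step 2: keep only the fields the front end needs for the first render."""
    return {key: payload[key] for key in FIELDS_NEEDED_FOR_RENDER}


async def load_settings_page(names: list[str]) -> list[dict]:
    """Step 1: fetch all integrations concurrently instead of one after another."""
    payloads = await asyncio.gather(*(fetch_integration_settings(n) for n in names))
    return [trim(p) for p in payloads]


async def load_secondary_data(name: str) -> dict:
    """Stand-in for a non-critical request that can wait until after render."""
    await asyncio.sleep(0.05)
    return {"name": name, "usage_stats": []}


async def main() -> tuple[list[dict], list[dict]]:
    # Critical path: concurrent, trimmed responses for the initial render.
    initial = await load_settings_page(INTEGRATIONS)
    # ... the page would render here using `initial` ...
    # Step 3: secondary requests start only after the initial data is ready.
    secondary = await asyncio.gather(*(load_secondary_data(n) for n in INTEGRATIONS))
    return initial, list(secondary)


if __name__ == "__main__":
    initial, secondary = asyncio.run(main())
    print(initial[0])
```

Because the five simulated calls run concurrently, the critical path takes about as long as the slowest single call rather than the sum of all of them, and the large `raw_config` blob never reaches the front end.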
After we identified all of the areas within our product we could improve, we were off to the races.
We created tickets and allocated resources. With our Zoom integration, my backend team member and I set up a war room we could drop into as we worked through the incident. As we addressed each problem, we validated that our improvements were mitigating our original incident. We quickly iterated by testing response times and setting up new error budgets on our staging environments.
After 80+ engineering hours, we finally resolved all the tickets that came out of our investigation into the incident. Load time for our heaviest settings page, which has five dynamic integrations, improved by 10x, and it all started with the discovery of an exhausted error budget within our platform.
Our goal was to improve the speed of the initial page render and the load times of our network requests, but ultimately we did more. We completely cleaned up the shape of this data within our product. We wanted to future-proof this section of our application by creating a framework in which the initial issue won’t recur as more integrations are added. That framework also naturally lends itself to further improvements.
The resolution
Once the incident was resolved, we revisited it and took the time to conduct a thorough postmortem to analyze how our product slid in this direction. Through our collaborative editing functionality, we quickly outlined what the problem was. Then we covered how we addressed this problem, and finally discussed future strategies to find bottlenecks like this earlier, in our code reviews.
After this incident, we created more aggressively targeted SLOs and error budgets to further track this issue through the customer journey. Using Blameless, we identified our critical customer issue, created tickets for tracking progress through our Scrum process, and orchestrated an area for collaboration between my backend engineer, our product owners, and me. Finally, we resolved the issue with indexable information on how we as a team will improve our processes so that our product becomes hardened as we grow.
As a recap, here are the ways we used Blameless to manage and learn from this incident:
- SLOs and error budgets to understand customer experience
- Incident resolution to determine key aspects of our product that needed to be improved and communicate easily throughout the process
- Collaborative postmortem to determine processes that would allow us to discover bottlenecks faster