A Journey Through Blameless from Incident to Success
Written by: Dyllen Owens
Here at Blameless, every aspect of our product has SLOs (Service Level Objectives) and error budgets to help us understand and improve customer experience. Sometimes these error budgets are at risk, triggering an incident. While incidents are often painful, we treat them as unplanned investments and strive to learn as much as we can from them. We empower all of our engineers to handle an on-call rotation, no matter how difficult the issue. I’d like to share how SLOs helped our team resolve a particular incident that resulted in a huge product improvement.
One day, an unlucky individual was bombarded with error budget exhaustion alerts surrounding the settings of our application. After investigating the immediate incident, we determined that the settings page needed further engineering triage. A huge amount of our error budget was being burned within our settings by the worst part of the internet: latency. The initial load of our pages was 10x slower than our target. This was entirely unacceptable.
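To make the idea of "burning" an error budget on latency concrete, here is a minimal sketch of how a latency SLO and its remaining budget could be computed. It is not how Blameless calculates budgets internally; the `LatencySLO` class, the `error_budget_remaining` function, and every number in the usage note are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencySLO:
    """Hypothetical latency SLO: `target` fraction of requests must finish within `threshold_ms`."""
    threshold_ms: float
    target: float  # e.g. 0.99 means 99% of requests must be fast enough


def error_budget_remaining(slo: LatencySLO, latencies_ms: Sequence[float]) -> float:
    """Return the unspent fraction of the error budget (negative when overspent)."""
    total = len(latencies_ms)
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been spent

    # The budget is the number of slow requests the SLO tolerates in this window.
    allowed_slow = (1.0 - slo.target) * total
    actual_slow = sum(1 for ms in latencies_ms if ms > slo.threshold_ms)

    if allowed_slow == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_slow == 0 else float("-inf")
    return 1.0 - actual_slow / allowed_slow
```

With an illustrative target of 90% of loads under 1,000 ms, a window of 100 page loads in which 5 are slow leaves roughly half of the budget, while 20 slow loads out of 110 pushes the result below zero, which is the kind of condition that would trigger an exhaustion alert.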
Since our services communicate with more than 13 third-party services, we suspected that the problem lay in how we communicated with these services while populating our settings with customization options.
When we originally built settings, the page was light and minimal. The number of integrations we supported was nowhere near the number we support now. As time went on and our product matured, the number of things we wanted to empower our customers to do grew. Due to this expansion, we bolted on more code without considering whether it was the right approach. This meant doing things like sending payloads that increasingly exceeded what people would have considered standard, hence the incident with 10-second wait times on page loads. We immediately set out to find areas within settings that we could improve.
We created an incident within our platform around the high-latency error budget reduction. Then we identified the key areas we wanted to address and used our follow-up feature to start creating Jira tickets. Here were some of the initial steps we took:
- The first area was loading data in parallel.
- The second was response reduction to provide the front end with only the most necessary data.
- Then we focused on offloading secondary requests until after the page rendered.
After we identified all of the areas within our product we could improve, we were off to the races.
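The first two areas in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the integration names, the `fetch_integration_settings` stand-in (which sleeps instead of calling a real third-party API), and the fields kept by `trim` are all hypothetical.

```python
import asyncio
import time

# Hypothetical integration names; the real list is not part of this sketch.
INTEGRATIONS = ["chat", "ticketing", "video", "paging", "monitoring"]


async def fetch_integration_settings(name: str) -> dict:
    """Stand-in for a third-party call; sleeps instead of doing network I/O."""
    await asyncio.sleep(0.05)
    return {"name": name, "enabled": True, "raw_metadata": "x" * 1000}


def trim(payload: dict) -> dict:
    """Response reduction: keep only the fields needed for the first render."""
    return {"name": payload["name"], "enabled": payload["enabled"]}


async def load_sequential() -> list:
    """Old shape: each integration waits for the previous one to finish."""
    return [trim(await fetch_integration_settings(n)) for n in INTEGRATIONS]


async def load_parallel() -> list:
    """New shape: all integration calls are in flight at the same time."""
    payloads = await asyncio.gather(
        *(fetch_integration_settings(n) for n in INTEGRATIONS)
    )
    return [trim(p) for p in payloads]


def timed(coro_fn):
    """Run an async loader and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

Calling `timed(load_sequential)` and `timed(load_parallel)` returns the same trimmed data, but the parallel version takes about as long as the slowest single call rather than the sum of all of them. The third area, deferring secondary requests until after render, is mostly a front-end concern and is not shown here.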
We created tickets and allocated resources. With our Zoom integration, my backend team member and I set up a war room we could drop into as we worked through the incident. As we addressed each problem, we validated that our improvements were mitigating our original incident. We quickly iterated by testing response times and setting up new error budgets on our staging environments.
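A validation step like the one described above could look something like the following sketch, which compares the 95th-percentile load time before and after a change against a target. The choice of p95, the nearest-rank percentile method, and the names `percentile` and `validate_fix` are assumptions made for illustration; they are not taken from our tooling.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def validate_fix(before_ms: Sequence[float], after_ms: Sequence[float], target_ms: float) -> dict:
    """Summarize whether a change brought p95 load time under the target."""
    p95_before = percentile(before_ms, 95)
    p95_after = percentile(after_ms, 95)
    return {
        "p95_before_ms": p95_before,
        "p95_after_ms": p95_after,
        "speedup": p95_before / p95_after,
        "meets_target": p95_after <= target_ms,
    }
```

Running a check like this on staging after each ticket makes it easy to see which change moved the number and whether the page is back within its target.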
After 80+ engineering hours, we finally resolved all the tickets uncovered by our investigation into the incident. Our heaviest settings page, with five dynamic integrations, now loads 10x faster, and it all started with an exhausted error budget surfaced within our platform.
Our goal was to improve the speed of the initial page render and the load times of our network requests, but ultimately we did more. We completely cleaned up how this data looked within our product. We wanted to future-proof this section of our application by creating a framework where, as more integrations are added, the initial issue won’t occur again. This also naturally lent itself to a framework built for further improvements.
Once the incident was resolved, we revisited it and took the time to conduct a thorough postmortem to analyze how our product slid in this direction. Through our collaborative editing functionality, we quickly outlined what the problem was. Then we covered how we addressed it, and finally discussed strategies for finding bottlenecks like this earlier, in our code reviews.
After this incident, we created more aggressively targeted SLOs and error budgets to further track this issue through the customer journey. Using Blameless, we identified our critical customer issue, created tickets to track progress through our Scrum process, and orchestrated an area for collaboration between my backend engineer, our product owners, and me. We then resolved the issue with indexable information on how we as a team will improve our processes so that our product becomes hardened as we grow.
As a recap, here are the ways we used Blameless to manage and learn from this incident:
- SLOs and error budgets to understand customer experience
- Incident resolution to determine key aspects of our product that needed to be improved and communicate easily throughout the process
- Collaborative postmortem to determine processes that would allow us to discover bottlenecks faster