Blameless' SRE Journey


This is our story of how our SRE practice grew at Blameless, including insights on testing, incidents, on-call, and more.


    SRE is a practice adopted by best-in-class companies all over the world. As a software reliability platform purpose-built for SREs, Blameless strives to practice what we preach and utilizes SRE best practices daily to cultivate a culture of resilience.

    However, this wasn’t always the case. In the early days of our company’s history (like many other companies at the beginning of their journeys), we often needed to move fast without looking through the lens of reliability, prioritizing feature development and product-market fit over scalability and resilience. As you can imagine, this wasn’t sustainable, and we needed to make a change.

    In this post, we will share our SRE journey and how we operationalized the best practices we hold dear.

    The initial pain

    Blameless’ founding goal was to build a set of features that would solve a real customer pain point in an unprecedented way. We needed to move rapidly to accomplish this and made architectural and product decisions to optimize for this goal. This resulted in a significant amount of technical debt, a common issue for a high-growth startup. The engineering team was bogged down with reactive, unplanned work, and did not have sufficient capacity to fix these issues to increase reliability and stability. Our engineering team was burning out, and our customers were also unhappy.

    Something needed to change, so CEO Ashar Rizqi halted all current feature development. In Ashar’s words, “You can't improve what you can't measure. We needed to objectively prove that we are a reliable platform because that's core to establishing trust in our customers. Vulnerability is the most important thing. The second is transparency.”

    This change allowed the engineering team to invest their efforts in fixing technical debt and reliability. At this point, Blameless didn’t have an SRE program or team, so we decided it was time for us to become customer zero of our own product.

    Practice makes perfect

    Blameless as customer zero meant that all new features were vetted by us for two weeks before we began to roll them out to customers. Any incident that arose within the Blameless instance was treated as a customer incident, and also processed with Blameless — all the way from the incident management capabilities to the postmortem. We also applied the same standards and decision making for Blameless as we would for our other customers. Key changes due to this new way of thinking included:

    • Setting up our own SLOs inside Blameless using integrations like Prometheus, and watching our dashboard (see the sketch after this list).
    • Weekly operational review meetings where we looked at key user journeys, the associated SLOs, and the SLO statuses.
    • Setting error budget policies for those SLOs and tracking them.
    • Getting buy-in from respective component owners to commit to changing their sprints if we violated our error budget.
    • Mandating that any regression from customer expectations would be considered a high severity incident requiring immediate attention.
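
    To make the SLO and error budget bullets concrete, here is a minimal sketch of what tracking an availability SLO against a Prometheus integration could look like. The Prometheus address, metric names, and the 99.9% target are illustrative assumptions, not our actual configuration.

```python
# A minimal sketch of tracking an availability SLO and its error budget
# against Prometheus. The Prometheus URL, metric names, and the 99.9%
# target are illustrative assumptions, not Blameless' actual setup.
import requests

PROMETHEUS_URL = "http://prometheus:9090"   # assumed address
SLO_TARGET = 0.999                          # assumed 99.9% availability target
WINDOW = "30d"                              # rolling SLO window

def query(promql: str) -> float:
    """Run an instant PromQL query and return the first scalar result."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

# SLI: ratio of successful (non-5xx) requests over the window.
sli = query(
    f'sum(rate(http_requests_total{{code!~"5.."}}[{WINDOW}]))'
    f' / sum(rate(http_requests_total[{WINDOW}]))'
)

# Error budget: how much of the allowed failure rate is already spent.
budget_spent = (1 - sli) / (1 - SLO_TARGET)
print(f"SLI over {WINDOW}: {sli:.5f}")
print(f"Error budget consumed: {budget_spent:.0%}")

# Per our error budget policy, exceeding the budget triggers the commitment
# from component owners to reprioritize their sprints.
if budget_spent >= 1.0:
    print("Error budget exhausted: reliability work takes priority.")
```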

    We also began setting KPIs for both the software development and SRE functions, such as the number of production deployments, lines of code changed, commits per deployment, and number of regressions (which were prioritized in operational review). These changes required a big divergence from how our teams were structured and operating previously.

    Taking the questioning out of QA & testing

    In addition to setting KPIs and becoming our own customer zero, we knew we needed a better system for QA in order to provide our customers with the reliability they expect. We set up a formal quality assurance program internally. We built out the QA team, whose job was to make sure that we hit a certain quality bar for our product. And, of course, we set up another team whose job was to automate toil-heavy QA processes.

    In the past, a developer would write a piece of code, merge it into the DEV branch and then the main branch, and wait a week for a manual QA tester to run an end-to-end test. Then, if QA found an issue, the team would open a ticket, adding further delay.

    Ashar and the team decided to move away from this process. Ashar stated, “If you are building the feature, and you are writing the code, then you are the one who will be held accountable for the quality of that code.”

    Our team succeeded in making this change, and we were excited by the results. We started moving drastically faster as developers began finding errors before turning in code to QA, eliminating the lengthy turnaround. Additionally, the manual QA team was freed up to automate away toil and focus on more important testing.
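
    As an illustration of this shift-left approach, a developer might gate their own merges on an automated end-to-end smoke test along the lines of the sketch below. The staging URL and API endpoints are hypothetical placeholders, not Blameless’ actual API.

```python
# A minimal pytest-style smoke test a developer might run (and CI might gate
# merges on) before code ever reaches QA. The base URL and endpoints are
# hypothetical placeholders, not Blameless' actual API.
import requests

BASE_URL = "https://staging.example.com"  # assumed staging environment

def test_health_endpoint_is_up():
    resp = requests.get(f"{BASE_URL}/healthz", timeout=5)
    assert resp.status_code == 200

def test_incident_creation_round_trip():
    # Create an incident, then read it back: the end-to-end path the
    # weekly manual QA pass used to cover.
    created = requests.post(
        f"{BASE_URL}/api/incidents",
        json={"title": "smoke test", "severity": "SEV2"},
        timeout=5,
    )
    assert created.status_code == 201
    incident_id = created.json()["id"]

    fetched = requests.get(f"{BASE_URL}/api/incidents/{incident_id}", timeout=5)
    assert fetched.status_code == 200
    assert fetched.json()["title"] == "smoke test"
```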

    Incidents reimagined

    Another milestone in our reliability journey was reimagining our incident management and postmortem processes, and setting new KPIs. Rather than focusing on resolution time, we began looking at time to action. However, Blameless was not yet equipped with this capability. As customer zero, we took our own requests seriously, and this resulted in the “Check-In” button. This button prompts incident participants to check into the war room. The time between the beginning of the incident and check-in became our measurement for time to action.

    This feature lets us see how long it takes us to respond to an incident, and where the gaps are occurring when the response time is too long. Our KPIs for this time range are (a measurement sketch follows the list):

    • <5 minutes for SEV 0
    • <30 minutes for SEV 1
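
    Measuring against these targets can be as simple as comparing the incident start and check-in timestamps, as in the sketch below; the field names and example timestamps are assumptions for illustration.

```python
# A minimal sketch of measuring "time to action" as the gap between the
# start of an incident and the first responder check-in, compared against
# the per-severity targets above. Field names and timestamps are assumptions.
from datetime import datetime, timedelta

# Targets from our KPIs: SEV 0 within 5 minutes, SEV 1 within 30 minutes.
TIME_TO_ACTION_TARGETS = {
    "SEV0": timedelta(minutes=5),
    "SEV1": timedelta(minutes=30),
}

def time_to_action(started_at: datetime, checked_in_at: datetime) -> timedelta:
    return checked_in_at - started_at

def meets_target(severity: str, tta: timedelta) -> bool:
    target = TIME_TO_ACTION_TARGETS.get(severity)
    return target is not None and tta <= target

# Example: a hypothetical SEV0 incident with a 3.5-minute check-in.
started = datetime(2020, 6, 1, 14, 0, 0)
checked_in = datetime(2020, 6, 1, 14, 3, 30)
tta = time_to_action(started, checked_in)
print(f"Time to action: {tta}, within target: {meets_target('SEV0', tta)}")
```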

    Beyond incident management, we also set KPIs for our new postmortem process. At Blameless, for every production incident, we require a 100% completion rate for the postmortem survey and a 100% completion rate for the resulting action items. Filling out the survey, in particular, has a strict SLA around it. Rather than spending more time on creating postmortems (especially for minor incidents), we created our survey function in Blameless, which is highly customizable and homes in on the key questions. We put our survey responses into our big data analytics product, which quickly bubbles up key insights to inform engineering decision making. This also helps to streamline the process of starting to build a narrative. As Ashar states, “The writeup can always happen later, but we want a 100% completion rate on this survey while the memory is still fresh.”
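
    A minimal sketch of how these completion-rate KPIs could be tracked is shown below; the incident records and field names are hypothetical, standing in for data pulled from Blameless.

```python
# A minimal sketch of tracking the 100% postmortem-survey and action-item
# completion KPIs across production incidents. The records and field names
# are hypothetical; real data would come from Blameless exports.
incidents = [
    {"id": "INC-101", "survey_complete": True,  "action_items_done": True},
    {"id": "INC-102", "survey_complete": True,  "action_items_done": False},
    {"id": "INC-103", "survey_complete": False, "action_items_done": False},
]

def completion_rate(records, field):
    """Fraction of incidents where the given boolean field is satisfied."""
    return sum(1 for r in records if r[field]) / len(records)

survey_rate = completion_rate(incidents, "survey_complete")
actions_rate = completion_rate(incidents, "action_items_done")

print(f"Survey completion: {survey_rate:.0%} (target 100%)")
print(f"Action item completion: {actions_rate:.0%} (target 100%)")
```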

    Empowering on-call

    Prior to our on-call overhaul, there was a lot of anxiety surrounding carrying the pager. Developers didn’t feel empowered or knowledgeable enough to respond and troubleshoot effectively during an incident. This was the signal that our on-call protocols needed to change. The team began to train for on-call, and created runbooks within the Blameless platform. The difference was night and day, and now all engineers are part of the on-call rotation.

    “The idea there was to give our team members encouragement that they can own the troubleshooting of their services, including infrastructure,” Ashar said. The team, led by Moiz Virani, also implemented better practices for the documentation and handoffs for the on-call process. Now, the on-call staff member creates an on-call incident within Blameless where they track all issues and activities during the on-call shift. The postmortem for that on-call incident becomes the complete and detailed handoff for the next person coming in, giving them a confidence boost at the beginning of their shift.

    The SRE dream team

    This story wouldn’t be complete without the mention of our Blameless SREs. In the earlier days, Blameless didn’t have a dedicated SRE team. But as the company shifted its focus toward delivering a rock-solid service, we formed a dedicated SRE crew (later including some key additional hires, like Amy Tobey). When we put the practice in place, an important distinction was made between our development and SRE teams.

    The SRE team would not be responsible for production services; instead, it was only responsible for the SRE frameworks. Our SREs don’t determine what the dev team’s SLO is going to be, but they are responsible for guiding devs through the process of setting up the SLO and making sure the postmortems are being completed.

    Ashar put it best when he said, “SREs are not going to necessarily resolve incidents for you, but they will be the catalyst to make sure that SRE best practices are being obeyed throughout the process.”

    Initially, the SRE team’s main focus was to help set up the SLOs for the most critical user journeys in Blameless. During this time, the SRE team also owned infrastructure engineering, the monitoring systems, observability platforms, and key decision making in terms of reliability and tooling.

    After the team laid down this groundwork, the role evolved into being the caretakers of reliability here at Blameless. That means our SREs have two major focuses:

    1. Making sure we consistently track all of our key reliability KPIs.
    2. Governing our reliability practices and making sure our people are disciplined about following those practices.

    These big changes have been a success, yielding significant business impacts for us.

    Blameless today

    After implementing the above practices in less than two months, we’ve been able to deliver a more reliable platform while improving the health of our systems and teams. Both our developers and customers noticed the difference and responded with overwhelming positivity to these changes. Our customers even began emailing us to express their happiness with the platform’s now rock-solid reliability. The month after our transformation kicked off, we did the highest number of deployments to date, had zero regressions, significantly reduced the number of incidents, and improved our customer satisfaction.

    One of our engineers, Dyllen, was so excited by this massive and speedy overhaul that he wrote his own story of Blameless’ journey. According to Dyllen, “By using Blameless, we identified our critical customer issues, created tickets for tracking progress through our SCRUM process, orchestrated an area for collaboration between my backend engineer, our product owners, and myself. We finally resolved the issue with indexable information on how we as a team will improve our processes to ensure that our product becomes hardened through our growth.”

    These changes also gave our leadership and board more confidence. According to Ashar, “At a business outcome level, what we have now is confidence in the ability to move faster. We have this laser sharp focus. We know what we need to build and focus on. We know how much we can push and what the output of that push is going to be.”

    Our SRE journey in conclusion

    In summary, implementing an SRE practice at Blameless has transformed our business in the following ways:

    • Happier, more productive engineers
    • More confidence in handling on-call
    • Better customer experience
    • Increased platform reliability
    • Focus and alignment on prioritizing engineering work
    • More confidence from our investors and board

    If your team is kicking off its own journey to SRE and would like some help on where to invest first, we’re here to help. Contact us.

