How We Use Blameless to Power Remote Work
Here are some of the top workflows and tips on how we have been using Blameless internally to streamline remote productivity.
Originally published on Failure is Inevitable.
Like many other companies, the Blameless team is adapting to a world of remote work in which distributed teams need to get better than ever at staying aligned and efficient. We’ve been relying on Blameless more and more to improve how we collaborate virtually.
Customer and Engineering Incidents
There seems to be only one guarantee in this world: incidents will happen. It’s not a question of if, but when. To be prepared for both customer and engineering incidents, we use Blameless’ automated incident resolution and incident retrospective capabilities to handle issues as they come and learn from them, improving our reliability and development velocity, all while our teams work remotely.
Automated incident response
Automating the toil out of incident response is critical for improved MTTR, and essential when working remotely. Rather than having a home base or war room where our engineers can gather and whiteboard through an incident, we’ve been relying on virtual war rooms and Slack channels in order to communicate. With Blameless, this process is instantaneous.
Our intelligent chatbot automates incident coordination (such as spinning up a Slack channel and Zoom room) along with key tasks and workflows, and lets us see who has checked in to an incident, giving full visibility into the team members participating. Additionally, the chatbot collects key details during the incident, such as role assignments for incident commander or communications lead.
Beyond the chatbot function, we also use Blameless’ runbook automation to standardize incident processes with responses and runbooks for different incident types. This way, even though our teams are currently distributed, our process remains consistent, allowing us to resolve incidents faster and create more thorough incident retrospectives.
After an incident, it’s important for our team to dig into why an incident occurred, and work to learn from the experience. As we are all remote, it’s even more important to create a comprehensive narrative and get the information about the incident from team members’ brains onto paper. Blameless makes this process painless for us by driving asynchronous collaboration.
Our automated timeline associated with the incident logs all important actions taken during the resolution process, meaning our engineers don’t have to dig through multiple conversation logs to get a clear picture of what occurred when.
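To make the idea of a single incident timeline concrete, here is a minimal sketch of merging events from several sources into one chronological record. It is an illustration only, not Blameless’ actual data model or API: the `TimelineEvent` fields, the source names, and the `build_timeline` helper are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in an incident timeline (hypothetical shape)."""
    at: datetime   # when the action happened
    source: str    # e.g. "slack", "chatbot", "pager" (assumed names)
    actor: str     # who took the action
    summary: str   # short description of what happened


def build_timeline(*streams: Iterable[TimelineEvent]) -> List[TimelineEvent]:
    """Merge events from any number of sources into chronological order."""
    merged = [event for stream in streams for event in stream]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda event: event.at)


def utc(hour: int, minute: int) -> datetime:
    """Helper for the example data below: a fixed day, timezone-aware."""
    return datetime(2020, 4, 1, hour, minute, tzinfo=timezone.utc)


chat_events = [
    TimelineEvent(utc(10, 5), "slack", "alice", "Rolled back release"),
    TimelineEvent(utc(10, 1), "slack", "bob", "Reported elevated error rate"),
]
bot_events = [
    TimelineEvent(utc(10, 2), "chatbot", "alice", "Assigned incident commander"),
]

timeline = build_timeline(chat_events, bot_events)

if __name__ == "__main__":
    for event in timeline:
        print(f"{event.at:%H:%M} [{event.source}] {event.actor}: {event.summary}")
```

Keeping every event in one sorted list is what spares responders from cross-referencing separate conversation logs afterwards.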
Once we have the timeline, we move to our analysis page and begin to write personal narratives of what occurred. The editing and commenting functions come in handy here, as multiple team members can log into the retrospective and begin adding context while the information is still at the forefront of their minds. Even better, one person doesn’t have the sole task of gathering everyone’s information and completing this doc all on their own. Our retrospectives are team oriented, and can be done from our living room couches.
Lastly, we use Blameless to track our follow-up action items. These are often forgotten, or lost during the retrospective process. With our tracking abilities, we have full visibility into the progress our team makes on completing these. That way, no critical tasks are left behind and the team can be on the same page about the status of action items without having to message back and forth in distributed channels.
Releases and Deployments
Our team also uses Blameless for releases and deployments. This helps standardize and document our processes so they are repeatable. Especially in these times of WFH, it can be difficult to ensure that all teams are following the same protocols for releases and deployments. With Blameless, we can make sure all teams are on the same page.
At the beginning of a deployment, we create an incident for the push in our platform. We then assign roles (commander, communications lead, etc.) and invite the team members engaged in this deployment to check in to the incident. This includes alerting CRE and QA teams. Blameless automatically documents the change or version, and creates a task list for team members to work through in order to complete the push, monitor impact, and take action as needed.
Our checklist is also captured by our chatbot and added to the retrospective timeline. This lets our team look back at the process, see who completed which tasks, and create a more thorough record of all our deployments.
Error Budget Depletion
We also use Blameless to help monitor our error budgets. With the increased demand on digital services to keep the world up and running during this time, many systems are under significant strain. It’s important to know when our system is reaching a crucial point where human intervention is required in order to keep our customers happy.
Our team regularly watches our error budget depletion dashboard, just to make sure everything is ship-shape. However, if we receive an alert that our budget is being consumed faster than usual, or that it has reached our alerting threshold, an incident is triggered with an automated runbook. Our team then uses our checklist to work through the incident, assigning roles, assessing impact, and eventually creating a meeting with the involved team to conduct an incident retrospective.
If all is fine, and our error budget is depleting at a normal cadence in accordance with our 28-day window, we also use our error budget dashboard as a way to determine whether or not we can ship new features, or if the increase in demand requires us to focus more on reliability work. With our teams distributed, this basis for decision making is crucial. It gives our team a common language to use when discussing the impact of a particular deployment.
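The arithmetic behind this kind of decision can be sketched in a few lines. The example below is a rough illustration under stated assumptions, not how Blameless computes error budgets: it assumes a time-based SLO measured in "bad minutes", and the function name, the 99.9% target, the 80% alerting threshold, and the 2x burn-rate limit are all hypothetical values chosen for the example. Only the 28-day window comes from the text above.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class BudgetStatus:
    consumed: float   # fraction of the window's error budget already spent
    burn_rate: float  # 1.0 means on pace to spend exactly the whole budget
    action: str       # "open_incident", "focus_on_reliability", or "ship_features"


def assess_error_budget(
    slo_target: float,
    bad_minutes: float,
    days_elapsed: float,
    window_days: int = 28,         # window length mentioned in the article
    alert_threshold: float = 0.8,  # assumed: alert at 80% of budget spent
    max_burn_rate: float = 2.0,    # assumed: alert when burning 2x faster than pace
) -> BudgetStatus:
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    if not 0.0 < days_elapsed <= window_days:
        raise ValueError("days_elapsed must be within the window")

    # Total unreliability the SLO tolerates across the whole window.
    budget_minutes = (1.0 - slo_target) * window_days * MINUTES_PER_DAY
    consumed = bad_minutes / budget_minutes
    # Compare budget spent against the share of the window that has passed.
    burn_rate = consumed / (days_elapsed / window_days)

    if consumed >= alert_threshold or burn_rate >= max_burn_rate:
        action = "open_incident"
    elif burn_rate > 1.0:
        action = "focus_on_reliability"
    else:
        action = "ship_features"
    return BudgetStatus(consumed, burn_rate, action)


if __name__ == "__main__":
    # Halfway through the window with 10 bad minutes against a 99.9% SLO.
    print(assess_error_budget(slo_target=0.999, bad_minutes=10, days_elapsed=14))
```

With a 99.9% target, a 28-day window allows roughly 40 minutes of unreliability, so the same number of bad minutes can mean "keep shipping" late in the window and "open an incident" early in it. That shared number is what gives a distributed team a common basis for the ship-or-stabilize conversation.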
While not all companies are security companies, each company needs to take security seriously, and Blameless helps our teams do that. We use Blameless to create incidents for security compliance initiatives we undertake. This allows us to loop the entire team into the incident, automatically create a progress timeline, and keep track of the action items that will make our platform more secure. Though our teams are distributed, we can easily move forward towards the same goal.
Working remotely has been a challenge, and one that we, like many companies, were not completely prepared for prior to the current circumstances. However, using Blameless, we have been able to remain productive by establishing guardrails around certain processes to streamline communication, collaborate well, and keep our customers happy. It wouldn’t be possible without Blameless and the many other digital tools we rely on to keep us in sync.