4 Things You Need to Know About Writing Better Production Readiness Checklists
When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in fields such as surgery and aviation for almost a century. They owe this long and widespread adoption to their usefulness.
Checklists can also help limit errors when deploying code to production. In this blog post, we’ll cover:
- How to make a production checklist
- Why production checklists are helpful
- Keeping your checklist up to date
- How Blameless can help integrate your checklists
How to make a production checklist
Production checklists should be holistic. They should cover everything from launch logistics to contingency plans for failure. Let’s break down what you’ll need for a thorough checklist.
1. Determine the service level of what you’re launching
To determine how thorough your checklist should be, consider what level of reliability your customers need. You may be tempted to be as comprehensive as possible with every checklist, but that costs time and may be unnecessary. At Mercari, the service level is determined based on the service’s SLO. Services that are critical to business success are scrutinized more than niche services.
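As a minimal sketch of this idea, the function below maps a service's SLO target to a checklist tier. The tier names and thresholds are illustrative assumptions, not Mercari's actual values.

```python
def checklist_tier(slo_target: float) -> str:
    """Pick a checklist depth from an availability SLO target (a fraction such as 0.999).

    The thresholds and tier names are hypothetical; tune them to your own services.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    if slo_target >= 0.999:
        return "full"      # business-critical: review every checklist area
    if slo_target >= 0.99:
        return "standard"  # core areas only
    return "light"         # niche service: minimal checklist


print(checklist_tier(0.9995))
print(checklist_tier(0.95))
```

Tying the tier to a number you already track keeps the decision about checklist depth from being re-argued at every launch.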
2. Map out all the checklist areas
List all major components of your service. These components may be under the ownership of various teams. For example, you’ll likely need to consult server management teams, testing teams, and many others. It’s important to know as soon as possible whom you’ll need to consult. Some areas to consider include:
- Server-side: What machines will this service run on? If you’re cloud-based, will your plan cover the new service’s load?
- Client-side: Is your service usable for all potential clients?
- Monitoring: Do you have ways of collecting data from your new service?
- Growth: Do you have a roadmap for how you will maintain or improve the service going forward? What if usage increases? What if you need to expand functionality?
- Dependencies: What other in-house and third-party services does your service depend on? Will they integrate smoothly?
- Testing: Has the new service been tested in an environment mirroring production?
- Security: Will your new service pass your security audits?
- Reliability: What level of reliability will your users expect? Do you have a plan for when you are unable to meet these expectations?
- Incident response: What will you do if an incident causes service interruption or degradation? Do you have runbooks to cover these incidents?
- Legal: Do you have an SLA that guarantees availability? Does this service deal with personal information that must be kept secure?
- Logistics: What is the launch schedule? What resources will you need?
For more examples of areas to consider, check out Google’s Launch Coordination Checklist, gruntwork.io’s AWS checklist, or Mercari’s checklists.
3. Prepare the checklist items
Each of these areas contains many questions, and each question requires data to answer. Your checklist should ask for each piece of data. Here’s an example of how certain sections could be broken down:
You may also want to note whom to consult to check off each item and when you expect to be able to check it. Build your checklist and check items off as development progresses. Double-check that completed items are truly ready to go. Right before launch, do a final pass through the whole list, just in case.
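One way to keep track of items, owners, and timeframes is a small data structure. The sketch below is hypothetical: the field names, the sample questions, the teams, and the dates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChecklistItem:
    area: str          # checklist area, e.g. "Monitoring" or "Security"
    question: str      # the concrete thing to verify
    owner: str         # whom to consult to check this item off
    due: date          # when the item is expected to be checkable
    done: bool = False


def open_items(items: list[ChecklistItem]) -> list[ChecklistItem]:
    """Return the items not yet checked off, earliest due date first."""
    return sorted((i for i in items if not i.done), key=lambda i: i.due)


def ready_to_launch(items: list[ChecklistItem]) -> bool:
    """Final pre-launch pass: every item must be checked off."""
    return not open_items(items)


# Hypothetical sample data
checklist = [
    ChecklistItem("Monitoring", "Are dashboards and alerts set up?",
                  "observability team", date(2024, 5, 1), done=True),
    ChecklistItem("Security", "Has the security audit passed?",
                  "security team", date(2024, 5, 10)),
    ChecklistItem("Testing", "Were tests run in a production-like environment?",
                  "QA team", date(2024, 5, 3)),
]

for item in open_items(checklist):
    print(f"[{item.area}] {item.question} (ask: {item.owner}, due {item.due})")
print("Ready to launch:", ready_to_launch(checklist))
```

Sorting open items by due date gives the team a running view of what blocks the launch next.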
Keeping the checklist in check
As you develop, you’ll likely find more areas you want to vet prior to launch. To keep your checklist from becoming too long, you’ll need a system to make sure new additions are helpful. At Google, teams have two criteria for adding an item to the checklist:
- “Every question’s importance must be substantiated, ideally by a previous launch disaster.”
- “Every instruction must be concrete, practical, and reasonable for developers to accomplish.”
You can set your own criteria based on the service level you’ve assigned. It’s better to have an unnecessary item than to lack one you need. It’s okay to start with a big checklist and then, after each launch, remove the items that proved not to be useful.
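The add-and-prune cycle can be sketched in a few lines. This is a hypothetical illustration: the field names and the three-launch threshold are assumptions. Whether an instruction is concrete and reasonable remains a human judgment, so the code only enforces that a justification has been recorded.

```python
from dataclasses import dataclass


@dataclass
class TrackedItem:
    instruction: str         # the concrete action or question
    justification: str = ""  # why it matters, ideally a past launch problem
    launches_seen: int = 0   # launches this item has been part of
    times_useful: int = 0    # launches where it caught or prevented a problem


def may_add(item: TrackedItem) -> bool:
    """Admit an item only if it has an instruction and a recorded justification."""
    return bool(item.instruction.strip()) and bool(item.justification.strip())


def prune(items: list[TrackedItem], min_launches: int = 3) -> list[TrackedItem]:
    """Keep items that are still new or that have proved useful at least once."""
    return [i for i in items if i.launches_seen < min_launches or i.times_useful > 0]
```

Giving new items a few launches before they become eligible for removal follows the advice to err on the side of keeping an item until it has shown it is not needed.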
Why are production checklists helpful?
Production checklists can seem to add overhead to engineers’ jobs. However, the upfront work can save teams from future problems and help ensure a successful launch. Production checklists help:
- Remove the cognitive toil of having to remember everything
- Identify possible problems ahead of time
- Prepare resources ahead of time
- Motivate development to complete necessary items
- Prioritize key requirements vs unnecessary additions
- Ensure contingency planning, improving reliability
- Keep everyone in the loop throughout development as a centralized progress meter
How to keep your production checklist up to date
You will need to review and revise your checklists periodically to keep them useful. Be sure to revisit them at these times:
- When development on a new service starts. When mapping out a new service, consider which production checklist to use when it launches. Based on the type of service and service level, find the closest checklist you have. Review it to make sure it follows the processes and architecture you currently use. Add any service-specific requirements as you develop.
- After a launch. Take a look at the production checklist after you launch the new service. Were there any problems with the launch? Could they have been checked for beforehand? Look for checklist items that were misunderstood and filled out incorrectly. Revise these items to ensure the checklist lines up with the reality of development.
- After an incident. If an incident impacts the new service, see if any of the contributing factors could have been addressed with the checklist. If so, try to capture those items on future checklists. This task can be incorporated into your incident retrospectives.
- As part of regular review cycles. Set a schedule to review tools like runbooks and production checklists. Make sure to invite all team members who will be required to use these runbooks or checklists. Each of these people can provide insight on what to improve moving forward.
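These review triggers can be expressed as a small helper that reports why a checklist is due for review. This is a sketch under stated assumptions: the event names and the 90-day interval are hypothetical, not a recommendation from the source.

```python
from datetime import date, timedelta

# Event-driven triggers from the list above (the names are hypothetical)
REVIEW_EVENTS = {"new_service", "launch", "incident"}


def review_reasons(last_reviewed: date, today: date, events: set[str],
                   interval: timedelta = timedelta(days=90)) -> list[str]:
    """Return the reasons a checklist should be reviewed now (empty if none)."""
    reasons = sorted(e for e in events if e in REVIEW_EVENTS)
    if today - last_reviewed >= interval:
        reasons.append("scheduled")  # the regular review cycle has come due
    return reasons


print(review_reasons(date(2024, 1, 1), date(2024, 6, 1), {"incident"}))
```

Returning the reasons, instead of a bare yes or no, tells reviewers which part of the checklist to focus on.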
How Blameless can help integrate checklists
To get the most from your checklists, you need to integrate them into your workflows. Here’s how Blameless can help:
- Blameless Incident Resolution allows teams to treat each deploy like an incident and assign roles and checklists.
- Blameless Incident Retrospectives provide a hub of learning for future checklist development.
- Blameless Runbook Documentation helps richly document processes, allowing you to dive into the information behind each checklist item.
To see more of how Blameless helps you be your most reliable, check out a demo.