How to Classify Incidents
Originally published on Failure is Inevitable.
What is incident classification?
Incident classification is a standardized way of organizing incidents into established categories. Incidents can include outages caused by errors in code, hardware failures, resource deficits, or anything else that disrupts normal operations. Each new incident should fit into a category based on the areas of the service affected and into a tier that ranks its severity. Each of these classifications should have an established response procedure associated with it.
In this blog post, we’ll look at some benefits of classifying incidents, how classification differs from incident triage, how to set up your own classification system, and how ITIL handles incident classification as an example.
Why classify incidents?
Having a robust classification system is beneficial for many reasons:
- Improves triage by ensuring you respond to the most critical incidents first
- Determines who should be alerted and what roles they should play in resolution
- Helps with consistent responses, saving time and toil, and reducing confusion about how other people will proceed
- Measures the expected impact of incidents by type for longer term planning
- Identifies patterns in incident occurrences to prioritize preemptive fixes
Incident Severity vs Priority
When classifying an incident, assessing the impact that the incident will have on your service is essential to responding properly. Is it causing a small delay in loading a page, or is it causing total outages across the site? Without understanding the severity of the incident, you won’t understand the time constraints for your response or the consequences of prioritizing or de-prioritizing the issue in light of other work.
However, the severity of the incident doesn’t entirely dictate the priority of the incident, which is where it falls on the “to-do list” of those responding. Although high severity incidents will likely demand a quick response, circumstances such as development cycles and resource availability could also put other projects ahead. Conversely, a low severity incident may only need a quick fix with few resources, making it a high priority target.
In general, incident classification provides valuable information for prioritizing incidents, but is separate from the triage process itself. Severity can be fluid, assessed differently from different perspectives. To resolve the highest priority incidents as quickly as possible, severity must be incorporated into a larger context.
How to create incident categories
1. Determine what types of responses are required
Ultimately, you want each classification of incident to map to a particular response, such that knowing the classification is enough to know what process will need to be implemented. Working backwards, a good way to come up with classifications is to think about what different responses could be necessary.
Think about who would need to be contacted for incidents in different service areas. These lines of ownership can help establish categories of incidents. These can be broken down into subsections by considering when different resources and procedures, like playbooks, should be utilized. Another useful technique is tagging incidents with keywords, such as “hardware,” “recurring,” or “related to sprint [xyz]”. This can help create distinction between incidents that still fall in the same service area and connection between incidents in different categories, helping find patterns when reviewing incidents.
After you have categories of service areas, create tiers of how severely those areas could be impacted. When would you need to escalate a response, contacting more people and deploying more resources? Looking at your SLOs and SLAs can help you determine and prioritize severity based on customer impact, giving insight into when escalations should occur. For example, if you have an SLO for median page load time, you could break down your levels of severity based on how much the incident increases this metric.
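The page load example could be sketched roughly as follows. The tier names (`SEV1` to `SEV4`) and the percentage cutoffs are illustrative assumptions, not values from this article; your own SLOs should set the real boundaries.

```python
# Hypothetical cutoffs: (minimum fractional increase over the SLO target, tier).
# Ordered from most to least severe so the first match wins.
SEVERITY_THRESHOLDS = [
    (1.00, "SEV1"),  # median load time has at least doubled
    (0.50, "SEV2"),  # at least 50% slower than the target
    (0.10, "SEV3"),  # at least 10% slower than the target
]


def severity_from_page_load(slo_target_ms: float, observed_median_ms: float) -> str:
    """Map the relative increase in median page load time to a severity tier."""
    if slo_target_ms <= 0:
        raise ValueError("slo_target_ms must be positive")
    increase = (observed_median_ms - slo_target_ms) / slo_target_ms
    for min_increase, tier in SEVERITY_THRESHOLDS:
        if increase >= min_increase:
            return tier
    return "SEV4"  # below the smallest cutoff: lowest tier


if __name__ == "__main__":
    # A 1000 ms target with a 1600 ms observed median is a 60% increase.
    print(severity_from_page_load(1000, 1600))
```

Keeping the cutoffs in one table makes them easy to revise when you review your classifications later.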
2. Set metrics to classify incidents into categories
Once you have a matrix of categories of impact and tiers of severities, it’s important to have clearly defined metrics for reliably classifying new incidents. These should be objective so any team member will classify the incident in the same way. Sometimes several metrics should be combined formulaically to determine severity. For example, you could look at the median amount of page load delay multiplied by the frequency that the particular page is loaded. SLIs and SLOs can help you find which metrics indicate customer impact most directly.
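As a minimal sketch of combining metrics formulaically, the snippet below multiplies median delay by load frequency to get the total delay added per minute. The function names, cutoffs, and sample numbers are assumptions made up for illustration.

```python
def impact_score(median_delay_ms: float, loads_per_minute: float) -> float:
    """Total user-facing delay added per minute, in milliseconds."""
    return median_delay_ms * loads_per_minute


def severity_from_score(score: float) -> str:
    """Translate an impact score into a tier using hypothetical cutoffs."""
    if score >= 1_000_000:
        return "SEV1"
    if score >= 100_000:
        return "SEV2"
    if score >= 10_000:
        return "SEV3"
    return "SEV4"


if __name__ == "__main__":
    # A small delay on a busy page can outrank a large delay on a quiet one.
    busy_checkout = impact_score(median_delay_ms=200, loads_per_minute=6000)
    quiet_settings = impact_score(median_delay_ms=2000, loads_per_minute=10)
    print(severity_from_score(busy_checkout), severity_from_score(quiet_settings))
```

Because the score depends only on measured values, two team members looking at the same data should arrive at the same tier.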
Where possible, use metrics that can be automatically monitored to trigger alerts. Because classifying an incident may require more analysis, you shouldn’t aim for fully automated classification. Instead, these metrics help alert people to the existence of the incident. Use registries of service ownership and system architecture maps to determine the first responder and classifier.
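One way to picture this split between automated alerting and human classification is sketched below. The ownership registry, service names, and on-call aliases are hypothetical; in practice they would come from your own service catalog.

```python
from typing import NamedTuple, Optional


class Owner(NamedTuple):
    team: str
    on_call_alias: str


# Hypothetical ownership registry keyed by service name.
SERVICE_OWNERS = {
    "checkout": Owner("payments", "payments-oncall"),
    "search": Owner("discovery", "discovery-oncall"),
}
# Assumed catch-all for services missing from the registry.
FALLBACK_OWNER = Owner("sre", "sre-oncall")


def build_alert(service: str, metric: str, value: float, threshold: float) -> Optional[dict]:
    """Return an alert for the service owner if the metric exceeds its threshold."""
    if value <= threshold:
        return None  # metric is healthy, nothing to alert on
    owner = SERVICE_OWNERS.get(service, FALLBACK_OWNER)
    return {
        "service": service,
        "metric": metric,
        "value": value,
        "assignee": owner.on_call_alias,
        "classification": None,  # left for the responder to fill in after analysis
    }


if __name__ == "__main__":
    print(build_alert("checkout", "median_page_load_ms", 1800, 1200))
```

The alert deliberately leaves `classification` empty: the monitor only reports that something is wrong and who should look at it.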
3. Integrate your classifications into your incident response system
Now that you have your classifications and understand which metrics delineate them, it’s time to integrate them into a larger incident response system. Set your alerting tools to notify team members based on the incident’s classification. Codify your runbooks and playbooks based on which classes of incidents they apply to. You may want to have automated runbooks start to execute in response to a new incident’s classification, allowing for a fast and consistent response.
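A simple way to express this mapping is a lookup table from classification to response plan. The sketch below is an assumption-laden illustration: the categories, runbook names, and notification targets are invented, and a real setup would call your alerting and automation tools instead of returning a plain object.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResponsePlan:
    notify: Tuple[str, ...]     # who gets alerted
    runbook: str                # which runbook applies
    auto_execute: bool = False  # whether the runbook starts automatically


# Hypothetical plans keyed by (category, severity).
RESPONSE_PLANS = {
    ("database", "SEV1"): ResponsePlan(
        notify=("db-oncall", "incident-commander"),
        runbook="db-failover",
        auto_execute=True,
    ),
    ("database", "SEV3"): ResponsePlan(notify=("db-oncall",), runbook="db-slow-queries"),
    ("frontend", "SEV2"): ResponsePlan(notify=("web-oncall",), runbook="cdn-rollback"),
}


def plan_for(category: str, severity: str) -> ResponsePlan:
    """Look up the response plan for a classification, failing loudly on gaps."""
    try:
        return RESPONSE_PLANS[(category, severity)]
    except KeyError:
        raise LookupError(f"no response plan defined for {category}/{severity}") from None


if __name__ == "__main__":
    plan = plan_for("database", "SEV1")
    print(plan.notify, plan.runbook, plan.auto_execute)
```

Raising an error for a missing entry surfaces gaps in the matrix during testing rather than in the middle of an incident.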
Many steps of your incident response process can be determined by the incident classification, reducing the mental toil of making these choices in the heat of the moment. Response systems like checklists, assigned roles for responders, and war rooms can be created based on the classification. Tools such as Blameless can automate the toil from these processes, making resolving incidents faster and more consistent.
As you work through the incident, create a postmortem or incident retrospective document to log the procedure for further review. Have standards for what details should be collected and included in the retrospective based on the incident’s classification. Tools like Blameless can help you automatically collect and organize this data into customizable structures.
4. Review, learn, and revise
The golden rule of reliability is that failure is inevitable, so plan for your classifications to need ongoing revision. Have regular sessions to review incident retrospectives in the context of the classification system. Look for areas of ambiguity where incidents could fall into multiple categories—could there be clearer rules to sort them? Look for incidents where a response differed from what was dictated by the classification—was it misclassified? Or should the response playbook for that classification change?
There will be incidents that don’t fall cleanly into any established category. Trying to determine how to classify these novel incidents can slow response procedures or lead to suboptimal responses. Review these incidents and see how existing classifications can be expanded to include them or if new categories need to be established. Determine a set procedure for unknown incidents based on the categories they fall closest to so a response can begin without unnecessary hesitation.
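As one possible way to find the closest category for a novel incident, you could compare its tags with the tags typical of each category. The sketch below uses Jaccard similarity, which is our choice for illustration rather than anything prescribed here; the category tags and the `min_similarity` cutoff are also assumptions.

```python
# Hypothetical tags that typically appear on incidents in each category.
CATEGORY_TAGS = {
    "hardware": {"disk", "power", "network-card", "datacenter"},
    "database": {"replication", "slow-query", "disk", "connection-pool"},
    "frontend": {"page-load", "javascript", "cdn"},
}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_category(tags: set, min_similarity: float = 0.2) -> str:
    """Return the best-matching category, or 'unclassified' if nothing is close."""
    best_name, best_score = max(
        ((name, jaccard(tags, known)) for name, known in CATEGORY_TAGS.items()),
        key=lambda pair: pair[1],
    )
    return best_name if best_score >= min_similarity else "unclassified"


if __name__ == "__main__":
    print(closest_category({"replication", "connection-pool", "timeout"}))
```

An incident that lands in `unclassified` is a good candidate for your next review session, since it may point to a missing category.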
As you use your classification system, you’ll build up a valuable history of classified incidents.
Look for patterns such as:
- Which services have the most incidents?
- Where are the most severe incidents occurring?
- How do resolution timelines correspond with severity in different categories?
By analyzing this data, you’ll gain valuable insights into where reliability development efforts could be focused.
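The pattern questions above can be answered with a few lines of analysis over your incident history. The record fields and sample data in this sketch are made up; substitute whatever your incident tracker exports.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical incident history; field names are assumptions.
HISTORY = [
    {"service": "checkout", "severity": "SEV1", "minutes_to_resolve": 90},
    {"service": "checkout", "severity": "SEV3", "minutes_to_resolve": 20},
    {"service": "search", "severity": "SEV2", "minutes_to_resolve": 45},
    {"service": "checkout", "severity": "SEV1", "minutes_to_resolve": 150},
]


def incidents_per_service(history):
    """Which services have the most incidents?"""
    return Counter(incident["service"] for incident in history)


def severe_incidents_per_service(history, severe_tier="SEV1"):
    """Where are the most severe incidents occurring?"""
    return Counter(i["service"] for i in history if i["severity"] == severe_tier)


def median_resolution_by_class(history):
    """How do resolution times vary by (service, severity)?"""
    grouped = defaultdict(list)
    for incident in history:
        key = (incident["service"], incident["severity"])
        grouped[key].append(incident["minutes_to_resolve"])
    return {key: median(times) for key, times in grouped.items()}


if __name__ == "__main__":
    print(incidents_per_service(HISTORY).most_common(1))
    print(severe_incidents_per_service(HISTORY))
    print(median_resolution_by_class(HISTORY))
```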
ITIL Incident Classification System
As an example, let’s look at the ITIL incident classification system. The ITIL system outlines a classification process using two factors: the category of the incident and the priority of the incident.
First, determine the category of the incident by looking at the service area affected. This can be delineated by considering who should be contacted as an owner of the service. Depending on your system architecture, where these lines are drawn can change. Perhaps one team has ownership of all hardware operations, making “hardware” a useful incident category. If hardware operations are divided by the service using the hardware, then breaking categories down by service makes more sense.
Next, determine the incident’s priority. As we discussed earlier, an incident’s severity is a factor in priority, but isn’t the only factor. In the ITIL system, priority is encoded in the incident’s classification and is based on two factors: impact and urgency. Impact is like severity: you assess the size of the disruption the incident will cause to normal operations. Urgency looks at the rate at which this disruption increases if the incident goes unresolved. The impact and urgency of an incident can be combined into a matrix showing the incident’s overall priority.
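To make the matrix idea concrete, here is a sketch of one commonly used 3x3 layout, where priority runs from 1 (most urgent) to 5 (least). The three levels and the specific priority numbers are assumptions for illustration; each organization defines its own matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> int:
    """Combine impact and urgency into a priority from 1 (highest) to 5 (lowest).

    With HIGH=1 and LOW=3, adding the two levels and subtracting 1 yields the
    assumed layout: high/high -> 1, medium/medium -> 3, low/low -> 5.
    """
    return int(impact) + int(urgency) - 1


if __name__ == "__main__":
    # Print the full matrix, one row per impact level.
    for impact in Level:
        row = [priority(impact, urgency) for urgency in Level]
        print(f"{impact.name:<6} {row}")
```

In this layout, a high-impact incident that is not getting worse (low urgency) receives the same priority as a low-impact incident that is escalating quickly, which reflects the point that severity alone doesn’t dictate priority.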
To see how Blameless can help you learn the most from your incidents, try us out for free.