Chaos engineering is the practice of deliberately disrupting and breaking an application system in order to build resilience. This article explains the origin, principles, and benefits of the discipline.
With the rise of microservices and distributed infrastructure, system failure has become harder to control. This was less of a problem in the past, when infrastructure was hosted and managed on-premise, with experienced system administrators ensuring that it delivered consistently.
Now that systems are hosted on globally distributed infrastructure, it is hard to predict which failures might occur.
A 2020 hourly cost of downtime report by Information Technology Intelligence Consulting (ITIC) shows that 98% of organizations said that 60 minutes of downtime costs them more than $150,000, while 40% of enterprises said that the same amount of time costs them between $1 million and $5 million.
Chaos engineering emerged to reduce costly downtime like this.
What is Chaos Engineering
According to Principles of Chaos, Chaos Engineering is “the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production.”
Gremlin defined it as “a disciplined approach to identify failures before they become outages”.
Chaos engineering involves testing a system against a series of possible failures to determine its weaknesses and resiliency.
Origin/history of chaos engineering
Chaos engineering started in 2010, when Netflix's engineering team created Chaos Monkey in response to the company's move from physical on-premise infrastructure to AWS. The tool was built to test various failure conditions and ensure that a failed component on AWS would not affect Netflix’s streaming experience.
Netflix further improved Chaos Monkey by introducing a failure injection mode to test against more failure states and enhance the system's resilience even more. Netflix then introduced the Simian Army.
The Simian Army includes tools that test the resiliency of AWS infrastructure with failure modes like disabling an AWS region, dropping an availability zone, introducing communication delays, simulating network outages, and various other security conditions.
When Netflix made Chaos Monkey open source in 2012, it said, "We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a more resilient way."
Gremlin later introduced the first managed enterprise chaos engineering solution in 2016, which increased the adoption of Chaos engineering practices in various industries.
Principles of Chaos Engineering
The principles of chaos engineering describe the ideal step-by-step process of experimenting with failure on distributed systems to build confidence in their resilience and reliability.
Define your system's "normal"
The first step in designing a chaos engineering experiment is to define your system's normal state.
Define key metrics, such as the number of services that must be running, and the behavior that indicates your system is running normally.
To implement chaos engineering, you need to understand the thresholds for your applications and services that determine whether your system is down or functioning correctly.
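As a rough illustration, a system's "normal" can be written down as a set of thresholds and checked against a snapshot of live metrics. This is only a minimal sketch: the class names, metric names, and threshold values are all hypothetical, and in practice the snapshot would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds that describe a 'normal' system."""
    min_running_services: int
    max_p99_latency_ms: float
    max_error_rate: float  # fraction of failed requests, 0.0 to 1.0


@dataclass(frozen=True)
class Snapshot:
    """One observation of the system (assumed to come from monitoring)."""
    running_services: int
    p99_latency_ms: float
    error_rate: float


def is_normal(state: SteadyState, snap: Snapshot) -> bool:
    """Return True only when every metric is within its threshold."""
    return (
        snap.running_services >= state.min_running_services
        and snap.p99_latency_ms <= state.max_p99_latency_ms
        and snap.error_rate <= state.max_error_rate
    )


# Illustrative values only; pick thresholds from your own baseline data.
normal = SteadyState(min_running_services=5, max_p99_latency_ms=300.0, max_error_rate=0.01)
healthy = Snapshot(running_services=6, p99_latency_ms=180.0, error_rate=0.002)
degraded = Snapshot(running_services=4, p99_latency_ms=950.0, error_rate=0.08)

print(is_normal(normal, healthy))
print(is_normal(normal, degraded))
```

Keeping the definition of "normal" in one explicit object makes it easy to reuse the same check before, during, and after an experiment.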
Build a hypothesis around the normal behavior
After understanding the normal functional state of your system, you can form a hypothesis of how the components of your system will behave when there is a failure in one of them. You should consider all essential elements, including instances, throughput, latency, and I/O performance.
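One way to make such a hypothesis testable is to state it as a comparison between baseline metrics and metrics observed while a failure is active. The sketch below assumes a hypothesis of the form "when one instance fails, throughput drops by no more than 10% and p99 latency stays under 500 ms". The names and limits are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    throughput_rps: float   # requests served per second
    p99_latency_ms: float   # 99th percentile latency


def hypothesis_holds(
    baseline: Metrics,
    during_fault: Metrics,
    max_throughput_drop: float = 0.10,  # assumed tolerance: 10%
    max_latency_ms: float = 500.0,      # assumed latency ceiling
) -> bool:
    """Check the hypothesis against metrics measured during the failure."""
    drop = (baseline.throughput_rps - during_fault.throughput_rps) / baseline.throughput_rps
    return drop <= max_throughput_drop and during_fault.p99_latency_ms <= max_latency_ms


baseline = Metrics(throughput_rps=1000.0, p99_latency_ms=200.0)
resilient = Metrics(throughput_rps=950.0, p99_latency_ms=320.0)  # 5% drop
fragile = Metrics(throughput_rps=700.0, p99_latency_ms=320.0)    # 30% drop

print(hypothesis_holds(baseline, resilient))
print(hypothesis_holds(baseline, fragile))
```

A hypothesis that turns out to be false is still a useful result: it points to the component that needs strengthening.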
Design real-world events
Outline real-world events that are capable of disrupting your system's normal behavior, such as hardware failure, server failure, software failure, and any other event that could cause downtime.
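At the application level, events like these can be approximated by wrapping a call so that it sometimes fails or slows down. The following is a simplified sketch, not how Chaos Monkey or any other tool works internally. `Fault`, `InjectedFailure`, `with_fault`, and `fetch_profile` are hypothetical names introduced for this example.

```python
import random
import time
from enum import Enum
from typing import Any, Callable


class Fault(Enum):
    LATENCY = "latency"  # stands in for a slow network link
    ERROR = "error"      # stands in for a crashed server or dependency


class InjectedFailure(RuntimeError):
    """Raised for injected faults so they can be told apart from real ones."""


def with_fault(
    func: Callable[..., Any],
    fault: Fault,
    probability: float,
    rng: random.Random,
    delay_s: float = 0.05,
) -> Callable[..., Any]:
    """Wrap func so that each call is disrupted with the given probability."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < probability:
            if fault is Fault.ERROR:
                raise InjectedFailure(f"injected {fault.value} in {func.__name__}")
            time.sleep(delay_s)  # LATENCY: delay, then let the call proceed
        return func(*args, **kwargs)
    return wrapper


def fetch_profile(user_id: int) -> dict:
    """Hypothetical service call used as the injection target."""
    return {"id": user_id, "name": "example"}


flaky = with_fault(fetch_profile, Fault.ERROR, probability=1.0, rng=random.Random(42))
try:
    flaky(7)
except InjectedFailure as exc:
    print(exc)
```

Passing in a seeded `random.Random` keeps an experiment reproducible, which helps when a failure needs to be replayed.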
Run Experiments in production
Based on the system’s normal state and the chaotic events defined in the previous steps, run an experiment on your system in the production environment. Software tends to behave differently in different environments, so experimenting directly in the production environment will give the best results for improvement.
Minimize blast radius
Because chaos engineering experiments are run directly in the production environment, they are capable of causing real-time adverse effects. To ensure that the experiment stays within the resilience capacity of your system, start with a minimal blast radius, then gradually increase it until it reaches full scale.
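A gradual increase of the blast radius can be sketched as a staged ramp that stops at the first stage where the system no longer holds up. This is a minimal sketch under stated assumptions: the host names, the stage percentages, and the `run_stage` callback (which would run the experiment on the given hosts and report whether the system stayed normal) are all hypothetical.

```python
from typing import Callable, List, Sequence


def pick_targets(hosts: Sequence[str], percent: int) -> List[str]:
    """Select the first `percent` of hosts, rounding up, at least one host."""
    count = max(1, (len(hosts) * percent + 99) // 100)
    return list(hosts[:count])


def ramp(
    hosts: Sequence[str],
    stages: Sequence[int],
    run_stage: Callable[[List[str]], bool],
) -> int:
    """Run stages in order and stop at the first failure.

    Returns the largest percentage that passed, or 0 if none did.
    """
    last_ok = 0
    for percent in stages:
        if not run_stage(pick_targets(hosts, percent)):
            break  # stop widening the blast radius
        last_ok = percent
    return last_ok


hosts = [f"host-{i}" for i in range(20)]


def tolerates_up_to_five(targets: List[str]) -> bool:
    """Stand-in experiment: pretend the system survives losing 5 hosts."""
    return len(targets) <= 5


print(ramp(hosts, stages=[5, 25, 50, 100], run_stage=tolerates_up_to_five))
```

Stopping at the first failing stage keeps the damage limited to the smallest group of hosts that revealed the weakness.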
Automate experiments to run continuously
Rather than running chaos experiments manually, you should automate the process so that the experiments keep running continuously.
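In its simplest form, automation means running the same experiment repeatedly on a schedule and recording each outcome. The loop below is only a sketch of that idea. In a real setup a scheduler such as cron or a CI system would usually trigger the runs, and `run_continuously` and its parameters are hypothetical names.

```python
import time
from typing import Callable, List


def run_continuously(
    experiment: Callable[[], bool],
    rounds: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Run the experiment `rounds` times, pausing between runs.

    `sleep` is injectable so the loop can be exercised without waiting.
    """
    results: List[bool] = []
    for i in range(rounds):
        results.append(experiment())
        if i < rounds - 1:
            sleep(interval_s)
    return results


def always_passes() -> bool:
    """Stand-in experiment that always reports a healthy system."""
    return True


# interval_s=0.0 keeps this demo instant; a real interval might be hours.
print(run_continuously(always_passes, rounds=3, interval_s=0.0))
```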
Benefits of Chaos Engineering
Chaos engineering offers many benefits to the business, the technical teams, and users.
It reduces system and application downtime
Chaos engineering helps you identify the common failures that cause frequent downtime, so that you can strengthen your system against them.
It minimizes revenue loss due to downtime
Improved system resilience means the system experiences less downtime, which avoids the loss of revenue that downtime causes.
It improves user experience
Chaos engineering helps the system experience fewer outages and service disruptions, which in turn improves the user experience.
It prepares the system against unexpected failures
Chaos engineering allows you to test your system against possible failures, and then use the information from the experiments to strengthen your system against such failures.
It improves confidence in the system
Improved resilience means you can rely more on the system to deliver uninterrupted business value.
Chaos Engineering in software engineering
Traditional software testing only covers the code’s functionality, responsiveness, security, and behavior under load. This process assumes that all other infrastructure components are always in good shape; therefore, it cannot predict all possible failure modes.
Chaos engineering tests not only the application against a series of failure events, but also the cloud infrastructure and the network, to ensure that the system is resilient at every layer of the application stack.
Chaos Engineering in DevOps = Continuous Chaos Engineering
Because software and applications must be built, delivered, and deployed within a stipulated time frame, software engineers often do not get to test the application’s resilience to a sufficient level. Integrating chaos engineering into DevOps helps ensure that the required testing is performed continuously throughout the development process.
Chaos engineering is adopted in DevOps by following the fundamental principles described earlier and then integrating chaos testing into the CI/CD pipeline. The chaos testing configuration file is then deployed together with the application, so that disruption starts in each environment.
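A pipeline step can treat chaos experiments like any other test stage: run them after deployment to an environment and fail the stage if any experiment fails. The sketch below assumes each experiment is a callable that returns True on success. `run_chaos_gate` and the experiment names are hypothetical and do not belong to any specific CI/CD product.

```python
import sys
from typing import Callable, Dict


def run_chaos_gate(experiments: Dict[str, Callable[[], bool]]) -> int:
    """Run every experiment and return a process exit code.

    0 means all experiments passed; 1 means at least one failed.
    """
    failed = [name for name, experiment in experiments.items() if not experiment()]
    for name in failed:
        print(f"chaos experiment failed: {name}", file=sys.stderr)
    return 1 if failed else 0


# Stand-in experiments; real ones would inject a fault and check metrics.
experiments = {
    "kill-one-instance": lambda: True,
    "add-network-latency": lambda: False,
}

exit_code = run_chaos_gate(experiments)
print(exit_code)
# In a pipeline script, pass exit_code to sys.exit() so that a nonzero
# value marks the stage as failed.
```

Most CI/CD systems mark a step as failed when its process exits with a nonzero code, which is why the gate reports its result that way.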