Introduction to Chaos Engineering


Chaos engineering is the practice of deliberately disrupting a system to build its resilience. This article explains the origin, principles, and benefits of the discipline.

    With the rise of microservices and distributed infrastructure, system failure is harder to control. This was less of a problem in the past, when infrastructure was hosted and managed on-premises and experienced system administrators ensured that it delivered consistently.

    Now that systems are hosted on globally distributed infrastructure, it is hard to predict which failures a system might suffer.

    A 2020 report on the hourly cost of downtime by Information Technology Intelligence Consulting (ITIC) shows that 98% of organizations said that 60 minutes of downtime costs more than $150,000, while 40% of enterprises said that the same amount of time costs them between $1 million and $5 million.

    Chaos engineering emerged to reduce costly downtime like this.

    What is Chaos Engineering

    According to Principles of Chaos, Chaos Engineering is “the discipline of experimenting on a system to build confidence in the system's capability to withstand turbulent conditions in production.”

    Gremlin defined it as “a disciplined approach to identify failures before they become outages”.

    Chaos engineering involves testing a system against a series of possible failures to find its weaknesses and measure its resilience.

    Origin/history of chaos engineering

    Chaos engineering started in 2010, when Netflix's engineering team created Chaos Monkey in response to the company's move from physical on-premises infrastructure to AWS. The tool was built to test various failure conditions and ensure that a failed component on AWS would not affect Netflix’s streaming experience.

    Netflix further improved Chaos Monkey by introducing a failure injection mode to test against more failure states and enhance the system's resilience even more. Netflix then introduced the Simian Army.

    The Simian Army includes tools that test the resiliency of AWS infrastructure with failure modes like disabling an AWS region, dropping an availability zone, introducing communication delays, and simulating network outages, along with tools that check various security conditions.

    When Netflix made Chaos Monkey open source in 2012, it said, "We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a more resilient way."

    In 2016, Gremlin introduced the first managed enterprise chaos engineering solution, which increased the adoption of chaos engineering practices across industries.

    Principles of Chaos Engineering

    The principles of chaos engineering describe the ideal step-by-step process of experimenting with failure in distributed systems to build confidence in their resilience and reliability.

    Define your system's "normal"

    The first step in designing a chaos engineering experiment for your system is to define the system's normal state.

    Define key metrics, such as the number of services that must be running, and the system behavior that indicates your system is running normally.

    To implement chaos engineering, you need to understand the thresholds that determine whether your applications and services are functioning correctly or are down.
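
    As a minimal sketch, these thresholds can be written down as data and checked in code. The metric names and limits below (three running services, a 250 ms p99 latency ceiling, a 1% error-rate ceiling) are assumptions chosen for illustration, not values from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds that describe a system's "normal"."""
    min_running_services: int
    max_p99_latency_ms: float
    max_error_rate: float

    def is_normal(self, running_services: int, p99_latency_ms: float, error_rate: float) -> bool:
        # The system counts as normal only if every threshold is respected.
        return (
            running_services >= self.min_running_services
            and p99_latency_ms <= self.max_p99_latency_ms
            and error_rate <= self.max_error_rate
        )


# Illustrative values only; real thresholds come from your own monitoring data.
steady_state = SteadyState(min_running_services=3, max_p99_latency_ms=250.0, max_error_rate=0.01)

healthy = steady_state.is_normal(running_services=3, p99_latency_ms=180.0, error_rate=0.002)
degraded = steady_state.is_normal(running_services=2, p99_latency_ms=180.0, error_rate=0.002)
print(healthy, degraded)
```

    Keeping the definition in one place means every later experiment can reuse the same check to decide whether the system is still "normal".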

    Build a hypothesis around the normal behavior

    After understanding the normal functional state of your system, you can form a hypothesis of how the components of your system will behave when there is a failure in one of them. You should consider all essential elements, including instances, throughput, latency, and I/O performance.
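
    One way to make such a hypothesis testable is to state it as a tolerance around the baseline metrics. The sketch below assumes a hypothetical hypothesis ("if one instance fails, throughput and p99 latency stay within 10% of normal") and made-up measurements; the metric names are illustrative.

```python
# Metrics where a drop is bad; every other metric is treated as "a rise is bad".
HIGHER_IS_BETTER = {"throughput_rps"}


def violated_metrics(baseline, observed, tolerance=0.10):
    """Return the metrics that moved more than `tolerance` in the bad direction."""
    violated = []
    for name, base in baseline.items():
        value = observed[name]
        if name in HIGHER_IS_BETTER:
            within = value >= base * (1 - tolerance)
        else:
            within = value <= base * (1 + tolerance)
        if not within:
            violated.append(name)
    return violated


# Made-up numbers: baseline from normal operation, observed while one instance is down.
baseline = {"throughput_rps": 1000.0, "p99_latency_ms": 200.0}
observed = {"throughput_rps": 960.0, "p99_latency_ms": 235.0}

print(violated_metrics(baseline, observed))
```

    An empty result means the hypothesis held; any metric in the list points at a weakness worth investigating.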

    Design real-world events

    Outline real-world events that are capable of causing disruptions to your system's normal behavior - events like hardware failure, server failure, software failure, and every other event that could potentially cause downtime to your system.
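
    At the application level, events like these can be approximated by wrapping a call so that it is delayed or fails on purpose. This is only a sketch: `with_faults`, `InjectedFault`, and `fetch_profile` are hypothetical names, and real hardware or server failures are usually injected with dedicated tooling rather than in-process code.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real software or server failure."""


def with_faults(func, failure_rate=0.0, delay_s=0.0, rng=None):
    """Wrap `func` so calls can be delayed or made to fail on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if delay_s > 0:
            time.sleep(delay_s)  # simulates a slow network or an overloaded host
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Stand-in for a call to a real downstream service.
    return {"id": user_id, "name": "demo"}


always_failing = with_faults(fetch_profile, failure_rate=1.0)
try:
    always_failing(42)
except InjectedFault as exc:
    print("caller saw:", exc)
```

    With `failure_rate=1.0` every call fails, which makes it easy to check how the calling code copes; lower rates simulate intermittent faults.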

    Run experiments in production

    Based on the system’s normal state and the chaotic events defined in the previous steps, run an experiment on your system in the production environment. Software tends to behave differently in different environments, so experimenting directly in production gives the most useful results for improvement.
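
    The overall shape of one experiment can be sketched as: check the normal state, inject the event, check again, and always roll back. The function and the toy in-memory "system" below are assumptions for illustration; a real experiment would call your monitoring and infrastructure APIs instead.

```python
def run_experiment(steady_state_ok, inject, rollback):
    """Run one chaos experiment and report "skipped", "passed", or "failed"."""
    if not steady_state_ok():
        # Never inject failure into a system that is already unhealthy.
        return "skipped"
    inject()
    try:
        held = steady_state_ok()
    finally:
        rollback()  # always undo the injected failure, even if the check raises
    return "passed" if held else "failed"


# Toy in-memory "system": three replicas, and it needs at least two to serve traffic.
system = {"replicas": 3}

result = run_experiment(
    steady_state_ok=lambda: system["replicas"] >= 2,
    inject=lambda: system.update(replicas=system["replicas"] - 1),
    rollback=lambda: system.update(replicas=3),
)
print(result, system)
```

    The rollback sits in a `finally` block so that the injected failure is removed even when the steady-state check itself raises an error.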

    Minimize blast radius

    Because chaos engineering experiments are run directly in the production environment, they can cause real adverse effects. To keep an experiment within your system's capacity to recover, start with a small blast radius, then increase it gradually until the experiment runs at full scale.
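
    A gradual ramp-up can be sketched as a loop over increasing percentages that stops at the first step where the system no longer copes. The instance names, the percentages, and the stand-in experiment below are all hypothetical.

```python
import random


def pick_targets(instances, percent, rng):
    """Choose `percent` of the instances at random, but always at least one."""
    count = max(1, len(instances) * percent // 100)
    return rng.sample(instances, count)


def ramp_up(instances, percents, steady_state_held, rng):
    """Widen the blast radius step by step; return the percent that broke, or None."""
    for percent in percents:
        targets = pick_targets(instances, percent, rng)
        if not steady_state_held(targets):
            return percent  # stop here instead of widening further
    return None


instances = [f"web-{i}" for i in range(20)]  # hypothetical instance names
rng = random.Random(0)

# Stand-in for a real experiment: pretend the system copes with losing up to 5 instances.
broke_at = ramp_up(instances, [5, 25, 100], lambda targets: len(targets) <= 5, rng)
print(broke_at)
```

    Returning the step that broke, rather than continuing, keeps the damage limited to the smallest radius that exposes the weakness.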

    Automate experiments

    Rather than running chaos experiments manually, automate them so that they run continuously.
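
    As a rough sketch of what automation means here, the loop below runs a set of experiments on a fixed interval and records the outcomes. The experiment names and results are hypothetical, and in practice a scheduler or a chaos engineering platform would usually take the place of this loop.

```python
import time


def run_on_schedule(experiments, rounds, interval_s, sleep=time.sleep):
    """Run every experiment once per round and collect (round, name, passed) records."""
    history = []
    for round_no in range(1, rounds + 1):
        for name, experiment in experiments.items():
            history.append((round_no, name, bool(experiment())))
        if round_no < rounds:
            sleep(interval_s)  # wait before the next round
    return history


# Hypothetical experiments; each callable returns True when the steady state held.
experiments = {
    "kill-one-instance": lambda: True,
    "add-network-latency": lambda: False,
}

history = run_on_schedule(experiments, rounds=2, interval_s=0)
failures = [record for record in history if not record[2]]
print(failures)
```

    Keeping a history of results makes regressions visible: an experiment that passed last week and fails today points at a recent change.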

    Benefits of Chaos Engineering

    Chaos engineering offers a lot of benefits to the business, the technical teams and users.

    It reduces system and application downtime

    Chaos engineering helps you identify common failures that cause frequent downtime, so that you can strengthen your system against them.

    It minimizes revenue loss due to downtime

    A more resilient system experiences less downtime, which avoids the revenue loss that downtime causes.

    It improves user experience

    Chaos engineering helps the system experience fewer outages and service disruptions, which in turn improves the user experience.

    It prepares the system against unexpected failures

    Chaos engineering allows you to test your system against possible failures and use the information from the experiments to strengthen your system against them.

    It improves confidence in the system

    Improved resilience means you can rely more on the system to deliver uninterrupted business value.

    Chaos Engineering in software engineering

    Traditional software testing typically covers only the code’s functionality, responsiveness, security, and behavior under load. This process assumes that all other infrastructural components are always in good shape; therefore, it cannot predict all possible failure modes.

    Introducing chaos engineering tests not only the application against a series of failure events but also the cloud infrastructure and the network, to ensure that the system is resilient at every layer of the stack.

    Chaos Engineering in DevOps = Continuous Chaos Engineering

    Because of the need to build, deliver, and deploy software and applications within a stipulated time frame, software engineers often do not get to test the application’s resilience to a sufficient level. Integrating chaos engineering into DevOps helps ensure that the required testing is performed continuously throughout the development process.

    Chaos engineering is adopted in DevOps by following the fundamental principles highlighted earlier, then integrating chaos testing into the CI/CD pipeline. This way, the chaos testing configuration file is deployed together with the application, and the disruption starts in each environment.
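
    In a pipeline, the outcome of the chaos experiments typically decides whether the build may proceed. The sketch below assumes a hypothetical earlier step has produced a mapping of experiment names to pass/fail results, and turns it into a process exit code that a CI/CD system can act on.

```python
import sys


def gate(results):
    """Return a process exit code: 0 if every experiment passed, 1 otherwise."""
    failed = sorted(name for name, passed in results.items() if not passed)
    for name in failed:
        print(f"chaos experiment failed: {name}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical results produced by an earlier pipeline step.
results = {"kill-one-instance": True, "add-network-latency": True}
exit_code = gate(results)
print("exit code:", exit_code)
# In a real pipeline step you would end with: sys.exit(exit_code)
```

    Most CI/CD systems treat a non-zero exit code as a failed step, so a failed experiment stops the release from moving to the next environment.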
