Chaos engineering is a quality assurance (QA) and software testing practice that is gaining popularity among teams that build and run distributed software systems.
Although still relatively new, it is becoming an integral part of DevOps. Gartner, for example, expects that by 2023, 40% of organizations will adopt chaos engineering as part of their regular DevOps practices to reduce system downtime by 20%.
So, what is chaos engineering?
According to Forrester, the name is something of a misnomer: the experiments chaos engineering puts in place are carefully controlled rather than chaotic. It is better thought of as a “vaccine” or a “flu shot”, in which a system is injected with a small, controlled dose of harm in order to prevent a bigger problem later.
In this article, we explain the approach, its major benefits and the reasons for applying it. We then touch upon its main principles and tools and look at some recent data on how it was being applied in 2021.
Before we begin… How do you picture chaos? No worries if by the end of this article you end up with an image of a monkey! There may be a reason…
In a nutshell
Broadly speaking, chaos engineering is a DevOps practice that aims to prevent unexpected system crashes by running controlled failure experiments that expose weaknesses before they cause an outage. It treats failures in distributed software systems as inevitable and therefore prioritizes preparing for them in advance.
The concept is usually attributed to Netflix, which introduced the practice some ten years ago while moving its infrastructure to the cloud. Let’s see how it has developed since.
Definition
A key theoretical assumption in chaos engineering is that systems are inherently chaotic, because real-world events can affect the production environment at any time. The most widely quoted definition of the concept reads:
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
Hence, potential issues need to be identified and addressed early, before they affect other parts of the system. As the definition implies, the approach targets the unpredictable outcomes of rare but disruptive events, which undermine confidence in the system’s behavior at scale.
This is particularly relevant to cloud systems, whose environments contain many unknowns and therefore carry a high degree of uncertainty.
Benefits
A key benefit of the approach is its potential to reduce system downtime and the associated costs. As these costs can be significant, reducing them can affect whether the system as a whole succeeds or fails.
Two of the most common metrics for measuring concrete benefits here are Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR), where:
- MTTD measures the average time taken to identify a problem,
- MTTR measures the average time taken to resolve it.
By shortening both, chaos engineering improves the reliability of the system, which in turn helps deliver the experience customers expect.
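As a rough illustration, the two metrics above can be computed from a list of incident records. The sketch below is only an assumption about how such records might look: the `Incident` class, its field names and the sample timestamps are made up for the example, and MTTR is measured here from detection to resolution, which is one convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when normal service was restored


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return mean_duration([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolve, measured here from detection to resolution."""
    return mean_duration([i.resolved - i.detected for i in incidents])


# Made-up sample data: two incidents.
incidents = [
    Incident(datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 10, 10), datetime(2021, 3, 1, 11, 10)),
    Incident(datetime(2021, 3, 8, 14, 0), datetime(2021, 3, 8, 14, 20), datetime(2021, 3, 8, 14, 50)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Tracking these two numbers before and after a series of chaos experiments is one simple way to check whether the practice is paying off.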
What are the prerequisites of the approach?
Now, let’s turn to the practical side of applying the approach. As mentioned earlier, the experiments are carried out in a controlled, structured way.
A key set of principles
Overall, five key principles underlie the chaos engineering approach. They rest on the idea that 1) your system has a measurable “steady state” of performance, 2) real-life events may disturb it, 3) you can anticipate their impact via experiments, 4) you can automate these experiments and 5) you should continuously minimize their unwanted effects.
Against this background, a practical approach for implementing the method may include the following steps:
- Define your measurable steady state and build a hypothesis about how it is maintained, i.e. select output metrics such as the system’s overall throughput, error rate, etc.
- Identify relevant real-life events such as server crashes, hard drive malfunctions, compromised network connections, etc.
- Design your production experiments to detect any deviation from your steady state.
- Automate experiments to continuously monitor and analyze behavior.
- Based on the experiments, ensure that the impact of failures is contained and kept to a minimum.
You can always build on this framework with various tools, white papers and guides.
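The steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a real chaos tool: `ToyService`, `error_rate` and `run_experiment` are hypothetical names, the "system" is an in-memory round-robin balancer, the injected event is a replica crash, and the 1% error-rate threshold is an arbitrary choice for the hypothesis.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: replicas behind a round-robin balancer."""
    replicas: int = 3
    retries: int = 1                       # extra attempts on the next replica
    down: set = field(default_factory=set) # indices of crashed replicas
    _next: int = 0                         # round-robin pointer

    def handle_request(self) -> bool:
        """Return True if any attempted replica is up."""
        for _ in range(1 + self.retries):
            replica = self._next % self.replicas
            self._next += 1
            if replica not in self.down:
                return True
        return False


def error_rate(service: ToyService, requests: int = 1000) -> float:
    """Steady-state metric: fraction of failed requests."""
    failures = sum(not service.handle_request() for _ in range(requests))
    return failures / requests


def run_experiment(service: ToyService, crashed: set, max_error_rate: float = 0.01) -> dict:
    """Inject a replica crash, compare the metric with the hypothesis, then roll back."""
    baseline = error_rate(service)         # measure the steady state first
    injected = crashed - service.down      # only touch replicas that were healthy
    service.down |= injected               # inject the real-world event
    try:
        observed = error_rate(service)
    finally:
        service.down -= injected           # always roll back the fault
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": observed <= max_error_rate,
    }


resilient = run_experiment(ToyService(retries=1), crashed={0})
fragile = run_experiment(ToyService(retries=0), crashed={0})
print("with retry:   ", resilient)
print("without retry:", fragile)
```

In this toy setup, the service that retries on another replica keeps its steady state when one replica crashes, while the one without retries does not. In a real system, the same loop would run on a schedule against a small share of traffic so that any failure it uncovers stays contained.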
Key tools for chaos engineering
Historically, one of the best-known tools for chaos engineering is Chaos Monkey, originally built by Netflix and now open source. You may have already come across its mischievous monkey image.
Despite its fame, many experts today consider Chaos Monkey limited in its application. Many modern tools now support chaos engineering experiments; among the most prominent, especially for Kubernetes, are ChaosBlade, Chaos Mesh and Litmus.
For a more detailed comparison, please see here.
State-of-the-art
The 2021 State of Chaos Engineering Report confirmed that chaos engineering is gaining prominence. Some of its key findings:
- the most frequently reported benefits of the approach are increased availability (>99.9%) and reduced MTTR (under 1 hour for 23% of teams)
- network attacks are the most commonly applied type of experiment
- 60% of respondents have run at least one chaos engineering attack, and 34% run experiments in production
You can explore additional report insights here.
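To put the availability figure in perspective, an availability target can be converted into a yearly downtime budget. The short sketch below assumes a 365-day year; the function name is made up for the example, and the other targets are included only for comparison.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def yearly_downtime_hours(availability_percent: float) -> float:
    """Maximum downtime per year that still meets the availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for target in (99.0, 99.9, 99.99):
    hours = yearly_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

At 99.9%, the budget is roughly 8.76 hours per year, so teams above that threshold are down for less than a working day over twelve months.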
Conclusion
Chaos engineering is a contemporary DevOps practice that is particularly suited to improving the predictability of system performance in cloud environments. The data on its adoption suggests that more and more organizations apply it to reduce system downtime, lower costs and improve overall customer satisfaction. If you want to learn whether and how this may apply to you, do not waste another minute and give us a call to discuss it.