Cloud Solutions
April 5, 2022
DevOps

What is Chaos Engineering?

Chaos engineering is a quality assurance (QA) and software testing practice that is gaining popularity among teams working on distributed software systems.

Although still relatively new, it is becoming an integral part of today’s DevOps. To illustrate, Gartner expects that by 2023, 40% of organizations will adopt chaos engineering as part of their regular DevOps practices, reducing system downtime by 20%.

So, what is chaos engineering?

According to Forrester, it would be a misnomer to associate chaos engineering with anything chaotic, as the experiments it puts in place are tightly controlled. Instead, it is better thought of as a “vaccine” or a “flu shot”: the system is injected with a controlled dose of harm in order to prevent a bigger issue later.

In today’s article, we explain the approach and look at its main benefits and the reasons for applying it. Then, we touch upon its key principles and tools and review some recent data on how it was applied in 2021.

Before we begin… how do you picture chaos? No worries if by the end of this article you end up with an image of a monkey! There may be a reason…

In a nutshell

Broadly speaking, chaos engineering is a DevOps practice that aims to prevent unexpected system crashes by running controlled failure experiments and using the results to make the production system more resilient. It treats failures in distributed software systems as inevitable and therefore prioritizes uncovering weaknesses before they turn into outages.
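
To make the idea of a controlled failure more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular chaos tool works: `with_fault_injection`, `InjectedFault`, `get_price` and `get_price_with_fallback` are hypothetical names invented for this example. A wrapper makes a chosen fraction of calls fail on purpose, so we can check that the calling code degrades gracefully.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real outage so it is easy to tell apart in logs."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that a controlled fraction of calls fails on purpose."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0), so a rate of 0 never
        # fails and a rate of 1 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def get_price(item):
    """Hypothetical dependency standing in for a remote service call."""
    return {"book": 12.5}.get(item, 0.0)


def get_price_with_fallback(fetch, item, default=0.0):
    """The behaviour under test: fall back to a default if the call fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default
```

With a failure rate of 0.1, for example, roughly one call in ten fails. The rate is the "dose" of the vaccine: it can start small and be raised only once the system proves it can cope.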

The concept is usually attributed to Netflix, which introduced the practice some ten years ago while working on its transition to the cloud. Let’s see how it has developed since then.

Definition

A key theoretical assumption in chaos engineering is that systems are inherently chaotic, because real-world events can affect the production environment at any time. The most widely quoted definition of the concept reads:

“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

Hence, potential system issues need to be identified and addressed early, before they affect other parts of the system. As the definition also implies, the approach targets the unpredictable outcomes of rare but disruptive events, which undermine confidence in how the system behaves at scale.

This is particularly relevant to cloud systems, whose environments involve a high degree of uncertainty and many unknowns.

Benefits 

A key benefit of the approach is its potential to reduce system downtime and the costs associated with it. Because these costs can be significant, downtime can decide whether a system succeeds or fails overall.

Two of the most common metrics for measuring concrete benefits here are Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR), where:

  • MTTD is the average time it takes to identify a problem,
  • MTTR is the average time it takes to resolve it.
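
As a rough illustration, both metrics can be computed from a list of incident records. The sketch below rests on assumptions: the `Incident` fields and the sample timestamps are invented, and the time to resolve is measured from detection to resolution, although some teams measure it from the start of the incident instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the fault actually began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when normal service was restored


def mttd(incidents):
    """Average time between the start of a fault and its detection."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents):
    """Average time between detection and resolution (an assumed convention)."""
    seconds = mean((i.resolved - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Invented sample data, for illustration only.
incidents = [
    Incident(datetime(2022, 3, 1, 10, 0), datetime(2022, 3, 1, 10, 10), datetime(2022, 3, 1, 11, 0)),
    Incident(datetime(2022, 3, 8, 14, 0), datetime(2022, 3, 8, 14, 20), datetime(2022, 3, 8, 14, 50)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Tracking these two numbers before and after introducing chaos experiments is one simple way to see whether the practice is paying off.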

By shortening both, chaos engineering makes the system more reliable, which in turn helps deliver the customer experience users expect.

What are the prerequisites of the approach?

Now, let’s focus a bit on the practical side of applying the approach. As mentioned earlier, there is a particular way of doing it.

A key set of principles

Overall, there are five key principles underlying the chaos engineering approach. They rest on the idea that 1) your system has a “steady state” of performance, 2) there are real-life events that may impact it, 3) you can anticipate them via experiments, 4) you can automate these experiments and 5) you can continuously minimize their unwanted effects.

Against this background, a practical approach for implementing the method may include the following steps:

  1. Define your measurable steady state and build a hypothesis about how it will hold, i.e. select output metrics such as the system’s overall throughput, error rate, etc.
  2. Identify the real-life events relevant to you, such as server crashes, hard drive malfunctions, compromised network connections, etc.
  3. Design your production experiments to reveal any deviation from your steady state.
  4. Automate the experiments so that behavior is monitored and analyzed continuously.
  5. Based on the experiments, ensure that failures are contained and their impact is kept to a minimum.
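
The steps above can be sketched as a small simulation. Everything in it is hypothetical: `SimulatedService` is a toy stand-in for a real system, the 1% error-rate threshold is an arbitrary steady-state hypothesis, and the injected event is the crash of one replica out of three. A real experiment would use your own metrics and a dedicated tool.

```python
import random


class SimulatedService:
    """Toy system: each request is routed to one of several replicas."""

    def __init__(self, replicas=3, failover=True):
        self.healthy = [True] * replicas
        self.failover = failover

    def handle_request(self, rng):
        """Return True if the request succeeds."""
        target = rng.randrange(len(self.healthy))
        if self.healthy[target]:
            return True
        # With failover, the request is retried on any healthy replica.
        return self.failover and any(self.healthy)


def error_rate(service, rng, requests=1000):
    """Step 1 metric: the fraction of failed requests."""
    failures = sum(not service.handle_request(rng) for _ in range(requests))
    return failures / requests


def run_experiment(service, max_error_rate=0.01, seed=0):
    """Check the hypothesis that the error rate stays low if a replica dies."""
    rng = random.Random(seed)

    # Step 1: confirm the steady state before touching anything.
    baseline = error_rate(service, rng)
    if baseline > max_error_rate:
        raise RuntimeError("system is not in steady state, aborting")

    # Steps 2 and 3: inject a real-life event (one replica crashes) and measure.
    service.healthy[0] = False
    try:
        during_fault = error_rate(service, rng)
    finally:
        # Step 5: always undo the fault so the impact stays contained.
        service.healthy[0] = True

    return {
        "baseline": baseline,
        "during_fault": during_fault,
        "hypothesis_holds": during_fault <= max_error_rate,
    }


# Step 4: in practice a scheduler or pipeline would call this regularly.
print(run_experiment(SimulatedService(failover=True)))
print(run_experiment(SimulatedService(failover=False)))
```

The second run, without failover, is the interesting one: the hypothesis fails, which is exactly the kind of weakness an experiment is meant to surface before a real outage does.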

You can always build on this framework with various tools, white papers and guides.

Key tools for chaos engineering

[Image: Netflix Chaos Monkey. Source: Netflix]

Historically, one of the best-known tools for running chaos engineering experiments is Chaos Monkey, originally launched by Netflix and now open source. You may have already come across this wicked monkey image.

Despite its fame, many experts today consider Chaos Monkey limited in what it can do. In fact, there are already many modern tools for running chaos engineering experiments. The most prominent of these, especially for Kubernetes, include ChaosBlade, Chaos Mesh and Litmus.

For a more detailed comparison, please see here.

State-of-the-art 

The 2021 State of Chaos Engineering Report confirmed that the chaos engineering approach is gaining increasing prominence. This is what some of its key findings further revealed: 

  • the most frequently reported benefits of the approach are increased availability (above 99.9%) and reduced MTTR (under 1 hour for 23% of teams)
  • network attacks are the most commonly run type of experiment
  • 60% of respondents have run at least one chaos engineering attack, while 34% of them run experiments in production
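
To put the availability figure into perspective, it helps to translate a percentage into a downtime budget. The short sketch below is plain arithmetic rather than anything taken from the report. The function name is made up, and a year is assumed to have 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Downtime implied by an availability level over a given period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

At 99.9% availability the budget is about 8.76 hours per year, so every additional "nine" cuts the allowed downtime by a factor of ten.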

You can explore additional report insights here.

Conclusion

Chaos engineering is a contemporary DevOps practice, particularly suited for improving the predictability of system performance in cloud environments. The data on its application shows that a growing number of organizations use it to reduce system downtime, lower costs and improve overall customer satisfaction. If you want to learn whether and how this may apply to you, do not waste a minute more and give us a call to discuss it.

About

Cloud Solutions is a new-era consultancy that helps close the technology gap through the adoption of AWS Cloud and leading DevOps practices.

Services

AWS Cloud Consulting
DevOps Consulting
AWS Cloud & DevOps Training

Contact

Mladost 4
Sofia, Bulgaria
REG: 206023399
+359 (0) 877 999902
contact@cloudsltns.com

Copyright © 2022 Cloud Solutions. All rights reserved.