Chaos engineering

The complexity of IT solutions and services continues to grow rapidly. Many companies use microservice architectures in which hundreds or even thousands of microservices are in constant communication with one another, and it is almost impossible to test every one of these interactions. Complex systems like this are more prone to failures, which can lead to downtime in production, and downtime can be extremely costly for companies. 

According to Statista, an hour of server downtime costs between 301,000 and 400,000 U.S. dollars on average. To avoid such losses and prevent expensive outages, companies are increasingly turning to resilience testing methods, such as chaos engineering, to ensure system stability and readiness under real-life conditions. 

As mentioned in Gremlin’s 2021 State of Chaos Engineering Report, both larger organizations and smaller businesses are adopting the practice to improve the reliability of their applications:

“Chaos Engineering is becoming more and more popular and improving. Netflix and Amazon, the creators of Chaos Engineering, are cutting edge, large organizations, but there is also adoption from more established organizations and smaller teams.” 

Resilience testing belongs to the category of non-functional testing, and its goal is to ensure a system's ability to withstand stress and other challenging factors, continue performing its core functions, and avoid data loss. Samir Nasser, Executive IT Specialist at IBM, defines software resilience as:

“Software solution resiliency refers to the ability of a solution to absorb the impact of a problem in one or more parts of a system, while continuing to provide an acceptable service level to the business.” 

Software testing methods

Netflix began practicing resilience testing back in 2008 after deciding to move its data center to the cloud—provided by Amazon Web Services (AWS). As part of their resilience testing, Netflix created a tool that would cause breakdowns in their production environment—the environment used by their customers. By doing this, they could verify that a server failure would not have a severe impact on customers. This technique became known as chaos engineering.

Brief history of chaos engineering

  • 2010: The Netflix engineering team created Chaos Monkey, a resiliency tool that shuts down random instances in the production environment to test Amazon's cloud solution.
  • 2011: The Simian Army was born, including tools such as Chaos Monkey, Latency Monkey, Conformity Monkey, Security Monkey, Doctor Monkey, and Janitor Monkey. 
  • 2012: Netflix shared the source code for Chaos Monkey on GitHub, saying that they "have found that the best defense against major unexpected failures is to fail often".
  • 2013: Netflix created the Chaos Kong tool, which could shut down an entire AWS region. 
  • 2014: Netflix created a new role: the Chaos Engineer.
  • 2016: Gremlin, the world's first managed enterprise Chaos Engineering solution, was founded. 

Understanding chaos engineering

Chaos engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. Using chaos engineering may be as simple as manually running kill -9 on a box inside your staging environment to simulate failure of a service. Or, it can be as sophisticated as automatically designing and carrying out experiments in a production environment against a small but statistically significant fraction of live traffic.
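The manual end of that spectrum takes only a few lines of shell. In the sketch below, a background `sleep` stands in for a long-running service (in a real staging environment you would target one of your actual service processes), and `kill -9` terminates it with no chance of a graceful shutdown:

```shell
# Minimal sketch: hard-kill a process to mimic a sudden service crash.
# A background `sleep` stands in for the service here; in staging you
# would target a real service process instead.
sleep 300 &
PID=$!

kill -9 "$PID"            # SIGKILL cannot be caught: no graceful shutdown
wait "$PID" 2>/dev/null   # reap the process and capture its exit status
echo "exit status: $?"    # 137 = 128 + 9, i.e. terminated by SIGKILL
```

An exit status of 137 confirms the process died from SIGKILL rather than exiting cleanly, which is exactly the kind of abrupt failure the experiment wants to observe.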

Applying chaos engineering improves the resilience of a system. By designing and executing chaos engineering experiments, you will learn about weaknesses in your system that could potentially lead to outages that cause customers harm. You can then address those weaknesses proactively, going beyond the reactive processes that currently dominate most incident response models.

How does chaos engineering differ from testing?

The primary difference between chaos engineering and other approaches is that chaos engineering is a practice for generating new information, while other approaches, like fault injection, test one condition. Tests are typically binary and determine whether a property is true or false. This does not generate new knowledge about the system, it just assigns valence to a known property of it.

Examples of inputs for chaos experiments:

  • Simulating the failure of an entire region or datacenter.
  • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
  • Injecting latency between services for a select percentage of traffic over a predetermined period of time.
  • Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
  • Code insertion: adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
  • Time travel: forcing system clocks to fall out of sync with each other.
  • Executing a routine in driver code emulating I/O errors.
  • Maxing out CPU cores on an Elasticsearch cluster.
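Some of these inputs need no special tooling. Maxing out CPU cores, for instance, can be approximated in plain shell by spinning up one busy loop per worker and bounding each with `timeout`. This is a toy stand-in for a real CPU attack; the worker count and duration below are arbitrary illustration values:

```shell
# Toy CPU attack: one busy loop per worker, each capped by `timeout`.
# WORKERS and DURATION are arbitrary illustration values.
WORKERS=2
DURATION=2   # seconds

for _ in $(seq "$WORKERS"); do
  timeout "$DURATION" sh -c 'while :; do :; done' &   # pure busy loop
done
wait   # block until every worker's timeout expires
echo "all $WORKERS workers finished"
```

Dedicated tools like stress-ng do the same thing with far more control (per-core pinning, I/O and memory stressors), but the principle is identical.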

Example of a chaos engineering experiment using Apimation, Docker containers, and chaos tools

Chaos engineering tools used for the experiment:

  • Gremlin — The first hosted Chaos Engineering service, designed to improve web-based system reliability. Offered as SaaS (Software as a Service), Gremlin tests system resiliency using one of three attack modes. Users provide system inputs to determine which type of attack will yield optimal results. Gremlin can also be automated within CI/CD pipelines and integrated with Kubernetes clusters and public clouds.
  • Pumba — A command-line tool that performs chaos testing on Docker containers. With Pumba, you purposely crash the Docker containers running the application to see how the system reacts. You can also stress-test container resources such as CPU, memory, file system, and input/output.

Guidelines for creating the chaos experiment

When creating a chaos experiment, it’s important you follow certain guidelines:

  1. Pick a hypothesis
  2. Choose the scope of the experiment
  3. Identify the metrics you’re going to watch
  4. Notify the organization
  5. Run the experiment
  6. Analyze the results
  7. Increase the scope
  8. Automate

We will put these guidelines into practice in the chaos experiment example below.

Chaos experiment example:

For our chaos experiment, we will work with a simple Dockerized HTTP web service consisting of two Docker containers: a Ruby on Rails web application and a PostgreSQL database. We will use Apimation, an API testing tool, for load testing and HTTP request generation from multiple "clients" to simulate live traffic and gather results. In the Rails web container, we have created a simple UsersAPI with createUsers, readUsers, deleteUsers, and updateUsers operations. Diagram 1 shows all of the components and processes used in this experiment.

{
     "user": {
          "username": "Test",
          "password": "Password123"
     }
}
Ruby on Rails web application and PostgreSQL database
Diagram 1
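As a rough sketch, the two-container setup from Diagram 1 could be described with a Docker Compose file along these lines (the service names, Postgres image tag, ports, and credentials here are illustrative assumptions, not values taken from the experiment):

```yaml
# Illustrative sketch only: names, ports, and credentials are assumptions.
services:
  web:                    # Ruby on Rails application container
    build: .
    ports:
      - "3000:3000"       # Rails default port
    depends_on:
      - db
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/app
  db:                     # PostgreSQL database container
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: postgres
```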

Our Apimation CreateUser.yaml file will look like this:

test: step
name: CreateUser
collection:
  name: UserCrud
method: POST
type: raw
url: $baseURL/users
headers:
  Content-Type: application/json
  Accept: application/json
body:
  { 
    "user": {
      "username": "apimationUser1", 
      "password": "password123"
      }
  }
assertQuick:
  status: 201 Created
greps:
  - varname: $userId
    type: json
    expression: $.id

And our test case CRUD-tests.yaml file will look like this:

test: case
name: CRUD-tests
description: Crud tests
casedetails:
  vars:
    $user: user
  defaultCollection: UserCrud
  steps:
  - name: CreateUser
  - name: ReadUser
  - name: UpdateUser
  - name: DeleteUser

Before we begin conducting any tests, we have to establish baseline results to compare our experiment results against later on. The baseline question in this case is: how many requests per second can our web application withstand before it crashes and stops processing requests? We can find this out by creating an incremental load test in Apimation.

test: load
name: increment_LoadTest_100
loadtype: increment
details:
    testCase: CRUD-tests
    environment: Staging 
    startRPS: 10
    maxDuration: 10m
    incrementRPS: 1
    incrementInterval: 20s
    warmUp: 1s
    workerType: loc

After running our incremental load test, we verify the results in an Apimation-generated report. In our case, 20 RPS (requests per second) was too much for our local web service, so we chose 15 RPS as the baseline. After conducting a constant load test at 15 RPS for 150 seconds, we see steady results in the Apimation report, with an average request delay of 103.49 ms:

Apimation load test report
Apimation report

The next step is to introduce some “chaos” into our web service. We will conduct network attacks using the Pumba tool and performance attacks using the Gremlin tool, while Apimation sends a constant load of 15 RPS. 

For network attacks there are many possible scenarios—traffic delay, traffic loss, packet corruption, traffic limitations, etc. For our example, we will do a 50ms traffic delay, which we can execute with the following command in our CLI tool Pumba:

pumba netem --duration 160s --tc-image <pumba image> delay --time 50 --distribution normal <docker container>

To test performance, we will run a CPU attack set to 20 percent. Gremlin offers many attack options and a user-friendly UI. First, you choose what to attack; for us, it's the Rails web container. Next, you choose the type of attack and provide the test details. Gremlin also generates reports, which, together with our Apimation report, give us a good amount of data for further analysis.

Gremlin report 1
Gremlin report 1
Gremlin report 2
Gremlin report 2

Experiment results

After each experiment, we can gather metrics such as average sent requests and average delay time from our Apimation-generated load report, and then analyze them against our baseline results. In this case, with a 50 ms network delay introduced, requests begin failing 40 seconds into the test, with an average delay of more than 15 seconds.

Network traffic delay
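This kind of baseline comparison is easy to automate. The sketch below uses the numbers from this experiment (a roughly 103 ms baseline delay versus roughly 15 s under the network-delay attack) and flags the run when average delay grows beyond an arbitrary 10x threshold:

```shell
# Compare experiment latency against the baseline; the threshold is arbitrary.
BASELINE_MS=103      # baseline average delay from the constant-load run
EXPERIMENT_MS=15000  # average delay observed during the network-delay attack
THRESHOLD=10         # flag the run if delay grows more than 10x

if [ "$EXPERIMENT_MS" -gt $((BASELINE_MS * THRESHOLD)) ]; then
  echo "regression: average delay grew more than ${THRESHOLD}x over baseline"
else
  echo "ok: delay within ${THRESHOLD}x of baseline"
fi
```

A check like this can run after every automated experiment, turning the manual report comparison into a pass/fail signal for the team.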

Conclusion

Chaos engineering is an important step in the software development life cycle because it demonstrates a system’s resiliency and readiness for production and real-life system disruptions. It’s good to note that chaos engineering is not just a way to break a system, but rather to experiment with it and obtain new information about it in order to build more resilient systems.

Want to find out more about chaos engineering and how it can benefit you? Get in touch and let’s talk about your project.
