What Is Chaos Testing? Guide to Chaos Monkeys in Chaos Engineering

Chaos engineering and chaos testing have become an increasingly popular way to ensure high-quality software while it's already in production. This relatively new strategy has made a positive impact on many companies and changed how teams test software resilience.

Yet where does it fit in a DevOps environment and who should be in charge of this effort? This article will explore these questions further. To get started, let’s first understand the concept of chaos engineering.

What Is Chaos Testing in Chaos Engineering?

First conceived by Netflix, chaos testing is part of the practice of intentionally trying to harm an application in production.

Why do chaos testing? As agile and DevOps practices dominate development, testing becomes more of a challenge because software is delivered more quickly and more frequently. The likelihood of defects manifesting themselves in production becomes higher.

One of the characteristics of high-quality software is resiliency. This means that an application can perform acceptably under adverse circumstances. Adverse circumstances include things going wrong in the production environment that might bring the application down or seriously degrade performance. They can also include defects in the application itself that might crash it or cause it to generate errors.

Chaos engineering was first conceived of by Netflix, which wanted to ensure that its cloud-based streaming services were robust enough to withstand unexpected failures. This could include network segments failing, data centers going down, distributed denial of service (DDoS) attacks, or another type of failure in production.

Netflix defined chaos engineering as an experiment with a null and alternative hypothesis. It starts with the definition of a normal operating state for the application, then postulates what might happen if a specific failure occurred. The experiment usually involves injecting or simulating a failure into the infrastructure to determine how the application responds.
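The experiment loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's implementation: `run_experiment`, `ExperimentResult`, and the toy `FakeService` are hypothetical names invented for this example, and the "failure" is simply removing one healthy replica from an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool
    rolled_back: bool

    @property
    def hypothesis_held(self) -> bool:
        # The hypothesis holds only if the system was healthy before the
        # fault and stayed healthy while the fault was active.
        return self.steady_before and self.steady_during


def run_experiment(
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Verify steady state, inject a failure, verify again, roll back."""
    if not steady_state():
        # Never inject a fault into a system that is already unhealthy.
        return ExperimentResult(False, False, False)
    try:
        inject_failure()
        steady_during = steady_state()
    finally:
        # Always undo the fault, even if the check itself raised.
        rollback()
    return ExperimentResult(True, steady_during, True)


class FakeService:
    """Toy stand-in for a replicated service (hypothetical)."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.healthy = replicas

    def is_serving(self) -> bool:
        # Steady state: at least one replica can answer requests.
        return self.healthy >= 1

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.replicas
```

With two replicas, `run_experiment(svc.is_serving, svc.kill_one, svc.restore)` reports that the hypothesis held; with a single replica it does not, which is exactly the kind of weakness the experiment is meant to expose.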

Netflix built a set of tools, beginning with Chaos Monkey and later expanded into the Simian Army, that test and (in some cases) wreak havoc on production applications. These tools introduce network delays, cause instances or even entire data center segments to go offline, or identify security vulnerabilities. They can also perform health checks on an application and clean up unused system resources.

For those who work specifically with applications, the Apache-licensed Chaos Toolkit simplifies access to chaos engineering concepts. It provides an API that lets the experimentation approach be applied at different levels: infrastructure, platform, and application.
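As an illustration, Chaos Toolkit experiments are declared as JSON (or YAML) documents with a steady-state hypothesis, a method, and rollbacks. The Python sketch below builds such a document and serializes it. The structure reflects the experiment format as commonly documented, but treat it as an assumption and check it against the official reference; the URL and the two shell scripts are hypothetical placeholders.

```python
import json

# Hypothetical experiment: the health URL and both scripts are placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Service stays available when one instance stops",
    "description": "Illustrative example only.",
    "steady-state-hypothesis": {
        "title": "Health endpoint answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-one-instance",
            "provider": {"type": "process", "path": "./stop-one-instance.sh"},
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "restart-instance",
            "provider": {"type": "process", "path": "./start-instance.sh"},
        }
    ],
}


def to_experiment_json(exp: dict) -> str:
    """Serialize the experiment so it can be saved as experiment.json."""
    return json.dumps(exp, indent=2)
```

Saved to a file, a document like this would typically be executed with the toolkit's `chaos run` command.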

Chaos Testing Is a DevOps Practice

Using these chaos monkeys to perform effective chaos engineering typically falls under the control of a DevOps engineer. This person, who sits on the development or QA team, is responsible for defining the scenario, executing the test, and determining and recording the results. They are also responsible for minimizing the customer impact on the production system.

As you might imagine, those goals have the DevOps engineer walking a very fine line. It’s one thing to test the resiliency of a distributed application by trying to make it crash, and another thing to actually have it crash and start affecting customer sessions. Part of the planning for a chaos engineering experiment is knowing when to shut it down if things start going badly.
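One way to make the shutdown decision explicit is to agree on an abort threshold before the experiment starts and enforce it in code. The sketch below is a hypothetical helper, not part of any named tool: `run_with_abort`, the 5% default threshold, and the idea of feeding it error-rate samples are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable


def run_with_abort(
    error_rates: Iterable[float],
    stop_fault: Callable[[], None],
    max_error_rate: float = 0.05,
) -> str:
    """Watch error-rate samples while a fault is active.

    Returns "aborted" as soon as a sample exceeds the threshold,
    otherwise "completed". The fault is stopped in both cases.
    """
    try:
        for rate in error_rates:
            if rate > max_error_rate:
                # Customers are being affected: end the experiment early.
                return "aborted"
        return "completed"
    finally:
        # Stop the fault whether the run finished, aborted, or raised.
        stop_fault()
```

In practice the samples would come from production monitoring, and `stop_fault` would reverse whatever failure was injected.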

Chaos engineering is typically associated with DevOps teams, in part because the typical cloud deployment environment can’t easily be replicated in development and test. In short, teams test resiliency in production because it can’t be realistically tested prior to deployment.

Why Testers Must Own Chaos Engineering

However, chaos engineering is also tied to DevOps because of testing. Because of the automated nature of the DevOps workflows, the vast majority of testing is by necessity automated. From unit testing to smoke testing, DevOps is designed to deliver software without a tester touching the build.

This is why testers have to own chaos engineering. It's certainly not a core testing focus as defined today, and many testers still believe their job is done when an application reaches production. But by contributing to the DevOps toolchain, chaos monkeys meet the need for continuous testing. Chaos engineering is testing by any reasonable definition.

While chaos engineering isn’t yet a mainstream practice, it has an enthusiastic following, especially among companies deploying customer-facing applications using DevOps practices. The practice of chaos engineering enables teams to test their applications when it really counts, in the production environment. While resiliency is only a part of a full testing regime, it takes on increasing importance when deploying to the cloud.

For testers, chaos engineering turns out to be much more interesting and relevant than traditional functional testing. It involves elements of both science and art, in that it requires a specific stimulus applied in an engineering manner to the application, but also an appreciation of how far you can push the application without causing heartburn to the business. Chaos engineering enables testers to expand their skills and add value in determining the quality of an application.
