Adopt Chaos Engineering to Preserve System Resilience

No matter what industry you’re in, your business likely depends on a mission-critical digital component: a simple website, perhaps, or an extensive mobile app. Whatever form they take, these systems are expressions of the business and part of the customer experience.

The same is true for the internal employee experience. Digital tools are a part of a normal work environment, and employees are much less productive when the tools aren’t working.

Test System Resilience With Chaos Engineering

Given how important digital assets are, it is incumbent on IT to maintain hardware and software infrastructure that is reliable and resilient. Otherwise, systems may experience disastrous failures at inopportune times and take digital assets offline. (Are there any opportune times for widespread system failure?) Most modern IT architectures attempt to create systems that embrace resilience and reliability. But do they succeed? There’s only one way to know — test the system.

Chaos engineering aims to do just this. It’s an approach to testing system resilience that deliberately introduces failures, often at random, into a production environment to verify that the failovers and redundancies built into the system work as designed. It starts with the idea of a steady state, defined as the normal behavior of a system. Test engineers then make changes to the system that mimic real-world failures. A failure could be as catastrophic as a major hardware outage or as simple as a container running a microservice shutting down abruptly.

Once the failure is introduced, the outcome is measured immediately and compared to the steady state. If the system does not respond as expected, the “failure” can be reversed and the system brought back to normal.
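
The experiment loop described above (record the steady state, inject a failure, measure, compare, reverse) can be sketched in a few lines of Python. This is a minimal illustration against a simulated, in-memory pool of replicas, not a real chaos tool. The names ServicePool and run_experiment, the request count, and the 0.01 tolerance are all assumptions made for the example.

```python
import random


class ServicePool:
    """Simulated pool of replicas behind a load balancer (hypothetical)."""

    def __init__(self, replicas: int, failover: bool = True) -> None:
        self.healthy = [True] * replicas
        self.failover = failover

    def handle_request(self, rng: random.Random) -> bool:
        """Return True if a request routed to a random replica succeeds."""
        target = rng.randrange(len(self.healthy))
        if self.healthy[target]:
            return True
        # With failover enabled, the request is retried on a healthy replica.
        return self.failover and any(self.healthy)

    def success_rate(self, rng: random.Random, requests: int = 1000) -> float:
        ok = sum(self.handle_request(rng) for _ in range(requests))
        return ok / requests


def run_experiment(pool: ServicePool, tolerance: float = 0.01, seed: int = 0) -> dict:
    rng = random.Random(seed)
    steady = pool.success_rate(rng)        # 1. record the steady state
    victim = rng.randrange(len(pool.healthy))
    pool.healthy[victim] = False           # 2. inject the failure
    try:
        observed = pool.success_rate(rng)  # 3. measure under failure
    finally:
        pool.healthy[victim] = True        # 4. always reverse the failure
    return {
        "steady": steady,
        "observed": observed,
        # 5. compare the observed behavior to the steady state
        "resilient": observed >= steady - tolerance,
    }


if __name__ == "__main__":
    print(run_experiment(ServicePool(replicas=3, failover=True)))
    print(run_experiment(ServicePool(replicas=3, failover=False)))
```

In this toy model, the pool with failover keeps serving every request after one replica is stopped, while the pool without failover drops roughly a third of them and is flagged as not resilient. The try/finally block mirrors the point that the failure should be reversible no matter how the measurement goes.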

Set Loose the Chaos Monkey

One major tenet of chaos engineering is to “minimize the blast radius.” To do that, test engineers need to plan the failure so as to minimize impact on customers or internal company processes. For example, running failure tests on accounting systems at the end of the month is probably not a good idea; doing that would certainly maximize damage.
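
One way to make that planning explicit is to encode it as guard checks that run before any experiment starts. The Python sketch below is only an illustration under stated assumptions: the function names, the three-day month-end blackout, the 10 percent cap on targeted instances, and the "accounting" system label are all hypothetical values chosen for the example, not recommendations from any particular tool.

```python
import calendar
from datetime import date

# Hypothetical set of systems that must not be disturbed at month end.
MONTH_END_SENSITIVE = frozenset({"accounting"})


def in_month_end_blackout(day: date, blackout_days: int = 3) -> bool:
    """True if `day` falls within the last `blackout_days` days of its month."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.day > last_day - blackout_days


def max_targets(total_instances: int, max_percent: int = 10) -> int:
    """Cap how many instances one experiment may touch (never fewer than one)."""
    return max(1, total_instances * max_percent // 100)


def may_run(day: date, system: str) -> bool:
    """Allow the experiment unless the system is inside its blackout window."""
    return not (system in MONTH_END_SENSITIVE and in_month_end_blackout(day))


if __name__ == "__main__":
    today = date.today()
    print("accounting experiment allowed today:", may_run(today, "accounting"))
    print("instances to target out of 40:", max_targets(40))
```

Keeping the limits in code means an experiment scheduled at a bad time, or aimed at too many instances, is refused before it can affect customers.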

One of the pioneers in chaos engineering is Netflix, which applies chaos engineering principles to ensure its infrastructure can support its high-performance, always-on service. As part of that endeavor, it created a software tool called Chaos Monkey. Available under Apache License 2.0, Chaos Monkey randomly terminates virtual machine instances and containers to simulate failures of those common infrastructure components.

Preserve Your System's Integrity

Like CI/CD and infrastructure as code, chaos engineering is an attempt to manage the new levels of scale and complexity in modern IT systems. In a way, it’s similar to using a controlled burn in forest management — creating a small failure before a big failure happens. Better a controlled failure than an unexpected and uncontrolled disaster.

You should consider adopting chaos engineering as a way to preserve your mission-critical systems, especially those that deliver vital digital assets to customers and employees.
