A Little Chaos Now and Then is the Best Test for Resilience

Resilience is “the capacity to recover quickly from difficulties, or toughness.” With both natural disasters and cyber threats on the rise, today’s businesses must ensure not only their physical resilience but also the resilience of their IT systems, so they can continue to provide a great customer experience.

How do you know if you’re prepared for the worst? It’s all about testing. One method of testing, known as “chaos engineering,” is defined as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

The goal of chaos testing is to expose weaknesses in your systems before they show up as the crash or unavailability of an end-user service. By injecting failures on purpose, a business and its systems become better at handling unforeseen ones.

Typical test plans do not cover a service’s complete failure, or high latency in a service’s response. Couple this with the fact that almost all modern IT systems are highly distributed, and other issues arise, such as cascading failures, that are very hard for a test team to foresee.

Planning your approach to chaos experiments
The initial stages of your foray into introducing failures/chaos into your organization play a vital role in ensuring success. There are a few key areas to consider before starting the journey:

Know your application’s architecture and steady state metrics
Work with non-critical services that have a good steady state defined
Apply either an opt-out or opt-in (less aggressive) model for the delivery teams
Provide ways to evangelize your experiments with teams in QA environments
Have the necessary fallbacks in place (circuit breakers, for example) and verify that they trigger as expected
After the experiment, measure against the known ‘steady state’ and confirm you are improving (for example, aiming for a lower MTTR, or Mean Time to Recover); then run the tests again to measure the change
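
As a rough illustration of the last point, the sketch below compares metrics observed after an experiment with a recorded steady state and computes MTTR for two runs. The metric names, the 10% tolerance and the sample numbers are all assumptions made for the example, not values from any particular monitoring tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """One injected failure (hypothetical record format)."""
    detected_at: float   # seconds since the experiment started
    recovered_at: float  # seconds since the experiment started


def mean_time_to_recover(incidents):
    """Average recovery duration in seconds across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    return mean(i.recovered_at - i.detected_at for i in incidents)


def within_steady_state(observed, baseline, tolerance=0.10):
    """True if every baseline metric is within the relative tolerance."""
    return all(
        abs(observed[name] - value) <= tolerance * abs(value)
        for name, value in baseline.items()
    )


# Illustrative numbers only.
baseline = {"p99_latency_ms": 250.0, "success_rate": 0.999}
observed = {"p99_latency_ms": 265.0, "success_rate": 0.997}

first_run = [Incident(10, 130), Incident(300, 400)]
second_run = [Incident(12, 72), Incident(310, 370)]

print("steady state held:", within_steady_state(observed, baseline))
print("MTTR first run:", mean_time_to_recover(first_run))
print("MTTR second run:", mean_time_to_recover(second_run))
```

In practice the baseline would come from your monitoring system, and a single relative tolerance is a simplification: a success rate usually deserves a much tighter bound than a latency figure.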

Your goal is to slowly move towards automated chaos on the service in question. From here, you can move into more specific experiments. For example, if you are doing failovers, create experiments where a specific business critical platform comes back up with a key piece missing.

Consider a situation where a messaging/streaming platform fails over but with a topic missing, or with just half its intended capacity. Determine whether the system can handle this or whether it fails.
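
A check for this kind of degraded failover might look like the sketch below. The `ClusterState` snapshot, the topic names and the broker counts are hypothetical; a real experiment would populate the snapshot from the platform's own admin tooling.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of a messaging cluster after failover."""
    topics: set
    brokers_up: int


def check_failover(state, required_topics, expected_brokers, min_capacity=1.0):
    """Return a list of findings; an empty list means the failover looks healthy."""
    findings = []
    missing = sorted(required_topics - state.topics)
    if missing:
        findings.append(f"missing topics: {', '.join(missing)}")
    capacity = state.brokers_up / expected_brokers
    if capacity < min_capacity:
        findings.append(f"capacity at {capacity:.0%} of expected")
    return findings


# Illustrative scenario: one topic lost and half the brokers back up.
required = {"loan-applications", "payments", "checkout"}
degraded = ClusterState(topics={"payments", "checkout"}, brokers_up=3)

for finding in check_failover(degraded, required, expected_brokers=6):
    print(finding)
```

Returning findings rather than raising on the first problem lets one experiment report every way the failover fell short.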

You can take this one step further by looking for any cascading impacts your failure might have. In the messaging example, maybe this fails your loan application intake process, your payment processing or your checkout process. None of this can be clearly predicted until the experimentation phase.
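
A dependency map cannot predict how a failure will actually cascade, but it can suggest which downstream services to watch during the experiment. The sketch below walks a hypothetical map breadth-first; the service names and the map itself are assumptions made for illustration.

```python
from collections import deque

# Hypothetical map: service -> services that depend on it directly.
DEPENDENTS = {
    "messaging": ["loan-intake", "payments"],
    "payments": ["checkout"],
    "loan-intake": [],
    "checkout": [],
}


def blast_radius(failed, dependents):
    """Breadth-first walk listing every service downstream of the failed one."""
    seen = {failed}
    order = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                order.append(service)
                queue.append(service)
    return order


print(blast_radius("messaging", DEPENDENTS))
```

The resulting list is a set of candidates to monitor, not a forecast; the experiment itself shows which of them really degrade.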

One key thing to remember: to test for cascading failures and address them successfully in QA, you need representatives from the affected service teams participating in these experiments.

Get your testing regime up and running
A simple way to begin your testing regime is by looking at recent production issues and discovering whether you could have caught any of those problems by experimenting earlier on. Many traditional enterprises have a problem management group that can help to spearhead this discussion, or you can check with your DevOps/Service team(s).

Some IT organizations introduce system degradation using a tool like Chaos Monkey, which was invented by Netflix in 2011 to gauge the resilience of its IT infrastructure.

Remember that your goal is not to cause problems, but to reveal them. Be careful not to overlook the type and amount of traffic being created by your tests. Tools like the Chaos Automation Platform (ChAP, another test bed built within Netflix) provide ways to route a percentage of your internet traffic to the experiment and thereby help ‘increase the safety, cadence, and breadth of experimentation.’
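
One common way to limit an experiment to a small share of traffic is to hash a stable request or user identifier into a bucket. The sketch below illustrates that general idea only; it is not a description of how ChAP works, and the function name and identifier format are assumptions.

```python
import hashlib


def in_experiment(request_id, percent):
    """Deterministically place roughly `percent` of identifiers in the experiment."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100


# Illustrative check: about 1% of synthetic IDs should be selected.
ids = [f"req-{n}" for n in range(10_000)]
share = sum(in_experiment(i, 1.0) for i in ids) / len(ids)
print(f"share routed to experiment: {share:.2%}")
```

Because the assignment depends only on the identifier, the same request or user lands in the same group every time, which keeps the experiment group stable and makes results easier to compare with the control group.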

Resilience maturity
While chaos experiments are very useful, one current limitation is the amount of upfront time involved in meeting and planning with different teams and finding good use cases and faults to inject into services.

The industry and its best practices are maturing as new algorithms are tested to automate the identification of the right services to run experiments on. This can help reduce or eliminate the upfront meeting time and automate the discovery of more critical flaws early on, before they surface as a production issue or customer complaint.

Werner Vogels, Amazon’s CTO, is well known for his quote “everything fails all the time.” This is even more true in the elastic cloud environment, with applications architected on immutable infrastructure. So, the culture of asking “What happens if this fails?” needs to shift to “What happens when this fails?”

Kiran Chitturi
