Chaos Engineering is a practice that aims to help us improve our systems by teaching us new things about how they operate. It involves injecting faults into systems (such as high CPU consumption, network latency, or dependency loss), observing how our systems respond, then using that knowledge to make improvements.
To put it simply, Chaos Engineering identifies hidden problems that could arise in production. Identifying these issues beforehand lets us address systemic weaknesses, make our systems fault-tolerant, and prevent outages in production.
Chaos Engineering goes beyond traditional (failure) testing in that it's not only about verifying assumptions. It helps us explore the unpredictable things that could happen, and discover new properties of our inherently chaotic systems.
Chaos Engineering as a discipline was originally formalized by Netflix. They created Chaos Monkey, the first well-known Chaos Engineering tool, which worked by randomly terminating Amazon EC2 instances. Since then, Chaos Engineering has grown to include dozens of tools used by hundreds (if not thousands) of teams around the world.
Chaos Engineering involves running thoughtful, planned experiments that teach us how our systems behave in the face of failure. These experiments follow three steps:
You start by forming a hypothesis about how a system should behave when something goes wrong. For example, if your primary web server fails, can you automatically fail over to a redundant server?
Next, create an experiment to test your hypothesis. Make sure to limit the scope of your experiment (traditionally called the blast radius) to only the systems you want to test. For example, start by testing a single non-production server instead of your entire production deployment.
Finally, run and observe the experiment. Look for both successes and failures. Did your systems respond the way you expected? Did something unexpected happen? When the experiment is over, you’ll have a better understanding of your system's real-world behavior.
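To make these three steps concrete, here is a minimal sketch in Python of what an experiment definition and observation loop might look like. The service URL, blast radius description, and thresholds are hypothetical placeholders, and the fault itself would be injected out-of-band by whatever tooling you use:

```python
# A minimal sketch of the hypothesis -> experiment -> observation loop.
# The health URL and blast radius below are hypothetical placeholders.
import time
import urllib.request
from dataclasses import dataclass

@dataclass
class Experiment:
    hypothesis: str          # what we expect to happen
    blast_radius: str        # which hosts/services are in scope
    health_url: str          # endpoint used to verify the hypothesis
    duration_s: int = 60     # how long to observe while the fault is active

def check_health(url: str, timeout: float = 2.0) -> bool:
    """Return True if the service answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def run(experiment: Experiment) -> None:
    print(f"Hypothesis: {experiment.hypothesis}")
    print(f"Blast radius: {experiment.blast_radius}")
    # The fault is injected out-of-band (e.g., stop the primary web server on
    # the single staging host in scope). This script only observes the result.
    failures = checks = 0
    deadline = time.time() + experiment.duration_s
    while time.time() < deadline:
        checks += 1
        if not check_health(experiment.health_url):
            failures += 1
        time.sleep(5)
    print(f"{failures}/{checks} health checks failed during the fault window")

if __name__ == "__main__":
    run(Experiment(
        hypothesis="If the primary web server fails, traffic fails over to the standby",
        blast_radius="staging-web-01 only",
        health_url="https://staging.example.com/health",  # hypothetical endpoint
    ))
```

In a real experiment, the observation step would come from your monitoring and alerting tools rather than a one-off script, but the structure is the same: state the hypothesis, constrain the blast radius, inject the fault, and measure what actually happens.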
Chaos Engineering is often called “breaking things on purpose,” but the reality is much more nuanced than that. Think of a vaccine or a flu shot, where you inject yourself with a small amount of a potentially harmful foreign body in order to build resistance and prevent illness. Chaos Engineering is a tool we use to build such an immunity in our technical systems. We inject harm (like latency, CPU failure, or network black holes) in order to find and mitigate potential weaknesses.
These experiments also help teams build muscle memory in resolving outages, akin to a fire drill (or changing a flat tire, in the Netflix analogy). By breaking things on purpose, we surface unknown issues that could impact our systems and customers.
According to the 2021 State of Chaos Engineering report, the most common outcomes of Chaos Engineering are increased availability, lower mean time to resolution (MTTR), lower mean time to detection (MTTD), fewer bugs shipped to production, and fewer outages. Teams that frequently run Chaos Engineering experiments are more likely to have >99.9% availability.
Distributed systems are inherently more complex than monolithic systems. It’s hard to predict all the ways they might fail. The eight fallacies of distributed computing, shared by Peter Deutsch and others at Sun Microsystems, describe false assumptions that engineers new to distributed applications invariably make:
Many of these fallacies drive the design of Chaos Engineering experiments such as “packet-loss attacks” and “latency attacks”:
We need to test and prepare for each of these scenarios.
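As an illustration, a simple latency attack can be approximated on a single Linux host with the standard tc/netem traffic-control tool. The sketch below is a rough, hypothetical example (the interface name and delay value are placeholders, and it must run as root); a dedicated Chaos Engineering platform would add safeguards and automatic rollback on top of this:

```python
# A rough sketch of a "latency attack" using Linux tc/netem (requires root).
# IFACE and DELAY are placeholders -- check your interface with `ip link`.
import subprocess
import time

IFACE = "eth0"        # hypothetical interface name
DELAY = "100ms"       # added latency
DURATION_S = 120      # how long to keep the fault active

def add_latency() -> None:
    # Add 100ms of delay to all egress traffic on the interface.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", DELAY],
        check=True,
    )

def remove_latency() -> None:
    # Always roll back, even if observation is interrupted.
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)

if __name__ == "__main__":
    add_latency()
    try:
        # Observe dashboards, timeouts, and retry behavior while latency is injected.
        time.sleep(DURATION_S)
    finally:
        remove_latency()
```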
Regardless of whether your applications live on-premises, in a cloud environment, or somewhere in between in a hybrid setup, you’re likely familiar with the struggles and complexities of scaling environments and applications. All engineers must eventually ask themselves: “Can my application and environment scale? And if we attract the users the business expects, will everything work as designed?”
For decades, enterprises used Performance Engineering to put their systems to the test in preparation for increased demand. Solutions like Micro Focus LoadRunner Professional and open source offerings like JMeter have helped engineers ensure proper performance and scaling of their systems to meet customer (and business) expectations. It’s the engineering team’s responsibility to validate that a system can handle an influx of users for peak events such as Cyber Monday or a big promotional sale.
But often, when we test performance, we test it in a stable environment. These performance tests are usually run under ideal conditions that differ from real-world conditions: there are no service issues, no regional outages, and none of the thousands of other complexities found in on-premises environments, let alone cloud-native ones.
Simply put, scaling is incomplete without coupling it with resilience. It won’t mean much that your systems can scale if they’re offline. The important question then becomes: “I know my application can handle 50k users, but can it handle those 50k users amidst a critical infrastructure outage or the loss of a dependent service?”
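One way to explore that question is to run a modest load test while a fault is active elsewhere in the environment. The sketch below uses only the Python standard library and a hypothetical endpoint; purpose-built tools like JMeter or LoadRunner are far better suited to generating realistic load, but the idea of measuring success rate and latency under concurrent traffic during a fault window is the same:

```python
# A simplified sketch of measuring how a service holds up under load while a
# fault is active elsewhere (e.g., a dependency taken down by an experiment).
# The URL, concurrency, and request count are hypothetical placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://staging.example.com/checkout"  # hypothetical endpoint
CONCURRENCY = 50
REQUESTS = 1000

def timed_request(_: int) -> tuple[bool, float]:
    start = time.time()
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            return resp.status == 200, time.time() - start
    except Exception:
        return False, time.time() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(timed_request, range(REQUESTS)))
    ok = sum(1 for success, _ in results if success)
    latencies = sorted(duration for _, duration in results)
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"success rate: {ok / REQUESTS:.1%}, p95 latency: {p95:.2f}s")
```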
Let’s use a simple analogy: building the world’s tallest building, the Burj Khalifa in Dubai, which stands at a staggering 2,717 feet. We could equate Performance Engineering with the ability to make this the tallest building in the world. But a tall building that nobody can access, or that falls over in high winds, isn’t very impressive.
Reliability and resiliency are just as important as performance. Look at how many other features engineers built into the tower to account for earthquakes, high winds, and failures in other portions of the building.
Chaos Engineering and Performance Engineering are different, but they’re complementary rather than exclusionary. Companies that adopt both can not only scale, but scale in a way that keeps resiliency top of mind.
This dual approach lets engineers reassure the business while providing a great customer experience. The benefits include a reduction in incidents, higher availability numbers, and more robust, scalable systems.
Prior to your first Chaos Engineering experiments, it’s important to collect a specific set of metrics. These metrics include infrastructure monitoring metrics, alerting/on-call metrics, high severity incident (SEV) metrics, and application metrics.
If you don’t collect metrics prior to practicing your Chaos Engineering experiments, you can’t measure whether they’ve succeeded. You also can’t define your success metrics and set goals for your teams.
When you do collect baseline metrics, you can answer questions such as:
To ensure you have collected the most useful metrics for your Chaos Engineering experiments, you need to cover the following:
Infrastructure monitoring software will enable you to measure trends, disks filling up, CPU spikes, I/O spikes, redundancy, and replication lag. You can collect the appropriate monitoring metrics using software such as Datadog, New Relic, and SignalFx.
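As a simple illustration, the sketch below captures a point-in-time baseline for a few of these metrics using the psutil library (pip install psutil). In practice your monitoring agent collects these continuously; this only shows the kinds of numbers worth having on a dashboard before you inject any faults:

```python
# A small sketch of capturing a point-in-time infrastructure baseline with
# psutil. A real deployment would rely on its monitoring agent instead.
import psutil

def baseline() -> dict:
    disk = psutil.disk_usage("/")
    mem = psutil.virtual_memory()
    io = psutil.disk_io_counters()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),  # sampled over 1 second
        "memory_percent": mem.percent,
        "disk_percent": disk.percent,
        "disk_read_bytes": io.read_bytes,
        "disk_write_bytes": io.write_bytes,
    }

if __name__ == "__main__":
    for metric, value in baseline().items():
        print(f"{metric}: {value}")
```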
You should aim to collect the following infrastructure metrics:
Collecting these metrics involves two simple steps:
Alerting and on-call software will enable you to measure total alert counts by service per week, time to resolution for alerts per service, noisy (self-resolving) alerts by service per week, and the top 20 most frequent alerts per week for each service.
These metrics include:
Software applications that collect alerting and on-call metrics include PagerDuty, VictorOps, OpsGenie and Etsy OpsWeekly.
Establishing a High Severity Incident Management (SEV) program will enable you to create SEV levels (e.g. 0, 1, 2, and 3), measure the total count of incidents per week by SEV level, measure the total count of SEVs per week by service, and track the MTTD, MTTR, and mean time between failures (MTBF) for SEVs by service.
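To illustrate the arithmetic, the sketch below computes MTTD, MTTR, and MTBF for one service from a couple of hypothetical incident records. The record format is made up, so adapt it to whatever your incident management tool exports:

```python
# A sketch of computing MTTD, MTTR, and MTBF per service from incident
# records. The records and field names below are hypothetical examples.
from datetime import datetime
from statistics import mean

incidents = [
    {"service": "checkout", "sev": 1,
     "started":  datetime(2023, 3, 1, 10, 0),
     "detected": datetime(2023, 3, 1, 10, 4),
     "resolved": datetime(2023, 3, 1, 11, 30)},
    {"service": "checkout", "sev": 2,
     "started":  datetime(2023, 3, 15, 2, 0),
     "detected": datetime(2023, 3, 15, 2, 20),
     "resolved": datetime(2023, 3, 15, 3, 0)},
]

def metrics_for(service: str) -> dict:
    rows = sorted((i for i in incidents if i["service"] == service),
                  key=lambda i: i["started"])
    # MTTD: average time from incident start to detection.
    mttd = mean((i["detected"] - i["started"]).total_seconds() for i in rows)
    # MTTR: average time from detection to resolution (definitions vary).
    mttr = mean((i["resolved"] - i["detected"]).total_seconds() for i in rows)
    # MTBF: average gap between the end of one incident and the start of the next.
    gaps = [(b["started"] - a["resolved"]).total_seconds()
            for a, b in zip(rows, rows[1:])]
    return {
        "mttd_minutes": mttd / 60,
        "mttr_minutes": mttr / 60,
        "mtbf_hours": (mean(gaps) / 3600) if gaps else None,
    }

if __name__ == "__main__":
    print(metrics_for("checkout"))
```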
Important SEV metrics include:
Observability metrics will enable you to monitor your applications. This is particularly important when you are practicing application-level failure injection. Software that collects application metrics includes Sentry and Honeycomb.
Collecting these metrics involves two steps:
Out-of-the-box SDKs will attempt to hook themselves into your runtime environment or framework to automatically report fatal errors. However, in many situations, it’s useful to manually report errors or messages.
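For example, with the Sentry Python SDK, manual reporting might look like the following sketch. The DSN and the business logic are placeholders; substitute your own project’s DSN and real code paths:

```python
# A brief sketch of manually reporting errors with the Sentry Python SDK
# (pip install sentry-sdk). The DSN below is a placeholder.
import sentry_sdk

sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def risky_operation(account_id: str) -> None:
    # Stand-in for real business logic that might fail during an experiment.
    raise RuntimeError(f"could not transfer funds for account {account_id}")

def transfer_funds(account_id: str) -> None:
    try:
        risky_operation(account_id)
    except Exception as exc:
        # Manually report the exception (with its stack trace) to Sentry.
        sentry_sdk.capture_exception(exc)

if __name__ == "__main__":
    transfer_funds("acct-42")
    # Non-error events can be recorded too, e.g. to mark experiment milestones.
    sentry_sdk.capture_message("Chaos experiment started: latency on checkout")
```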
Ready to demonstrate your understanding of the Chaos Engineering fundamentals? Get your free Gremlin Certified Chaos Engineering Practitioner (GCCEP) certification.
Technology organizations in regulated industries have strict, often complex requirements for availability, data integrity, etc. Chaos Engineering helps ensure that your systems are fault tolerant by letting you test key compliance aspects, such as disaster recovery plans and automatic failover systems.
Use Chaos Engineering to:
Learn more in our article: Using Chaos Engineering to demonstrate regulatory compliance.
Tuning today’s complex applications is becoming increasingly challenging, even for experienced performance engineers. This is due to the huge number of tunable parameters at each layer of the environment. Adding to this complexity, these systems often interact in counterintuitive ways. Likewise, they may behave under special workloads or circumstances in such a way that vendor defaults and best practices become ineffective, or worse, negatively impact resilience.
Chaos Engineering uncovers unexpected problems in these complex systems, verifies that fallback and failover mechanisms work as expected, and teaches engineers how to best maximize resilience to failure.
One of the main use cases for Chaos Engineering is ensuring that your technology systems and environment can withstand turbulent or unfavorable conditions. Failures in critical systems like load balancers, API gateways, and application servers can lead to degraded performance and outages. Running Chaos Engineering experiments validates that your systems and infrastructure are reliable so that developers can feel confident deploying workloads onto them.
Organizations with high availability requirements will often create a disaster recovery (DR) plan. Disaster recovery is the practice of restoring an organization’s IT operations after a disruptive event, like an earthquake, flood, or fire. Organizations develop formal procedures for responding to events (disaster recovery plans, or DRPs) and practice those plans so engineers can respond quickly in case of an actual emergency.
Chaos Engineering lets teams simulate disaster-like conditions so they can test their plans and processes. This gives teams valuable training, builds confidence in the plan, and ensures they can respond to real-world disasters quickly, efficiently, and safely.
Chaos Engineering helps businesses reduce their risk of incidents and outages. Outages can result in lost revenue when customers can’t use the service. Avoiding them also gives businesses a competitive advantage by making availability a key differentiator.
In highly regulated industries like financial services, government, and healthcare, poor reliability can lead to heavy fines. Chaos Engineering helps avoid these fines, as well as the high-profile stories that usually accompany them. Chaos Engineering also helps accelerate other practices designed to identify failure modes, such as failure mode and effects analysis (FMEA). The result is a more competitive, more reliable, less risky business.
Engineers benefit from the technical insights that Chaos Engineering provides. These can lead to reductions in incidents, reduced on-call burden, better understanding of system design and system failure modes, faster mean time to detection (MTTD), and a reduction in high severity (SEV-1) incidents.
Engineers gain confidence in their systems by learning how they can fail and what mechanisms are in place to prevent them from failing. Engineering teams can also use Chaos Engineering to simulate failures and react to those failures as though they were real production incidents (these are called GameDays). This lets teams practice and improve their incident response processes, runbooks, and mean time to recovery (MTTR).
It’s true that Chaos Engineering is another practice for engineers to adopt and learn, which can create resistance. Engineers often need to build a business case for why teams should adopt it. But the results benefit the entire organization, and especially the engineers working on more reliable systems.
All of these improvements to reliability ultimately benefit the customer. Outages are less likely to disrupt customers’ day-to-day lives, which makes them more likely to trust and use the service. Customers benefit from increased reliability, durability, and availability.
Chaos Engineering practices apply to all platforms and cloud providers. At Gremlin, we most often see teams apply Chaos Engineering to AWS, Microsoft Azure, and Kubernetes workloads.
In 2020, AWS added Chaos Engineering to the reliability pillar of the Well-Architected Framework (WAF). This shows how important Chaos Engineering is to cloud reliability. Chaos Engineering helps ensure resilient AWS deployments and continuously validates your implementation of the WAF.
Microsoft Azure is the second largest cloud provider after AWS, and Windows Server remains one of the most widely used server operating systems in the enterprise. Chaos Engineering ensures these systems’ reliability by testing for risks unique to Windows-based environments, such as Windows Server Failover Clustering (WSFC), SQL Server Always On availability groups (AG), and Microsoft Exchange Server back pressure. It also ensures that your Azure workloads are resilient.
Kubernetes is one of the most popular software deployment platforms. But it has a lot of moving parts. For unprepared teams, this complexity can result in unexpected behaviors, application crashes, and cluster failures. Teams using, adopting, or planning a Kubernetes migration can use Chaos Engineering to ensure they’re ready for whatever risks a production Kubernetes deployment can throw at them.
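A very basic Kubernetes experiment, for example, might delete a single pod and verify that the Deployment’s ReplicaSet restores the desired replica count. The sketch below drives kubectl from Python using a hypothetical namespace and label selector; a Chaos Engineering platform layers scheduling, safety checks, and automatic halting on top of this idea:

```python
# A rough sketch of a basic Kubernetes experiment: delete one pod from a
# Deployment and verify the replica count recovers. NAMESPACE and SELECTOR
# are hypothetical placeholders; assumes kubectl is configured on this host.
import json
import subprocess
import time

NAMESPACE = "staging"          # hypothetical namespace
SELECTOR = "app=checkout"      # hypothetical label selector

def get_running_pods() -> list[str]:
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    items = json.loads(out)["items"]
    return [p["metadata"]["name"] for p in items
            if p["status"]["phase"] == "Running"]

if __name__ == "__main__":
    before = get_running_pods()
    victim = before[0]  # assumes at least one running pod matches the selector
    subprocess.run(["kubectl", "delete", "pod", victim, "-n", NAMESPACE], check=True)
    time.sleep(30)  # give the scheduler time to replace the pod
    after = get_running_pods()
    print(f"running pods before: {len(before)}, after: {len(after)}")
    print("hypothesis holds" if len(after) >= len(before) else "hypothesis violated")
```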
If your team is new to Kubernetes, read why, if you’re adopting Kubernetes, you need Chaos Engineering. Or, if your team is already on Kubernetes, learn how to run your first 5 Chaos Engineering experiments on Kubernetes.
When you’re ready to take the next step into adopting Chaos Engineering, there’s a process that will maximize your benefits.
Much like QA testing or performance testing, Chaos Engineering requires you to make an assumption about how your systems work (a hypothesis). From there, you can construct a testing scenario (called an experiment), run it, and observe the outcomes to determine whether your hypothesis was accurate.
When mapped out, the process looks like this:
If the experiment reveals a failure mode, address it and re-run the experiment to confirm your fix. If not, consider scaling up the experiment to a larger blast radius to make sure your systems can withstand a larger scale impact.
If something unexpected happened—like a failure in a seemingly unrelated system—reduce the scope of your experiment to avoid unintended consequences. Repeat this process by coming up with other hypotheses and testing other systems.
As you go, you’ll get better at running Chaos Engineering experiments. Your systems will become more reliable.
Of course, it’s not enough to hand engineers a new tool and a quick-start guide to using it. Introducing any new tool or practice is difficult on its own, not to mention a relatively unfamiliar practice like Chaos Engineering. It’s important for engineers to know the how and the why behind Chaos Engineering so they can be most effective at using it.
You may also need to create incentives to encourage engineers to integrate Chaos Engineering into their day-to-day workflows, respond to pushback, and encourage other teams in the organization to also adopt Chaos Engineering practices. To learn more, we created a comprehensive guide on how to train your engineers in Chaos Engineering.
Now that you understand the “how” of getting started with Chaos Engineering, join our Slack channel to learn from and connect with other engineers using Chaos Engineering. You can also watch our webinar—Five hidden barriers to Chaos Engineering success—to learn how to avoid the most common pitfalls in adopting Chaos Engineering.
For financial services, Chaos Engineering helps:
Technology Business Management (TBM) is a collaborative decision-making framework meant to unite finance, business, and IT in business decisions. It defines ten core tenets for organizations to implement, with the intention of helping the organization drive IT decisions based on overall business needs like reliability and customer satisfaction. The overall goals of TBM are to communicate the value of IT spending to business leaders, and to reduce IT costs without sacrificing vital services.
Chaos Engineering supports TBM goals in several ways: