Experiments

An experiment is a method of injecting failure into a system in a simple, safe, and secure way. Gremlin provides a range of experiments that you can run against your infrastructure. These include impacting system resources, delaying or dropping network traffic, shutting down hosts, and more. In addition to running one-time experiments, you can also schedule regular or recurring experiments, create experiment templates, and view experiment reports.

Gremlin provides three categories of experiments:

  • Resource experiments: test against sudden changes in consumption of computing resources
  • Network experiments: test against unreliable network conditions
  • State experiments: test against unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes

Each experiment tests your resilience in a different way.

Resource Experiments

Resource experiments are a great starting point because they are simple to run and understand. They reveal how your service degrades when starved of CPU, memory, IO, or disk space.

| Experiment | Impact |
| --- | --- |
| CPU | Generates high load for one or more CPU cores. |
| Memory | Allocates a specific amount of RAM. |
| IO | Puts read/write pressure on I/O devices such as hard disks. |
| Disk | Writes files to disk to fill it to a specific percentage. |

State Experiments

State experiments modify the state of a target so you can test auto-correction and similar fault-tolerant mechanisms.

| Experiment | Impact |
| --- | --- |
| Shutdown | Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines. |
| Time Travel | Changes the host's system time, which can be used to simulate adjusting to daylight saving time and other time-related events. |
| Process Killer | Kills the specified process, which can be used to simulate application or dependency crashes. Note: Process experiments do not work for Process ID 1; consider a Shutdown experiment instead. |

Network Experiments

Network experiments test the impact of lost or delayed traffic to a target. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.

| Experiment | Impact |
| --- | --- |
| Blackhole | Drops all matching network traffic. |
| Certificate Expiry | Checks for expiring security certificates. |
| Latency | Injects latency into all matching egress network traffic. |
| Packet Loss | Induces packet loss into all matching egress network traffic. |
| DNS | Blocks access to DNS servers. |

Warning: Important considerations for targeting Kubernetes Pods with Network experiments

Network host tags

You can use tags to target IP addresses where traffic should be impacted during network experiments. This is important for today's ephemeral environments where hosts live for a short time and have dynamic IP addresses. As custom tags are used to indicate where an experiment should run, the same tags can be used to indicate the hosts to which network traffic should be impacted. For example, to test latency between serviceA and serviceB, select all clients with the tag service:serviceA when choosing the Hosts to target, and select the tag service:serviceB when configuring the Network experiment. IP addresses assigned to the network interface by the container runtime are also automatically included.

Network providers

Limit the impact of a network experiment to specific external service providers. Select one or many services and their associated region to impact. Gremlin currently supports AWS, Azure, and Datadog services. The destination network configuration is automatically updated daily using these sources: AWS discovery service, Azure service tags.

Network device selection

All network experiments accept a --device argument that refers to the network interfaces to target. Starting with Linux agent version 2.30.0 / Windows agent version 1.9.0, you can specify one or more network interfaces using either a comma-separated list or with multiple --device arguments.

When unspecified, Gremlin targets all physical network interfaces as reported by the operating system. For virtual or cloud machines, this typically includes the expected network interfaces, such as eth0 and eth1 on Linux and Ethernet on Windows.
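
As an illustration of the two forms described above, the following sketch prints a latency experiment command that targets eth0 and eth1, first as a comma-separated list and then with repeated --device arguments. The interface names are examples, and `build_device_args` is a hypothetical helper written for this sketch, not part of the Gremlin CLI. The commands are only printed, not run.

```shell
#!/bin/sh
# build_device_args is a hypothetical helper (not part of the Gremlin CLI):
# it turns a list of interface names into repeated --device arguments.
build_device_args() {
  args=""
  for dev in "$@"; do
    args="$args --device $dev"
  done
  # Strip the leading space left by the first iteration.
  printf '%s\n' "${args# }"
}

# Dry run: print both forms instead of executing them.
# Form 1: comma-separated list of interfaces.
echo "gremlin attack latency --device eth0,eth1"
# Form 2: one --device argument per interface.
echo "gremlin attack latency $(build_device_args eth0 eth1)"
```

To run either command for real, execute the printed line on a host with a recent enough agent (Linux 2.30.0 / Windows 1.9.0 or later).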

Device discovery on older agents

Agents older than Linux version 2.30.0 / Windows version 1.9.0 use a different strategy, described here. On these agents, the --device argument refers to a single network interface, and Gremlin network experiments target only one network interface at a time. When unspecified, Gremlin chooses an interface according to the following order of operations:

  • Gremlin omits all loopback devices (determined by [RFC1122]).
  • Gremlin selects the device with the lowest interface index that starts with eth, en, or for Windows Ethernet.
  • If nothing was found, Gremlin selects the device with the lowest interface index that is non-private (according to [RFC1918]).
  • If nothing was found, Gremlin selects the device with the lowest interface index.
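
The order of operations above can be sketched in shell. This is only an approximation under stated assumptions, not the agent's actual code: `select_device` is a hypothetical helper, the input format (one `index name ipv4-address` line per interface, names without spaces) is invented for the example, and loopback is detected by the 127.0.0.0/8 address range.

```shell
#!/bin/sh
# select_device is a hypothetical helper, not part of Gremlin.
# It reads "index name ipv4" lines on stdin and prints the chosen interface name.
select_device() {
  # Step 1: drop loopback devices (127.0.0.0/8), then order by interface index.
  candidates=$(awk 'NF && $3 !~ /^127\./' | sort -n -k1,1)

  # Step 2: lowest index whose name starts with eth, en, or Ethernet (Windows).
  pick=$(printf '%s\n' "$candidates" |
    awk 'NF && $2 ~ /^(eth|en|Ethernet)/ { print $2; exit }')

  # Step 3: otherwise, lowest index whose address is outside the RFC1918 ranges.
  if [ -z "$pick" ]; then
    pick=$(printf '%s\n' "$candidates" |
      awk 'NF && $3 !~ /^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)/ { print $2; exit }')
  fi

  # Step 4: otherwise, the remaining device with the lowest index.
  if [ -z "$pick" ]; then
    pick=$(printf '%s\n' "$candidates" | awk 'NF { print $2; exit }')
  fi

  if [ -n "$pick" ]; then
    printf '%s\n' "$pick"
  else
    return 1
  fi
}

# Example: eth0 has the lowest index among the eth*/en* devices.
printf '1 lo 127.0.0.1\n3 eth1 10.0.0.6\n2 eth0 10.0.0.5\n' | select_device  # prints eth0
```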

Experiment stage progression

Every experiment in Gremlin is composed of one or more Executions, where each Execution is an instance of the experiment running on a specific target.

The Stage progression of an experiment is derived from the Stage progression of all of an experiment's Executions. Gremlin weighs the importance of Stages to mark an experiment with the most important Stage of its executions.

Example

An experiment with three Executions will derive its final reported stage by picking the most important stage from among its executions. So, if the three Execution Stages are TargetNotFound, Running, TargetNotFound, the resulting stage for the experiment will be Running.

You can see Stages ordered by their importance in the following section.

Stages

Stages are listed in descending order of importance; the Running Stage holds the highest importance.

| Stage | Description |
| --- | --- |
| Running | Experiment running on the host |
| Halt | Experiment told to halt |
| RollbackStarted | Code to roll back has started |
| RollbackTriggered | Daemon started a rollback of client |
| InterruptTriggered | Daemon issued an interrupt to the client |
| HaltDistributed | Distributed to the host but not yet halted |
| Initializing | Experiment is creating the desired impact |
| Distributed | Distributed to the host but not yet running |
| Pending | Created but not yet distributed |
| Failed | Client reported unexpected failure |
| HaltFailed | Halt on client did not complete |
| InitializationFailed | Creating the impact failed |
| LostCommunication | Client never reported finishing/receiving execution |
| ClientAborted | Something on the client/daemon side stopped the Gremlin and it was aborted without user intervention |
| UserHalted | User issued a halt, and that is now complete |
| Successful | Completed running on the Host |
| TargetNotFound | Experiment not scoped to any current targets |
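
The way an experiment's Stage is derived from its Executions can be sketched in shell. This is a minimal illustration, not Gremlin's implementation: `experiment_stage` and `STAGES_BY_IMPORTANCE` are hypothetical names, and the ordering is copied from the Stages table.

```shell
#!/bin/sh
# Stage names in descending order of importance, as listed in the Stages table.
STAGES_BY_IMPORTANCE="Running Halt RollbackStarted RollbackTriggered \
InterruptTriggered HaltDistributed Initializing Distributed Pending Failed \
HaltFailed InitializationFailed LostCommunication ClientAborted UserHalted \
Successful TargetNotFound"

# experiment_stage is a hypothetical helper: given the Stage of each Execution
# as arguments, it prints the most important one.
experiment_stage() {
  for stage in $STAGES_BY_IMPORTANCE; do
    for execution_stage in "$@"; do
      if [ "$execution_stage" = "$stage" ]; then
        printf '%s\n' "$stage"
        return 0
      fi
    done
  done
  # None of the arguments was a known Stage name.
  return 1
}

# The three-Execution example from the text above.
experiment_stage TargetNotFound Running TargetNotFound  # prints Running
```

Walking the list from most to least important and returning on the first match keeps the sketch short; the real service may compute the result differently.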

Scheduling experiments

Experiments can be run ad-hoc or scheduled, from the Web App or programmatically. You can schedule experiments to execute on certain days and within a specified time window. You can also set the maximum number of experiments a schedule can generate.

Running experiments on Kubernetes objects

Gremlin allows targeting objects within your Kubernetes clusters. After selecting a cluster, you can filter the visible set of objects by selecting a namespace. Select any of your Deployments, ReplicaSets, StatefulSets, DaemonSets, or Pods. When one object is selected, all child objects will also be targeted. For example, when selecting a DaemonSet, all of the pods within will be selected.

Selecting containers

For State and Resource experiment types, you can target all, any, or specific containers within a selected pod. Once you select your targets, these options will be available under Choose a Gremlin on the Experiment page. Selecting Any will target a single container within each pod at runtime. If you've selected more than one target (for example, Deployment), you can select from a list of common containers across all of these targets. When you run the experiment, the underlying containers within the objects selected will be impacted.

Figure: Any, all, or specific options for container experiments

Monitoring experiments in real time

You can observe your environment in real time in Gremlin during CPU or Shutdown experiments to quickly verify their effect. For CPU experiments, you can see CPU load statistics; for Shutdown experiments, you can see machine uptime.

Figure: Monitor CPU experiments in real time

Enabling Experiment Visualizations

Company Admins and Owners can turn this feature on for their company by visiting the Company Settings, clicking Settings, and toggling Experiment Visualizations on. Only data relevant to the experiment is collected and no data is collected when experiments are not running.

Overriding Experiment Visualizations for a host

To prevent any host from sending metrics to populate experiment visualization charts, add PUSH_METRICS="0" to the configuration for 'gremlind' on that host. This will override the company preference and will prevent that particular host from sending metrics.

Parameter reference

For details on parameters supplied to individual experiments, check out the links to the individual experiment pages at the beginning of this page.

Include new targets in ongoing experiments

When selecting targets by tag, you have the option to check the Include New Targets checkbox. When checked, if Gremlin detects a new target that meets the experiment's selection criteria, it will distribute the experiment to the target. By default, new targets will not run the experiment even if they match the selection criteria.

For example, imagine you select all EC2 hosts in the AWS us-east-1 region for a CPU experiment. When you run the experiment, AWS detects the increased CPU usage and automatically provisions a new EC2 instance and installs the Gremlin agent. If Include New Targets is checked, Gremlin will add this new instance to the ongoing CPU experiment.

Multiple values

Port and address options can be used multiple times in a single command.

bash
# Run a latency experiment on both DynamoDB and database.mydomain.org
gremlin attack latency -h dynamodb.us-west-1.amazonaws.com -h database.mydomain.org

Alternatively, use a comma (,) to separate multiple values in a single option.

bash
gremlin attack latency -p 8080,443

Exclude rules

Prefix a port or address with a caret (^) to exclude it from the set of impacted network targets.

bash
# Slow down all ports except DNS port
gremlin attack latency -p ^53

This can be particularly useful for excluding a specific IP from a range that is otherwise impacted by the experiment.

bash
# Blackhole all hosts in 10.0.0.0/24 except for 10.0.0.11
gremlin attack blackhole -i 10.0.0.0/24 -i ^10.0.0.11