Managing the Gremlin Agent

The Gremlin Agent is an executable binary installed on a host operating system, container runtime, or Kubernetes cluster. It maintains a heartbeat connection to the Gremlin Control Plane so that Gremlin knows the host is active and able to receive instructions, such as initiating a reliability test or injecting a fault. The Agent requires only an outbound network connection to the Gremlin Control Plane, so you can run it behind a firewall without opening inbound ports. All traffic is encrypted.

Agent lifecycle

When an Agent is installed and authenticated, it appears as "Active" in the Agents list. It also identifies any targets for fault injection, such as hosts or containers.

You can run experiments only on "active" Gremlin Agents. An Agent goes into an "idle" state if the Gremlin Control Plane detects no activity from it for at least 5 minutes. You cannot run or schedule experiments on idle Agents. If Gremlin does not hear from an idle Agent for 24 hours, the Agent is removed from the list. However, if an Agent starts communicating with Gremlin again within the 24-hour idle window, it is reactivated and returned to the "active" state.
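
The state rules above can be sketched as a small shell function. This is only an illustration, not how the Control Plane is implemented: agent_state is a hypothetical helper, and it assumes both thresholds are measured from the Agent's last contact.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an Agent by seconds since its last contact.
# Assumption: the 5-minute and 24-hour thresholds both count from last contact.
agent_state() {
  local elapsed="$1"
  local idle_after=$((5 * 60))          # 5 minutes
  local remove_after=$((24 * 60 * 60))  # 24 hours

  if [ "$elapsed" -lt "$idle_after" ]; then
    echo "active"    # experiments can be run or scheduled
  elif [ "$elapsed" -lt "$remove_after" ]; then
    echo "idle"      # reactivated if the Agent reconnects
  else
    echo "removed"   # dropped from the Agents list
  fi
}
```

For example, agent_state 120 reports an Agent last heard from two minutes ago as active, while agent_state 3600 reports it as idle.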

Logs

Gremlin writes its logs to the /var/log/gremlin directory. Agent logs are in the daemon.log file. Entries in this file may record events where the Gremlin Agent could not communicate with the Control Plane.
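
One way to look for communication problems is to filter daemon.log for likely failure keywords. The sketch below makes assumptions: show_comm_errors is a hypothetical helper, and the search terms are guesses rather than documented Gremlin log messages, so adjust them to match what your log actually contains.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recent suspicious lines from the Agent log.
# Usage: show_comm_errors [LOGFILE] [MAX_LINES]
show_comm_errors() {
  local log_file="${1:-/var/log/gremlin/daemon.log}"
  local max_lines="${2:-20}"

  # The keywords are assumptions, not documented Gremlin messages.
  grep -iE 'error|timeout|refused|unable' "$log_file" | tail -n "$max_lines"
}
```

Running show_comm_errors with no arguments reads the default log path. Reading files under /var/log/gremlin may require elevated privileges on your system.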

Each fault injection performed by the Agent is logged under /var/log/gremlin/executions using its unique experiment execution ID. This is useful for troubleshooting experiments that do not complete.
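
To locate the log for a particular execution, you can search the executions directory by ID. This sketch assumes the execution ID appears somewhere in the log file's name. The exact naming scheme is not specified here, and find_execution_log is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list log files whose names contain an execution ID.
# Usage: find_execution_log EXECUTION_ID [DIRECTORY]
find_execution_log() {
  local execution_id="$1"
  local exec_dir="${2:-/var/log/gremlin/executions}"

  # Assumption: the ID is part of the file name.
  find "$exec_dir" -type f -name "*${execution_id}*"
}
```

The matching file can then be opened with a pager such as less to see why the experiment did not complete.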

Log size

To see how much disk space is being used by logs, run the du utility on the /var/log/gremlin directory:

shell

du -sh /var/log/gremlin
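
If the total is larger than expected, a per-entry breakdown shows where the space is going. The following is a minimal sketch: largest_log_entries is a hypothetical helper, and it assumes the directory is not empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the largest entries in the log directory.
# Usage: largest_log_entries [DIRECTORY] [COUNT]
largest_log_entries() {
  local log_dir="${1:-/var/log/gremlin}"
  local count="${2:-10}"

  # -s summarizes each entry, -k reports sizes in kilobytes;
  # sort -rn puts the largest entries first.
  du -sk "$log_dir"/* | sort -rn | head -n "$count"
}
```

Each output line shows a size in kilobytes followed by the path, so subdirectories such as executions are reported as a single total.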

Bandwidth usage

Idle state

The Gremlin Agent uses very little bandwidth in its idle state. In testing over a 5-minute period, the Agent sent a total of 11.3 KB and received 24.8 KB, an average combined bandwidth of 0.12 KB/s.

Attack state

Overall bandwidth consumption increases slightly during experiments. While an experiment is running, the Agent stays in constant communication with the Control Plane as it checks whether the experiment has been aborted. The bandwidth used is not affected by the type of experiment being run. In testing over a 5-minute period, the Agent sent a total of 112.3 KB and received 114.0 KB, an average combined bandwidth of 0.75 KB/s.
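
The averages quoted in the idle and attack sections are the combined sent and received totals divided by the 300 seconds of the test window. The sketch below reproduces that arithmetic. avg_kbps is a hypothetical helper, not part of Gremlin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average combined bandwidth in KB/s.
# Usage: avg_kbps SENT_KB RECEIVED_KB SECONDS
avg_kbps() {
  awk -v sent="$1" -v recv="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (sent + recv) / secs }'
}

avg_kbps 11.3 24.8 300     # idle state:   prints 0.12
avg_kbps 112.3 114.0 300   # attack state: prints 0.75
```

The same function can be applied to your own measurements to compare them with these figures.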

Process Collection

When Process Collection is enabled, the Gremlin Agent sends additional data, and the bandwidth consumed depends on how many processes are discovered. The data is gzip-compressed to minimize network consumption. To measure the actual bandwidth Gremlin consumes in your installation, we recommend using a tool such as iptraf or nethogs.
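
As one possible way to take that measurement, nethogs can run in trace mode and its output can be filtered down to the Agent's process. This is a sketch under assumptions: gremlin_traffic is a hypothetical helper, and it assumes the Agent's process name contains "gremlin", which you should confirm on your host with a tool such as ps.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the trace lines that mention the Agent.
# Assumption: the Agent's process name contains "gremlin".
gremlin_traffic() {
  grep --line-buffered -i 'gremlin'
}

# Example (requires root and nethogs to be installed):
#   -t  trace mode, prints one line per process on each refresh
#   -d  seconds between refreshes
#   -c  number of refreshes before exiting
#
#   sudo nethogs -t -d 5 -c 12 | gremlin_traffic
```

With these settings nethogs samples for roughly one minute. Each remaining line shows the process followed by its sent and received rates.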