Failure Flags

Overview

Gremlin Failure Flags lets you run Chaos Engineering experiments and reliability tests on serverless workloads, containers, and similar managed environments. Just like feature flags, Failure Flags let you perform experiments on specific parts of your services and applications with minimal impact to your application code and no performance impact when disabled. Failure Flags are safe to deploy in your application and will default to disabled when you have no actively running experiments.

Use-Cases

Failure Flags is an application-level fault injection tool. Its use cases cover simulating or triggering failures that either have an impact at the application level or target application data. These typically represent the bulk of the issues teams see day to day, such as:

Incorrect or corrupt data
Customer-specific failures
Lock-contention on hot data
Breaking API changes
Unexpected API responses
Partial service failures
Message double-delivery or ordering issues

Beyond testing for these issues, Failure Flags can help you:

Test observability and alarm configuration
Exercise automated recovery systems
Isolate experiments in any environment to well-known users or customers
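
To illustrate how an experiment might be isolated to a well-known customer, here is a minimal sketch of label matching. It assumes that the application attaches key/value labels to each flag evaluation and that an experiment carries a selector listing the allowed values per key. The function name `selector_matches`, the label keys, and the customer IDs are all hypothetical and are not taken from the Gremlin SDKs.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, List[str]], labels: Dict[str, str]) -> bool:
    """Return True only if every selector key appears in labels with an allowed value.

    Hypothetical helper for illustration; not part of any Gremlin SDK.
    An empty selector matches every request.
    """
    return all(labels.get(key) in allowed for key, allowed in selector.items())


# Hypothetical experiment scope: one well-known test customer in one region.
experiment_selector = {
    "customer_id": ["test-customer-42"],
    "region": ["us-west-2"],
}

# Labels the application might attach at the point where the flag is evaluated.
test_request = {"customer_id": "test-customer-42", "region": "us-west-2", "method": "POST"}
real_request = {"customer_id": "real-customer-7", "region": "us-west-2", "method": "POST"}

in_scope = selector_matches(experiment_selector, test_request)
out_of_scope = selector_matches(experiment_selector, real_request)
```

Under these assumptions, only the request labeled with the test customer falls inside the experiment. Extra labels on a request (such as `method`) are ignored unless the selector names them.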

Architecture and Performance Impact

Failure Flags integrates with your applications, so it is critical that you can be confident that adopting it will not adversely affect the availability or performance of those applications outside of experiment parameters. Failure Flags, like other Gremlin products, is designed to fail safely.

Failure Flags is made up of three major components: the Gremlin SaaS API, the Failure Flags Sidecar or Lambda Extension, and one of the SDKs. No impact to your applications is possible unless all three are configured correctly at runtime. Working backwards from your application, all of the following must be true:

The SDK must be integrated with your application and explicitly enabled via environment variable.
The sidecar or extension must be deployed with your application and use a common localhost interface.
The sidecar or extension must be enabled and provided with current credentials to the Gremlin API via environment variables or other configuration options.
The sidecar or extension must have a stable network route to the Gremlin API and be provided with configuration required to traverse corporate proxies.
Your company Gremlin account must have Failure Flags enabled.
Your team must have created and run an experiment.
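
The conditions above can be sketched as a fail-safe lookup. This is only an illustration of the pattern, not the SDK's implementation. The environment variable `EXAMPLE_FLAGS_ENABLED`, the localhost URL, and the function names are placeholders chosen for this sketch; consult the SDK documentation for the real names.

```python
import json
import os
import urllib.request
from typing import Callable, Dict, Optional

# Placeholder names for illustration only; the real SDKs define their own.
ENABLED_VAR = "EXAMPLE_FLAGS_ENABLED"
SIDECAR_URL = "http://localhost:9999/experiment"

Labels = Dict[str, str]


def _ask_sidecar(name: str, labels: Labels) -> Optional[dict]:
    """Ask the local sidecar whether an experiment targets this flag (hypothetical protocol)."""
    body = json.dumps({"name": name, "labels": labels}).encode()
    request = urllib.request.Request(
        SIDECAR_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Short timeout so a slow or absent sidecar cannot stall the application.
    with urllib.request.urlopen(request, timeout=0.5) as response:
        return json.loads(response.read() or b"null")


def active_experiment(
    name: str,
    labels: Labels,
    ask: Callable[[str, Labels], Optional[dict]] = _ask_sidecar,
) -> Optional[dict]:
    """Return the experiment for this flag, or None whenever anything is missing."""
    # 1. The SDK must be explicitly enabled; otherwise skip all work, including I/O.
    if os.environ.get(ENABLED_VAR, "").lower() not in ("1", "true"):
        return None
    # 2. Any sidecar, network, or parsing failure means "no experiment".
    try:
        return ask(name, labels) or None
    except Exception:
        return None
```

In this sketch, every failure path returns `None`, so the calling code proceeds as if no flag were present. The `ask` parameter exists only so the lookup can be replaced in tests without a running sidecar.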

A misconfiguration, an omitted configuration value, or a service outage can only prevent experiments from running, which keeps any adverse impact to your applications to a minimum. Further, the various Failure Flags SDKs are published under the Apache-2.0 license, and you are encouraged to audit those libraries as you see fit. Adopting Failure Flags will in no way lock your applications in to Gremlin.

Takeaways

It is safe to add Failure Flags to your code and leave them there
It is easy to prevent experimentation in any environment
The SDKs are licensed under Apache-2.0
Adding Failure Flags will not create lock-in

Preparing and Next Steps

To prepare for the Failure Flags demo, reach out to your Cloud or Platform Engineering team to gather the following information:

If running in a VPC or private network, does a Lambda need a proxy server to communicate with the Internet (specifically api.gremlin.com)? If so, please provide the proxy address.
Are there firewall rules or network security changes required for the application or service to connect with the Gremlin API (api.gremlin.com)?
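
One way to help answer these questions is a small reachability check run from the same network as the workload. The sketch below uses only the Python standard library. The proxy address in the usage note is a made-up placeholder, and the helper names `build_opener` and `can_reach` are introduced here for illustration.

```python
import urllib.error
import urllib.request
from typing import Optional

GREMLIN_API = "https://api.gremlin.com"


def build_opener(proxy_url: Optional[str] = None) -> urllib.request.OpenerDirector:
    """Build an opener that uses the given proxy, or a direct connection if none is given."""
    # An empty mapping disables proxies entirely, including any set in the environment.
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def can_reach(url: str, opener, timeout: float = 5.0) -> bool:
    """Return True if the host answered at all, even with an HTTP error status."""
    try:
        opener.open(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # The server responded, so the network path (and proxy, if any) works.
        return True
    except OSError:
        # DNS failure, refused connection, timeout, or proxy error.
        return False
```

For example, `can_reach(GREMLIN_API, build_opener())` tests a direct route, while `can_reach(GREMLIN_API, build_opener("http://proxy.example.internal:3128"))` tests the route through a hypothetical proxy. A `False` result from both suggests that firewall or network security changes are needed.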

See the following pages to get started: