Detected Risks

Detected Risks are high-priority reliability concerns that Gremlin automatically identified in your environment. These risks can include misconfigurations, bad default values, or reliability anti-patterns. Gremlin prioritizes these risks based on severity and impact for each of your services. This gives you near-instantaneous feedback on risks and action items to improve the reliability and stability of your services.

Viewing Detected Risks in Gremlin

Gremlin provides a visual indicator of the number of Detected Risks on the Service Catalog view, as well as on the service details page. Detected Risks are shown in a separate indicator next to the reliability score.

The details page for a service showing a reliability score of 12% and 3 Detected Risks

Click on this indicator to see a list of all potential Detected Risks for your service. Each risk will show one of three statuses:

  • At Risk: This risk is currently present in your systems and hasn't been addressed.
  • Mitigated: This risk has been fixed since it was last detected.
  • N/A: This risk could not be evaluated. A warning tooltip will be shown next to the risk with more details.

Clicking on a risk name provides additional information about the risk, including guidance on how to fix it.

A list of Kubernetes reliability risks automatically detected by Gremlin

Once you've addressed a risk, refresh the page to confirm that it's been mitigated.

Kubernetes Detected Risks

In a Kubernetes environment, Gremlin will detect the following set of risks:

CPU Requests

What is this?

spec.containers[].resources.requests.cpu specifies how much CPU should be reserved for your pod container.

Why is this a risk?

The kubelet reserves at least the requested amount of CPU specifically for that container to use. This protects your node from resource shortages and helps the scheduler place pods on nodes that can accommodate the requested resource amount.

How can I fix this?

Specify an appropriate resource request for your pod container. Think of this as the minimum amount of the resource needed for your application to run.
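
For example, a minimal Pod manifest that reserves a quarter of a CPU core for its container might look like the following sketch; the Pod name, image, and value are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                                    # illustrative Pod name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2    # illustrative image
      resources:
        requests:
          cpu: "250m"                          # reserve 0.25 CPU cores for this container
```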

How does this work?

Gremlin will consider the absence of a container's resource request "at-risk".

Liveness Probe

What is this?

spec.containers[].livenessProbe specifies how the kubelet will decide when to restart your pod container.

Why is this a risk?

The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.

How can I fix this?

Implement a livenessProbe for your pod container, such that it fails when your container needs restarting.
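
As a sketch, an HTTP livenessProbe that restarts the container after three consecutive failed health checks could look like this; the endpoint, port, and timings are illustrative and should be tuned to your application:

```yaml
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2    # illustrative image
      livenessProbe:
        httpGet:
          path: /healthz                       # illustrative health endpoint
          port: 8080
        initialDelaySeconds: 10                # wait before the first probe
        periodSeconds: 10                      # probe every 10 seconds
        failureThreshold: 3                    # restart after 3 consecutive failures
```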

How does this work?

Gremlin will consider the absence of a container's livenessProbe "at-risk".

Availability Zone Redundancy

What is this?

Major cloud providers define a region as a set of failure zones (also called availability zones) that provide a consistent set of features: within a region, each zone offers the same APIs and services.

Why is this a risk?

  • Availability zone redundancy ensures your applications continue running, even in the event of a critical failure within a single zone.
  • Typical cloud architectures aim to minimize the chance that a failure in one zone also impairs services in another zone.

How can I fix this?

  • If you are running in a single availability zone now, you should deploy your service to at least one other zone.
  • For a Kubernetes service, once your cluster spans multiple zones or regions, you can use node labels in conjunction with Pod topology spread constraints to control how Pods are spread across your cluster among fault domains: regions, zones, and even specific nodes. These hints enable the scheduler to place Pods for better expected availability, reducing the risk that a correlated failure affects your whole workload (see the example below).
  • For a Kubernetes service, you can apply node selector constraints to Pods that you create, as well as to Pod templates in workload resources such as Deployment, StatefulSet, or Job.
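
For instance, a topologySpreadConstraints stanza like the following asks the scheduler to spread a workload's Pods evenly across availability zones; the label and image are illustrative:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                # allow at most 1 Pod of difference between zones
      topologyKey: topology.kubernetes.io/zone  # spread across availability zones
      whenUnsatisfiable: DoNotSchedule          # do not schedule Pods that would violate the skew
      labelSelector:
        matchLabels:
          app: web                              # illustrative workload label
  containers:
    - name: app
      image: registry.example.com/app:1.4.2     # illustrative image
```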

How does this work?

  • Gremlin performs zone redundancy analysis similar to how it generates targeting for a zone failure test: it identifies all unique zone tags among the Gremlin agents that are co-located with the given service.
  • A service with one or no values for zone is considered "at-risk".

Memory Request

What is this?

spec.containers[].resources.requests.memory specifies how much memory should be reserved for your Pod container.

Why is this a risk?

The kubelet reserves at least the request amount of that system resource specifically for that container to use. This protects your node from resource shortages and helps to schedule pods on nodes that can accommodate the requested resource amount.

How can I fix this?

Specify an appropriate resource request for your pod container. Think of this as the minimum amount of the resource needed for your application to run.
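
For example, a container spec fragment that reserves 256 MiB of memory might look like this; the names and value are illustrative:

```yaml
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2    # illustrative image
      resources:
        requests:
          memory: "256Mi"                      # reserve 256 MiB of memory for this container
```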

How does this work?

Gremlin will consider the absence of a container's resource request "at-risk".

Memory Limit

What is this?

spec.containers[].resources.limits.memory specifies a maximum amount of memory your Pod container can use.

Why is this a risk?

  • Specifying a memory limit for your pod containers protects the underlying nodes from applications consuming all available memory.
  • The memory limit is enforced on the container's cgroup. If the container tries to allocate more memory than this limit, the Linux kernel's out-of-memory subsystem activates and, typically, intervenes by stopping one of the processes in the container that tried to allocate memory. If that process is the container's PID 1, and the container is marked as restartable, Kubernetes restarts the container.

How can I fix this?

Specify an appropriate memory limit for your Pod container.
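
For example, a container spec fragment with a 512 MiB memory limit might look like this; the names and value are illustrative:

```yaml
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2    # illustrative image
      resources:
        limits:
          memory: "512Mi"                      # processes in the container are OOM-killed if it exceeds this amount
```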

How does this work?

Gremlin will consider the absence of a container's memory limit "at-risk".

Application Version Uniformity

What is this?

This risk checks whether your application is configured to ensure that all of its replicas are running the exact same version.

Why is this a risk?

  • Version uniformity ensures your application behaves consistently across all instances.
  • Image tags such as latest can be easily modified in a registry. As application pods redeploy over time, this can produce a situation where the application is running unexpected code.

How can I fix this?

Specify an image tag other than latest, ideally using the complete sha256 digest, which is unique to the image manifest.
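
For example, pinning the image by digest guarantees that every replica runs exactly the same image; the repository and digest below are placeholders:

```yaml
spec:
  containers:
    - name: app
      # Placeholder digest: pinning by sha256 digest ties the container to one immutable image manifest
      image: registry.example.com/app@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
```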

How does this work?

Gremlin will consider the presence of more than one image version running within your service as "at-risk".

CrashLoopBackOff

What is this?

CrashLoopBackOff is a Kubernetes state that indicates a restart loop is happening in a pod: a container fails to start up properly, crashes, and is then repeatedly restarted by the kubelet, with an increasing back-off delay between attempts.

Why is this a risk?

CrashLoopBackOff is not an error in itself; it indicates that an underlying error is causing the application to crash. A CrashLoopBackOff state also means that a portion of your application fleet is not running, which usually leaves your application fleet in a degraded state.

How can I fix this?

Fixing this issue will depend on identifying and fixing the underlying problem(s).

  • Examine the output or log file for the application to identify any errors that lead to crashes.
  • Use kubectl describe to identify any relevant events or configuration that contributed to crashes.

How does this work?

Gremlin considers a service as "at-risk" when it finds at least one containerStatus in a state of waiting with reason=CrashLoopBackOff.
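
As a rough sketch, the relevant portion of a Pod's status would resemble the following abridged excerpt; the container name, restart count, and message are illustrative:

```yaml
status:
  containerStatuses:
    - name: app
      ready: false
      restartCount: 12                         # the container keeps crashing and being restarted
      state:
        waiting:
          reason: CrashLoopBackOff
          message: back-off 5m0s restarting failed container    # illustrative message
```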

ImagePullBackOff

What is this?

Kubernetes pods sometimes experience issues when trying to pull container images from a container registry. If an error occurs, the pod goes into the ImagePullBackOff state. The ImagePullBackOff error occurs when the image path is incorrect, the network fails, or the kubelet does not succeed in authenticating with the container registry. Kubernetes initially throws the ErrImagePull error, and then after retrying a few times, "pulls back" and schedules another download attempt. For each unsuccessful attempt, the delay increases exponentially, up to a maximum of 5 minutes.

Why is this a risk?

An ImagePullBackOff error means a portion of your application fleet is not running, and cannot download the image required to start running. This usually means your application fleet is in a degraded state.

How can I fix this?

In most cases, restarting the pod and deploying a new version will resolve the problem and keep the application online. Otherwise:

  • Check that your pod specification is using correct values for the image's registry, repository, and tag.
  • Check for network connection issues with the image registry. You can also forcibly recreate the pod to retry an image pull.
  • Verify that your pod specification can properly authenticate to the targeted container registry (see the example below).
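
For the authentication case, one common approach is to reference a registry credential Secret from the Pod spec via imagePullSecrets; the Secret name and image below are placeholders:

```yaml
spec:
  imagePullSecrets:
    - name: my-registry-credentials            # placeholder: a kubernetes.io/dockerconfigjson Secret in the same namespace
  containers:
    - name: app
      image: registry.example.com/app:1.4.2    # illustrative private image
```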

How does this work?

Gremlin considers a service as "at-risk" when it finds at least one containerStatus in a state of waiting with reason=ImagePullBackOff.
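
As a rough sketch, the relevant portion of a Pod's status would resemble the following abridged excerpt; the container name, image, and message are illustrative:

```yaml
status:
  containerStatuses:
    - name: app
      ready: false
      state:
        waiting:
          reason: ImagePullBackOff
          message: 'Back-off pulling image "registry.example.com/app:1.4.2"'    # illustrative message
```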