Measuring and improving the reliability of technical systems has always been challenging. As an industry, we've developed several practices to try to address reliability concerns, such as incident response, observability, and Chaos Engineering. These practices have led SREs and service owners to measure reliability in a handful of ways:
- Reductions in high-severity incidents, incident response times, and downtime.
- Fewer bug reports or customer support tickets.
- More successful production deployments.
These are all great outcomes, but they share a common problem: the incidents, bugs, and support tickets that drive them must happen before we can measure them. When the goal of improving reliability is to prevent these problems, measuring success this way becomes counterproductive. How are we supposed to show the value of proactive reliability work when all of our measurements are reactive? This is where Reliability Management helps.
In this blog post, we explain what Reliability Management is, how it helps you improve service reliability, and how it helps you demonstrate the value of your reliability efforts.
What is Reliability Management, and why is it necessary?
Reliability Management is a standards-based approach to baseline, remediate, and automate the reliability of complex, distributed systems. To really understand what it is and the problems it solves, we need to start with a broader question: why is reliability necessary?
The answer is simple. Customers expect services to always be available and performant, especially if those services support critical functions like banking, travel, and healthcare. Customers who experience downtime are also more likely to switch service providers, creating a competitive advantage for companies prioritizing reliability. In addition, companies that demonstrate high reliability spend less time troubleshooting and fixing problems, which lets engineers focus on more desirable work, like building new features and optimizing performance.
Of course, there's a challenge to this. Technical systems are larger than ever, infrastructure is growing more complex by the day, engineers are automating more processes throughout the software lifecycle, and competitors are racing to outperform each other in the marketplace. These factors make testing and improving reliability harder since there are so many moving parts to consider.
Earlier, we mentioned practices such as incident response, observability, and Chaos Engineering. Along with Site Reliability Engineering, they emerged to address reliability challenges. While these approaches have their benefits, none of them provides a comprehensive solution:
- Site Reliability Engineering puts dedicated engineering time towards reliability, but SREs are difficult to hire and expensive to scale.
- Observability and Incident Response aim to minimize the time from when a problem is detected to when it's resolved, but they're both reactive approaches that only kick in after an incident has already happened.
- Chaos Engineering takes a more proactive approach by letting you simulate failure modes before they happen, but it's hard for teams to learn and adopt and even harder to measure success.
Reliability Management addresses these shortcomings by providing a standards-based approach to measuring and improving service reliability before incidents occur. With ready-made reliability tests and an objective scoring system to proactively measure reliability, teams can systematically baseline and remediate reliability risks organization-wide.
What is a Reliability Management Platform?
A Reliability Management Platform is a solution that lets teams implement Reliability Management across their organization in a guided and automated way. Reliability Management Platforms, such as Gremlin, consistently test and measure reliability risks in the background throughout the software development lifecycle, freeing up valuable resources for higher-value work. They enable teams to run reliability tests on their services, integrate with their observability tools, generate scores, and automate reliability testing. They also provide a central location for storing reliability scores across the organization so teams can easily compare scores, identify unreliable services, and monitor actively running tests.
How does a Reliability Management Platform work?
Reliability Management can be easy to implement, especially when using a solution like Gremlin. Our Reliability Management Platform provides pre-built reliability tests that you can run on any of your services with the click of a button. It only takes three steps:
- Baseline your services' reliability by adding one or more of them to Gremlin, connecting your Golden Signal monitors, and running automated tests against reliability best practices and known risks.
- Remediate reliability issues by identifying and prioritizing high-risk services with automatically generated reliability scores.
- Automate reliability testing on a continuous basis to maintain reliability standards and identify new risks for every service in your organization, without manual effort.
By running our suite of reliability tests on your service, you gain a standardized measurement of that service's reliability. You can easily see which areas of your service are resilient to disruptions and which areas need additional work.
For example, the service in the following screenshot can scale reliably if CPU consumption is too high. However, the Host Redundancy test failed, indicating that the service has no redundancy or replication. The PostgreSQL Failure Test in the Dependencies section also failed, which means that the service is tightly coupled to its database and has no way of recovering if the database goes down.
Systems and services change over time, so like traditional QA tests, reliability tests need to run regularly. Gremlin includes an auto-scheduling service that runs the full suite of tests on a service once a week. You can monitor week-over-week trends to track improvements (or regressions) in your services and ensure your efforts are paying off. You can also use the Gremlin REST API to integrate reliability scores with external tools, like CI/CD platforms. This lets you create more advanced workflows, such as notifying a service owner or blocking a production deployment if a service's reliability score falls below a certain threshold. You can even link reliability scores to specific builds to trace a decrease in a reliability score back to its cause.
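To make the threshold workflow more concrete, here is a minimal sketch in Python of the decision a CI/CD step might make once it has a service's reliability score. It deliberately leaves out the call that retrieves the score, since that depends on the Gremlin REST API and your pipeline. Everything named here (GateDecision, evaluate_gate, the 0 to 100 score range, and the rule that a week-over-week drop triggers a notification) is an assumption made for illustration, not part of Gremlin's product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateDecision:
    """Outcome of a hypothetical reliability gate in a deployment pipeline."""
    allow_deploy: bool
    notify_owner: bool
    reason: str


def evaluate_gate(score: float, threshold: float,
                  previous_score: Optional[float] = None) -> GateDecision:
    """Decide whether to deploy, based on a reliability score.

    Assumptions (not taken from any real API): scores range from 0 to 100,
    a score below the threshold blocks the deployment, and a drop compared
    with the previous score notifies the service owner without blocking.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score < threshold:
        return GateDecision(False, True,
                            f"score {score} is below threshold {threshold}")
    if previous_score is not None and score < previous_score:
        return GateDecision(True, True,
                            f"score dropped from {previous_score} to {score}")
    return GateDecision(True, False,
                        f"score {score} meets threshold {threshold}")
```

A pipeline step could call evaluate_gate with the latest score and last week's score, fail the job when allow_deploy is False, and message the service owner when notify_owner is True. Keeping the decision in a pure function makes the policy easy to unit test separately from whatever code fetches the score.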
Who benefits from Reliability Management?
Reliability Management lets everyone in the organization measure reliability proactively, identify areas of risk, and test their services in a standardized way. Ultimately, this results in a stronger reliability posture, faster release cycles, and improved customer experiences.
Within the organization, Reliability Management uniquely benefits a few specific groups:
- IT executives can use scores to gauge the organization's reliability posture at a glance, identify and validate areas of risk, and maintain standards across teams. Scoring can be used to encourage, measure, and recognize the right reliability practices.
- SRE and DevOps teams can use Reliability Management to automate a reliability test suite, measure reliability across the entire organization, and provide tools to application and service owners to proactively improve their reliability. These teams can show progress toward more reliable systems without waiting for incidents, and free up time for higher-value work.
- Application and service owners can quickly test their services to ensure they meet reliability targets, rather than being surprised in production, and maintain reliability without involving an SRE team.
How to get started with Reliability Management
If you'd like to learn more about the Gremlin Reliability Management Platform, visit gremlin.com/demo to request a free demo.