Change failure insights

Learn how to set up and measure the quality metrics in DORA: Change Failure Rate and Mean Time To Recovery.

Teams measure Change Failure Rate to maintain engineering quality while accelerating their engineering performance. Change Failure Rate communicates the prevalence of issues or defects when deploying a change to production. It's defined as the ratio of failures to the total number of deployments.

In reality, the definitions have more nuance: there are different types of failures (e.g. downtime, broken features, service degradation), each with its own impact.

Mean Time To Recovery (MTTR) is the average time it takes to address change failures. It helps teams understand how quickly they're able to resolve issues.

Definitions

Change Failure Rate

Swarmia uses deployments as the basis for change failures. A deploy that fixes another deploy (e.g. a patch, hotfix, rollback, or forward fix) marks the original deploy as a failure.

We count the number of such failed deployments and calculate the Change Failure Rate by dividing it by the total number of deployments.
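As an illustration of the calculation above, here is a minimal sketch in Python. The `Deployment` class and its `fixes_version` field are hypothetical names chosen for this example, not Swarmia's internal data model, and the sketch assumes each failed deployment is counted once even if several fixes point at it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    version: str
    # Hypothetical field: the version this deployment fixes, if any.
    fixes_version: Optional[str] = None


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Return failed deployments divided by total deployments."""
    if not deployments:
        return 0.0
    known_versions = {d.version for d in deployments}
    # A deployment counts as failed when a later deployment is marked as its fix.
    # Using a set means a failure fixed twice is still counted once.
    failed_versions = {
        d.fixes_version for d in deployments if d.fixes_version in known_versions
    }
    return len(failed_versions) / len(deployments)


deployments = [
    Deployment("v1"),
    Deployment("v2"),
    Deployment("v3", fixes_version="v2"),  # v3 fixes v2, so v2 is a failure
    Deployment("v4"),
]
print(change_failure_rate(deployments))  # 1 failed out of 4 -> 0.25
```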

Mean Time To Recovery

Time To Recovery (TTR) can be determined for each failure as the time between the original deploy and the deploy that fixes the problem. TTR can be used to understand the impact of each change failure (how long the problem lasted, and what its impact on the customer was). Mean Time To Recovery is the average of these TTR values across failures.
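The relationship between TTR and MTTR can be sketched as follows. This is a simplified example with made-up timestamps; the function name and the shape of the input (pairs of original and fix deployment times) are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def mean_time_to_recovery(
    failures: list[tuple[datetime, datetime]],
) -> Optional[timedelta]:
    """Average the TTR over (original_deploy_time, fix_deploy_time) pairs."""
    # TTR for one failure is the time from the original deploy to its fix.
    ttrs = [fixed_at - deployed_at for deployed_at, fixed_at in failures]
    if not ttrs:
        return None  # no failures, so there is nothing to average
    return sum(ttrs, timedelta()) / len(ttrs)


failures = [
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 30)),  # TTR: 30 min
    (datetime(2024, 1, 9, 12, 0), datetime(2024, 1, 9, 13, 30)),  # TTR: 90 min
]
print(mean_time_to_recovery(failures))  # 1:00:00
```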

Sending deployment data

Swarmia ingests deployment data via the Deployments API.

Automating fix deployments

You can use the Deployments API to mark a deployment as a fix for an earlier deployment. By integrating this into your hotfix or rollback process, for example, you can maximize the quality of change failure data and get reliable insights into the quality of your engineering process.
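A hotfix or rollback script could build the deployment payload along these lines. This is only a sketch: the field names (`version`, `appName`, `environment`, `deployedAt`, `fixesVersion`) are assumptions made for illustration, so check them, along with the endpoint and authentication, against the Deployments API reference before relying on them.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_deployment_payload(
    version: str,
    app_name: str,
    environment: str = "production",
    fixes_version: Optional[str] = None,
    deployed_at: Optional[datetime] = None,
) -> dict:
    """Build a deployment payload; all field names here are assumed."""
    payload = {
        "version": version,
        "appName": app_name,
        "environment": environment,
        "deployedAt": (deployed_at or datetime.now(timezone.utc)).isoformat(),
    }
    # Only fix deployments reference an earlier version (assumed field name).
    if fixes_version is not None:
        payload["fixesVersion"] = fixes_version
    return payload


# Example: a hotfix "v3" that fixes the earlier deployment "v2".
payload = build_deployment_payload("v3", "checkout", fixes_version="v2")
print(json.dumps(payload, indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to reuse the same function in both the regular deploy pipeline and the hotfix or rollback path.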

It's also possible to measure failures in non-production environments as a proxy for the extra engineering time spent on getting something to work. We recommend starting with production failures at a minimum.

Manually marking deployments as fixes

Alternatively, you can use the deployment table in Change Failure Insights to mark a deployment as a fix by hand. We fetch previous deploys for quick access, or you can search by deploy version.

Analyzing change failures

Once you're sending deployment data to Swarmia, you can go to Change Failure Insights to analyze the deployment quality of different apps or environments.
