You can't improve what you don't measure. But measuring the wrong things, or measuring the right things badly, can be worse than not measuring at all. This guide covers the incident management metrics that actually matter, how to measure them correctly, and how to use them to drive real improvement.
The Core Metrics
MTTR: Mean Time to Recovery
MTTR is the average time from when an incident is detected to when service is restored. It's the most commonly tracked incident metric, and for good reason: it directly measures how long your users suffer.
MTTR Calculation
MTTR = Total Recovery Time / Number of Incidents
Example: If you had 10 incidents last month with a combined recovery time of 500 minutes, your MTTR is 50 minutes.
A common mistake is conflating MTTR with resolution time. Recovery means the service is working again. Resolution means the underlying issue is fixed. You might recover in 5 minutes by restarting a server, but not resolve the root cause for days.
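In code, this is just an average over recovery durations. A minimal sketch, assuming you record recovery time (not resolution time) per incident in minutes; the numbers below are illustrative and reproduce the example above:

```python
# Hypothetical recovery durations in minutes for last month's 10 incidents.
recovery_minutes = [50, 35, 120, 15, 45, 70, 25, 60, 40, 40]

# MTTR = total recovery time / number of incidents
mttr = sum(recovery_minutes) / len(recovery_minutes)
print(f"MTTR: {mttr:.0f} minutes")   # 50 minutes
```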
Breaking Down MTTR
MTTR can be decomposed into sub-metrics that reveal where your response process needs work:
- MTTD (Mean Time to Detect): How long before you know there's a problem?
- MTTA (Mean Time to Acknowledge): How long before someone starts working on it?
- MTTM (Mean Time to Mitigate): How long to stop the bleeding?
If your MTTD is high, invest in better monitoring and alerting. If MTTA is high, look at your on-call process. If the gap between MTTA and recovery is high, focus on runbooks and automation.
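The decomposition falls out directly if you record the intermediate timestamps. A sketch assuming hypothetical field names (started, detected, acknowledged, recovered); substitute whatever your incident tool actually stores:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident timelines; field names are illustrative only.
incidents = [
    {
        "started":      datetime(2024, 5, 1, 8, 50),
        "detected":     datetime(2024, 5, 1, 9, 0),
        "acknowledged": datetime(2024, 5, 1, 9, 8),
        "recovered":    datetime(2024, 5, 1, 9, 45),
    },
    {
        "started":      datetime(2024, 5, 3, 14, 0),
        "detected":     datetime(2024, 5, 3, 14, 10),
        "acknowledged": datetime(2024, 5, 3, 14, 12),
        "recovered":    datetime(2024, 5, 3, 15, 25),
    },
]

def mean_gap(incidents, start_key, end_key):
    """Average time between two timeline events, in minutes."""
    return mean((i[end_key] - i[start_key]) / timedelta(minutes=1) for i in incidents)

print(f"MTTD: {mean_gap(incidents, 'started', 'detected'):.1f} min")
print(f"MTTA: {mean_gap(incidents, 'detected', 'acknowledged'):.1f} min")
print(f"MTTR: {mean_gap(incidents, 'detected', 'recovered'):.1f} min")
```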
MTBF: Mean Time Between Failures
MTBF measures reliability by tracking how long your systems run without incidents. Higher is better. Unlike MTTR, which measures response capability, MTBF measures prevention capability.
MTBF Calculation
MTBF = Total Uptime / Number of Failures
Example: If your service ran for 720 hours (30 days) and had 3 failures, your MTBF is 240 hours.
MTBF is most useful when tracked by service or component. An overall MTBF hides problem areas. You want to know which services are fragile so you can prioritize reliability investment.
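Per-service MTBF takes only an observation window and a failure count per service. A sketch with made-up numbers, using the window length as a stand-in for uptime (a simplification that slightly overstates MTBF when downtime is significant):

```python
# Hypothetical failure counts over a 30-day (720-hour) window.
observation_hours = 720
failures_by_service = {"checkout": 3, "search": 1, "auth": 6}

# MTBF = total uptime / number of failures (window used as an uptime proxy here).
for service, failures in sorted(failures_by_service.items(),
                                key=lambda kv: kv[1], reverse=True):
    mtbf = observation_hours / failures if failures else float("inf")
    print(f"{service:10s} MTBF: {mtbf:6.1f} hours")
```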
Incident Frequency
Simply counting incidents over time. Track this by severity, service, team, and root cause category. Trends matter more than absolute numbers. Is incident frequency increasing or decreasing? Are certain services or teams seeing more incidents than others?
Change Failure Rate
What percentage of deployments cause incidents? This is a key DevOps metric from the DORA (DevOps Research and Assessment) framework. High-performing teams have change failure rates below 15%.
Change Failure Rate Calculation
CFR = (Incidents Caused by Changes / Total Changes) x 100
Example: If you deployed 100 times and 8 deployments caused incidents, your CFR is 8%.
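If incidents carry a reference to the change that triggered them, CFR is a simple lookup. A sketch with hypothetical deployment IDs, reproducing the example above:

```python
# Hypothetical data: 100 deployment IDs and the subset that caused incidents.
deployments = [f"deploy-{n}" for n in range(1, 101)]
incident_causing = {"deploy-7", "deploy-23", "deploy-41", "deploy-42",
                    "deploy-58", "deploy-66", "deploy-83", "deploy-97"}

# CFR = (incidents caused by changes / total changes) x 100
cfr = len(incident_causing & set(deployments)) / len(deployments) * 100
print(f"Change failure rate: {cfr:.1f}%")   # 8.0%
```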
Advanced Metrics
Incident Recurrence Rate
What percentage of incidents are repeats of previous incidents? This directly measures the effectiveness of your root cause analysis and action follow-through. If you're seeing the same problems repeatedly, your RCA process isn't working.
Industry Benchmark
Research suggests that 84% of IT system failures are repeat incidents. If you can reduce your recurrence rate below the industry average, you're saving significant engineering time and customer impact.
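One simple way to approximate recurrence is to treat incidents that share a fingerprint (for example, service plus root cause category) as repeats. A sketch assuming those two fields exist; real matching is usually fuzzier than this:

```python
from collections import Counter

# Hypothetical closed incidents with a service and root cause category.
incidents = [
    {"service": "checkout", "root_cause": "config-error"},
    {"service": "checkout", "root_cause": "config-error"},
    {"service": "search",   "root_cause": "capacity"},
    {"service": "auth",     "root_cause": "dependency-failure"},
    {"service": "checkout", "root_cause": "config-error"},
]

fingerprints = Counter((i["service"], i["root_cause"]) for i in incidents)
# Every incident beyond the first occurrence of a fingerprint counts as a repeat.
repeats = sum(count - 1 for count in fingerprints.values())
recurrence_rate = repeats / len(incidents) * 100
print(f"Recurrence rate: {recurrence_rate:.0f}%")   # 40%
```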
Customer Impact Metrics
Technical metrics don't tell the whole story. Consider tracking:
- User-minutes of impact: Duration x affected users
- Error budget consumption: How much of your SLO did this incident consume?
- Customer-reported incidents: How many incidents do customers find before you do?
- Revenue impact: If measurable, tie incidents to business outcomes
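The first two of these are straightforward once duration and affected-user counts are recorded. A sketch with illustrative numbers, assuming a 99.9% availability SLO over a 30-day window:

```python
# Hypothetical incident: 45 minutes of impact affecting 1,200 users.
duration_minutes = 45
affected_users = 1_200
user_minutes = duration_minutes * affected_users
print(f"User-minutes of impact: {user_minutes:,}")        # 54,000

# Error budget consumption against a 99.9% SLO over 30 days.
slo_target = 0.999
window_minutes = 30 * 24 * 60                             # 43,200 minutes
error_budget_minutes = window_minutes * (1 - slo_target)  # ~43.2 minutes
consumed = duration_minutes / error_budget_minutes * 100
print(f"Error budget consumed: {consumed:.0f}%")          # ~104%
```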
RCA Quality Metrics
Are your investigations thorough? Consider tracking:
- Action completion rate: What percentage of post-mortem actions get done?
- Time to action completion: How long do actions sit in the backlog?
- Root cause depth: Are you finding systemic causes or stopping at symptoms?
- Investigation rigor score: Objective quality measurement of the analysis
Measuring Investigation Quality
Some teams use rigor scoring to objectively measure investigation quality. A score based on timeline completeness, root cause depth, action specificity, and verification can highlight when investigations are being rushed or when process improvements are needed.
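There is no standard formula for a rigor score; the sketch below is purely illustrative, scoring four equally weighted criteria from 0 to 5 and normalizing to a 0-100 scale:

```python
# Hypothetical rigor rubric: each criterion scored 0-5 by a reviewer.
# Criteria names and weights are illustrative, not a standard.
WEIGHTS = {
    "timeline_completeness": 0.25,
    "root_cause_depth":      0.25,
    "action_specificity":    0.25,
    "verification":          0.25,
}

def rigor_score(scores: dict[str, int]) -> float:
    """Weighted rigor score on a 0-100 scale."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / 5 * 100

review = {"timeline_completeness": 4, "root_cause_depth": 2,
          "action_specificity": 3, "verification": 1}
print(f"Rigor score: {rigor_score(review):.0f}/100")   # 50/100
```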
Setting Up Metric Tracking
Data Requirements
To calculate these metrics, you need to track:
- Incident start time: When did the problem begin (not when it was detected)?
- Detection time: When was the incident first noticed?
- Response start time: When did someone start working on it?
- Recovery time: When was service restored?
- Severity: How bad was the impact?
- Affected services: What was impacted?
- Root cause category: What type of failure was this?
- Related changes: Was this caused by a deployment?
Most teams struggle here because this data lives in multiple systems: monitoring tools, incident management platforms, deployment pipelines, chat transcripts. Purpose-built RCA tools that consolidate this data make metric tracking much easier.
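Whatever the storage layer, the fields above amount to a small, flat record per incident. A sketch of that schema as a Python dataclass; the names are illustrative, not a standard:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class IncidentRecord:
    """Minimal per-incident record for the metrics in this guide.
    Field names are illustrative, not a standard schema."""
    started_at: datetime             # when the problem began
    detected_at: datetime            # when it was first noticed
    acknowledged_at: datetime        # when someone started working on it
    recovered_at: datetime           # when service was restored
    severity: str                    # e.g. "sev1", "sev2"
    affected_services: list[str] = field(default_factory=list)
    root_cause_category: str = "unknown"
    related_change: Optional[str] = None   # deployment/change ID, if any
```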
Avoiding Metric Gaming
Metrics can be gamed. If you reward low MTTR, people will close incidents prematurely. If you penalize high incident counts, people will under-report. Consider these safeguards:
- Track multiple metrics together, not just one
- Use metrics for learning, not punishment
- Combine lagging indicators (MTTR) with leading indicators (action completion)
- Audit metric accuracy regularly
Using Metrics Effectively
Trending Over Absolute Values
An MTTR of 47 minutes means little in isolation. Is that good? Bad? It depends on your context. What matters is the trend. Is your MTTR improving? Is your incident frequency decreasing? Trends reveal whether your investments in reliability are paying off.
Segmentation
Aggregate metrics hide problems. Segment by:
- Service: Which services are least reliable?
- Severity: Are you getting better at critical incidents?
- Team: Which teams need support?
- Root cause category: Are configuration errors your biggest problem?
- Time: Do incidents spike during certain periods?
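Most of these cuts are a one-line group-by over the same incident records. A sketch grouping MTTR by severity, using illustrative flattened data:

```python
from collections import defaultdict

# Hypothetical flattened incidents: (severity, recovery time in minutes).
incidents = [("sev1", 35), ("sev1", 80), ("sev2", 20), ("sev2", 45), ("sev3", 10)]

by_severity = defaultdict(list)
for severity, minutes in incidents:
    by_severity[severity].append(minutes)

for severity in sorted(by_severity):
    times = by_severity[severity]
    print(f"{severity}: {sum(times) / len(times):.0f} min MTTR "
          f"over {len(times)} incidents")
```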
Connecting Metrics to Actions
Metrics are only useful if they drive action. For each metric, know:
- What does a good value look like?
- What does a concerning trend look like?
- What actions would you take if the metric is bad?
If you don't know what you'd do with a metric, you probably don't need to track it.
Benchmark Reference
Industry benchmarks vary widely by sector and scale, but here are general reference points from DORA and other research:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| MTTR | < 1 hour | < 1 day | < 1 week | > 1 week |
| Change Failure Rate | < 5% | < 10% | < 15% | > 15% |
| Deployment Frequency | Multiple per day | Daily to weekly | Weekly to monthly | Less than monthly |
Source: DORA State of DevOps reports. Your targets should reflect your specific context and risk tolerance.
Getting Started
If you're not currently tracking incident metrics, start simple:
- Track incident count and MTTR. These two metrics alone reveal a lot about your reliability posture.
- Segment by severity. Not all incidents are equal. Track critical incidents separately from minor ones.
- Review monthly. Look at trends, not just numbers. Are things getting better or worse?
- Add metrics as needed. When you have questions your current metrics can't answer, add new ones.
Don't try to track everything at once. A few well-measured metrics are more valuable than many poorly measured ones.
Key Takeaways
- MTTR measures response capability; MTBF measures prevention capability
- Segment metrics by service, severity, and root cause to reveal problems
- Track trends over time, not just absolute values
- Only track metrics you'll act on; a few well-measured metrics beat many poorly measured ones