You can't improve what you don't measure. But measuring the wrong things, or measuring the right things badly, can be worse than not measuring at all. This guide covers the incident management metrics that actually matter, how to measure them correctly, and how to use them to drive real improvement.
The Core Metrics
MTTR: Mean Time to Recovery
MTTR is the average time from when an incident is detected to when service is restored. It's the most commonly tracked incident metric, and for good reason: it directly measures how long your users suffer.
MTTR Calculation
MTTR = Total Recovery Time / Number of Incidents
Example: If you had 10 incidents last month with a combined recovery time of 500 minutes, your MTTR is 50 minutes.
A common mistake is conflating MTTR with resolution time. Recovery means the service is working again. Resolution means the underlying issue is fixed. You might recover in 5 minutes by restarting a server, but not resolve the root cause for days.
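In code, this is just an average over recovery durations. A minimal sketch, assuming you record recovery time (not resolution time) per incident in minutes; the numbers below are illustrative and reproduce the example above:

```python
# Hypothetical recovery durations in minutes for last month's 10 incidents.
recovery_minutes = [50, 35, 120, 15, 45, 70, 25, 60, 40, 40]

# MTTR = total recovery time / number of incidents
mttr = sum(recovery_minutes) / len(recovery_minutes)
print(f"MTTR: {mttr:.0f} minutes")   # 50 minutes
```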
Breaking Down MTTR
MTTR can be decomposed into sub-metrics that reveal where your response process needs work:
- MTTD (Mean Time to Detect): How long before you know there's a problem?
- MTTA (Mean Time to Acknowledge): How long before someone starts working on it?
- MTTM (Mean Time to Mitigate): How long to stop the bleeding?
If your MTTD is high, invest in better monitoring and alerting. If MTTA is high, look at your on-call process. If the gap between MTTA and recovery is high, focus on runbooks and automation.
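The decomposition falls out directly if you record the intermediate timestamps. A sketch assuming hypothetical field names (started, detected, acknowledged, recovered); substitute whatever your incident tool actually stores:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident timelines; field names are illustrative only.
incidents = [
    {
        "started":      datetime(2024, 5, 1, 8, 50),
        "detected":     datetime(2024, 5, 1, 9, 0),
        "acknowledged": datetime(2024, 5, 1, 9, 8),
        "recovered":    datetime(2024, 5, 1, 9, 45),
    },
    {
        "started":      datetime(2024, 5, 3, 14, 0),
        "detected":     datetime(2024, 5, 3, 14, 10),
        "acknowledged": datetime(2024, 5, 3, 14, 12),
        "recovered":    datetime(2024, 5, 3, 15, 25),
    },
]

def mean_gap(incidents, start_key, end_key):
    """Average time between two timeline events, in minutes."""
    return mean((i[end_key] - i[start_key]) / timedelta(minutes=1) for i in incidents)

print(f"MTTD: {mean_gap(incidents, 'started', 'detected'):.1f} min")
print(f"MTTA: {mean_gap(incidents, 'detected', 'acknowledged'):.1f} min")
print(f"MTTR: {mean_gap(incidents, 'detected', 'recovered'):.1f} min")
```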
MTBF: Mean Time Between Failures
MTBF measures reliability by tracking how long your systems run without incidents. Higher is better. Unlike MTTR, which measures response capability, MTBF measures prevention capability.
MTBF Calculation
MTBF = Total Uptime / Number of Failures
Example: If your service ran for 720 hours (30 days) and had 3 failures, your MTBF is 240 hours.
MTBF is most useful when tracked by service or component. An overall MTBF hides problem areas. You want to know which services are fragile so you can prioritize reliability investment.
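Per-service MTBF takes only an observation window and a failure count per service. A sketch with made-up numbers, using the window length as a stand-in for uptime (a simplification that slightly overstates MTBF when downtime is significant):

```python
# Hypothetical failure counts over a 30-day (720-hour) window.
observation_hours = 720
failures_by_service = {"checkout": 3, "search": 1, "auth": 6}

# MTBF = total uptime / number of failures (window used as an uptime proxy here).
for service, failures in sorted(failures_by_service.items(),
                                key=lambda kv: kv[1], reverse=True):
    mtbf = observation_hours / failures if failures else float("inf")
    print(f"{service:10s} MTBF: {mtbf:6.1f} hours")
```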
Incident Frequency
Simply counting incidents over time. Track this by severity, service, team, and root cause category. Trends matter more than absolute numbers. Is incident frequency increasing or decreasing? Are certain services or teams seeing more incidents than others?
Change Failure Rate
What percentage of deployments cause incidents? This is a key DevOps metric from the DORA (DevOps Research and Assessment) framework. High-performing teams have change failure rates below 15%.
Change Failure Rate Calculation
CFR = (Incidents Caused by Changes / Total Changes) x 100
Example: If you deployed 100 times and 8 deployments caused incidents, your CFR is 8%.
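If incidents carry a reference to the change that triggered them, CFR is a simple lookup. A sketch with hypothetical deployment IDs, reproducing the example above:

```python
# Hypothetical data: 100 deployment IDs and the subset that caused incidents.
deployments = [f"deploy-{n}" for n in range(1, 101)]
incident_causing = {"deploy-7", "deploy-23", "deploy-41", "deploy-42",
                    "deploy-58", "deploy-66", "deploy-83", "deploy-97"}

# CFR = (incidents caused by changes / total changes) x 100
cfr = len(incident_causing & set(deployments)) / len(deployments) * 100
print(f"Change failure rate: {cfr:.1f}%")   # 8.0%
```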
Advanced Metrics
Incident Recurrence Rate
What percentage of incidents are repeats of previous incidents? This directly measures the effectiveness of your root cause analysis and action follow-through. If you're seeing the same problems repeatedly, your RCA process isn't working.
Industry Benchmark
Research suggests that 84% of IT system failures are repeat incidents. If you can reduce your recurrence rate below the industry average, you're saving significant engineering time and customer impact.
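One simple way to approximate recurrence is to treat incidents that share a fingerprint (for example, service plus root cause category) as repeats. A sketch assuming those two fields exist; real matching is usually fuzzier than this:

```python
from collections import Counter

# Hypothetical closed incidents with a service and root cause category.
incidents = [
    {"service": "checkout", "root_cause": "config-error"},
    {"service": "checkout", "root_cause": "config-error"},
    {"service": "search",   "root_cause": "capacity"},
    {"service": "auth",     "root_cause": "dependency-failure"},
    {"service": "checkout", "root_cause": "config-error"},
]

fingerprints = Counter((i["service"], i["root_cause"]) for i in incidents)
# Every incident beyond the first occurrence of a fingerprint counts as a repeat.
repeats = sum(count - 1 for count in fingerprints.values())
recurrence_rate = repeats / len(incidents) * 100
print(f"Recurrence rate: {recurrence_rate:.0f}%")   # 40%
```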
Customer Impact Metrics
Technical metrics don't tell the whole story. Consider tracking:
- User-minutes of impact: Duration x affected users
- Error budget consumption: How much of your SLO did this incident consume?
- Customer-reported incidents: How many incidents do customers find before you do?
- Revenue impact: If measurable, tie incidents to business outcomes
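The first two of these are straightforward once duration and affected-user counts are recorded. A sketch with illustrative numbers, assuming a 99.9% availability SLO over a 30-day window:

```python
# Hypothetical incident: 45 minutes of impact affecting 1,200 users.
duration_minutes = 45
affected_users = 1_200
user_minutes = duration_minutes * affected_users
print(f"User-minutes of impact: {user_minutes:,}")        # 54,000

# Error budget consumption against a 99.9% SLO over 30 days.
slo_target = 0.999
window_minutes = 30 * 24 * 60                             # 43,200 minutes
error_budget_minutes = window_minutes * (1 - slo_target)  # ~43.2 minutes
consumed = duration_minutes / error_budget_minutes * 100
print(f"Error budget consumed: {consumed:.0f}%")          # ~104%
```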
RCA Quality Metrics
Are your investigations thorough? Consider tracking:
- Action completion rate: What percentage of post-mortem actions get done?
- Time to action completion: How long do actions sit in the backlog?
- Root cause depth: Are you finding systemic causes or stopping at symptoms?
- Investigation rigor score: Objective quality measurement of the analysis
Measuring Investigation Quality
Some teams use rigor scoring to objectively measure investigation quality. A score based on timeline completeness, root cause depth, action specificity, and verification can highlight when investigations are being rushed or when process improvements are needed.
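There is no standard formula for a rigor score; the sketch below is purely illustrative, scoring four equally weighted criteria from 0 to 5 and normalizing to a 0-100 scale:

```python
# Hypothetical rigor rubric: each criterion scored 0-5 by a reviewer.
# Criteria names and weights are illustrative, not a standard.
WEIGHTS = {
    "timeline_completeness": 0.25,
    "root_cause_depth":      0.25,
    "action_specificity":    0.25,
    "verification":          0.25,
}

def rigor_score(scores: dict[str, int]) -> float:
    """Weighted rigor score on a 0-100 scale."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / 5 * 100

review = {"timeline_completeness": 4, "root_cause_depth": 2,
          "action_specificity": 3, "verification": 1}
print(f"Rigor score: {rigor_score(review):.0f}/100")   # 50/100
```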
Setting Up Metric Tracking
Data Requirements
To calculate these metrics, you need to track:
- Incident start time: When did the problem begin (not when it was detected)?
- Detection time: When was the incident first noticed?
- Response start time: When did someone start working on it?
- Recovery time: When was service restored?
- Severity: How bad was the impact?
- Affected services: What was impacted?
- Root cause category: What type of failure was this?
- Related changes: Was this caused by a deployment?
Most teams struggle here because this data lives in multiple systems: monitoring tools, incident management platforms, deployment pipelines, chat transcripts. Purpose-built RCA tools that consolidate this data make metric tracking much easier.
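Whatever the storage layer, the fields above amount to a small, flat record per incident. A sketch of that schema as a Python dataclass; the names are illustrative, not a standard:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class IncidentRecord:
    """Minimal per-incident record for the metrics in this guide.
    Field names are illustrative, not a standard schema."""
    started_at: datetime             # when the problem began
    detected_at: datetime            # when it was first noticed
    acknowledged_at: datetime        # when someone started working on it
    recovered_at: datetime           # when service was restored
    severity: str                    # e.g. "sev1", "sev2"
    affected_services: list[str] = field(default_factory=list)
    root_cause_category: str = "unknown"
    related_change: Optional[str] = None   # deployment/change ID, if any
```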
Avoiding Metric Gaming
Metrics can be gamed. If you reward low MTTR, people will close incidents prematurely. If you penalize high incident counts, people will under-report. Consider these safeguards:
- Track multiple metrics together, not just one
- Use metrics for learning, not punishment
- Combine lagging indicators (MTTR) with leading indicators (action completion)
- Audit metric accuracy regularly
Using Metrics Effectively
Trending Over Absolute Values
An MTTR of 47 minutes means little in isolation. Is that good? Bad? It depends on your context. What matters is the trend. Is your MTTR improving? Is your incident frequency decreasing? Trends reveal whether your investments in reliability are paying off.
Segmentation
Aggregate metrics hide problems. Segment by:
- Service: Which services are least reliable?
- Severity: Are you getting better at critical incidents?
- Team: Which teams need support?
- Root cause category: Are configuration errors your biggest problem?
- Time: Do incidents spike during certain periods?
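Most of these cuts are a one-line group-by over the same incident records. A sketch grouping MTTR by severity, using illustrative flattened data:

```python
from collections import defaultdict

# Hypothetical flattened incidents: (severity, recovery time in minutes).
incidents = [("sev1", 35), ("sev1", 80), ("sev2", 20), ("sev2", 45), ("sev3", 10)]

by_severity = defaultdict(list)
for severity, minutes in incidents:
    by_severity[severity].append(minutes)

for severity in sorted(by_severity):
    times = by_severity[severity]
    print(f"{severity}: {sum(times) / len(times):.0f} min MTTR "
          f"over {len(times)} incidents")
```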
Connecting Metrics to Actions
Metrics are only useful if they drive action. For each metric, know:
- What does a good value look like?
- What does a concerning trend look like?
- What actions would you take if the metric is bad?
If you don't know what you'd do with a metric, you probably don't need to track it.
Benchmark Reference
Industry benchmarks vary widely by sector and scale, but here are general reference points from DORA and other research:
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| MTTR | < 1 hour | < 1 day | < 1 week | > 1 week |
| Change Failure Rate | < 5% | < 10% | < 15% | > 15% |
| Deployment Frequency | Multiple per day | Daily to weekly | Weekly to monthly | Less than monthly |
Source: DORA State of DevOps reports. Your targets should reflect your specific context and risk tolerance.
Getting Started
If you're not currently tracking incident metrics, start simple:
- Track incident count and MTTR. These two metrics alone reveal a lot about your reliability posture.
- Segment by severity. Not all incidents are equal. Track critical incidents separately from minor ones.
- Review monthly. Look at trends, not just numbers. Are things getting better or worse?
- Add metrics as needed. When you have questions your current metrics can't answer, add new ones.
Don't try to track everything at once. A few well-measured metrics are more valuable than many poorly measured ones.
Key Takeaways
- MTTR measures response capability; MTBF measures prevention capability
- Segment metrics by service, severity, and root cause to reveal problems
- Track trends over time, not just absolute values
- Only track metrics you'll act on; a few well-measured metrics beat many poorly measured ones