Incident reviews are supposed to prevent the next outage. In practice, they often become the meeting everyone dreads and eventually stops attending.
You've seen it happen. The first few post-mortems after a major incident have full attendance. But by the third or fourth review of the quarter, the invite list starts to thin. Engineers mark "tentative" and then don't show. Managers send delegates. The on-call engineer sits alone on a call with two other people, none of whom were actually involved in the incident.
Outage fatigue is real. And it's a symptom of a broken process.
Why Engineers Stop Showing Up
The problem isn't that engineers don't care about reliability. It's that most post-mortem meetings are a terrible use of their time.
The meetings take forever. What's scheduled for an hour routinely bleeds into ninety minutes, sometimes two hours. You've got twelve people on the call, but only three of them are actively contributing at any moment. Everyone else is waiting, checking Slack, half-listening, wondering why they're here.
The first hour is clerical work. Someone asks when the database CPU actually spiked. Another person digs through Slack to find who deployed the hotfix. The moderator spends twenty minutes reconciling three different versions of the timeline. High-leverage engineers sit idle, watching one person type.
Then the blame game starts.
I once sat in a post-mortem where two Directors spent 45 minutes arguing over which team "owned" the root cause. It wasn't a technical debate. It was a political one. Their yearly bonuses were tied to outage ownership metrics. Too many incidents attributed to your team, and your compensation took a hit.
So instead of discussing how to prevent recurrence, the meeting became a negotiation. "That wasn't really our service's fault. The upstream team sent malformed data." "But your service should have validated the input." Back and forth, while everyone else sat in uncomfortable silence.
The incident never got a real root cause. It got a compromise. And six months later, the same failure mode took down production again.
What Clouds the Discussion
Blame dynamics are just one of the ways post-mortems go sideways. The other common failure modes:
Timeline reconstruction as a group activity. Twelve people watching one person copy-paste timestamps from CloudWatch into a Google Doc. This should never happen in a synchronous meeting.
The "what" drowns out the "why." Teams spend so long agreeing on the sequence of events that they run out of time for actual analysis. The 5 Whys becomes the 2 Whys, followed by "we'll take it offline."
Hero worship. The meeting becomes a retelling of how the on-call engineer heroically saved the day, rather than an examination of why heroics were required in the first place.
Scope creep. Every tangentially related issue gets surfaced. "While we're here, should we also talk about that other thing that happened last week?" The meeting loses focus and nothing gets resolved.
Decision paralysis on action items. Twenty minutes spent debating whether to "add monitoring" versus "improve monitoring" versus "evaluate monitoring options." The action item that gets written down is so vague it never gets done.
By the end, engineers leave feeling like they wasted two hours. The next time a post-mortem invite lands in their calendar, they remember that feeling.
The Fix: Make Meetings Worth Attending
The goal isn't to eliminate post-mortems. It's to make them valuable enough that people want to attend. As the Learning from Incidents community emphasizes, the purpose of incident analysis is organizational learning—and that means respecting people's time and focusing on what humans do best: analysis, not administration. (If you need a full process from start to finish, see our step-by-step post-mortem checklist.)
Before the Meeting: The 15-Minute Prep
The 30-minute meeting only works if the facilitator does their homework. This prep should take no more than 15 minutes—if it takes longer, the tooling is the bottleneck, not the process.
The facilitator should complete the following before the meeting starts:
- Timeline built from logs and alerts (5 minutes with proper tooling). Pull timestamps from monitoring, deployment logs, and chat history into a coherent sequence of events (see the sketch after this list).
- Impact summary documented: duration of the incident, number of affected users, severity classification. This grounds the discussion in facts.
- Key participants confirmed: who was on-call, who deployed the change, who first detected the issue. These are the people who need to be in the room.
- Pre-read shared at least 2 hours before the meeting. Attendees should arrive having already reviewed the timeline and impact summary. No cold starts.
If prep consistently takes more than 15 minutes, invest in better tooling. The meeting's quality is directly proportional to the quality of preparation.
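Even without dedicated tooling, the merge step itself is mechanical. Here is a minimal sketch in Python, assuming each source can be exported as a JSON array of records with ISO-8601 timestamps; the file names and field names are hypothetical placeholders for whatever your monitoring, deploy, and chat exports actually produce.

```python
# Minimal timeline-merge sketch: combine exported events from a few sources
# into one chronological list. File names and field names are hypothetical.
import json
from datetime import datetime, timezone

def load_events(path, source, ts_field, text_field):
    """Read a JSON array export and normalize each record to (timestamp, source, text)."""
    with open(path) as f:
        records = json.load(f)
    events = []
    for r in records:
        # Assumes ISO-8601 timestamps; adjust parsing for your exports.
        ts = datetime.fromisoformat(r[ts_field]).astimezone(timezone.utc)
        events.append((ts, source, r[text_field]))
    return events

def build_timeline():
    events = (
        load_events("alerts.json", "alert", "firedAt", "title")
        + load_events("deploys.json", "deploy", "finished_at", "description")
        + load_events("chat_export.json", "chat", "ts", "text")
    )
    # One sort produces the coherent sequence of events the pre-read needs.
    for ts, source, text in sorted(events, key=lambda e: e[0]):
        print(f"{ts:%H:%M} [{source}] {text}")

if __name__ == "__main__":
    build_timeline()
```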
1. Move Timeline Construction to Async
The timeline should be complete before anyone joins the call.
This requires tooling that makes async prep easy. If building a timeline means manually formatting timestamps in a wiki page, it won't happen. Engineers will procrastinate until the meeting starts, and you're back to group clerical work.
OutageReview is designed for this. Drag and drop alerts and chat messages into a visual timeline. When the paperwork takes five minutes instead of an hour, it actually gets done ahead of time.
2. Timebox Ruthlessly
A focused post-mortem should take thirty minutes:
- 5 minutes: Review the pre-built timeline. Quick walkthrough, not reconstruction.
- 15 minutes: 5 Whys analysis. Structured root cause discussion.
- 10 minutes: Action items. Specific, assigned, with due dates.
If you can't finish in thirty minutes, you either weren't prepared or the incident is complex enough to warrant a dedicated deep-dive session. Schedule that separately, with the right people.
A 30-Minute Review in Practice
Theory is one thing. Here's what a 30-minute post-mortem actually looks like, using a real-world scenario: a payment API latency spike that caused checkout failures for 17 minutes.
Minutes 0–5: Timeline Review
The facilitator shares the pre-built timeline on screen. The key events are already documented:
- 2:14 PM — Deploy of payment-service v2.31 goes out
- 2:18 PM — Latency alerts fire on the payment API (p99 > 5s)
- 2:22 PM — First customer reports arrive via support channel
- 2:31 PM — On-call engineer rolls back to v2.30
- 2:35 PM — Latency returns to normal, recovery confirmed
The facilitator asks: "Anything missing?" One engineer adds that they noticed elevated error rates in the downstream notification service too—a contributing detail that wasn't in the automated alerts. The timeline is accepted in under five minutes. Move on.
Minutes 5–20: 5 Whys Analysis
The facilitator walks the group through the causal chain:
- Why did latency spike? The new endpoint was making synchronous calls to a slow external API.
- Why wasn't this caught in staging? Staging uses a mock for the external API.
- Why do we mock the external API? It has rate limits that staging would hit.
- Why wasn't there a timeout configured? There's no standard for external API timeouts in the codebase.
Root cause: No timeout standards for external dependencies. Contributing factor: The staging environment doesn't surface integration performance issues because external services are mocked without latency simulation.
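One way to close that staging gap is to give the mock a realistic delay instead of an instant response. A minimal sketch, assuming a Python test suite; the function names, delay range, and patched module path are all hypothetical:

```python
# Sketch of a latency-aware fake for a mocked external dependency. Instead of
# returning instantly, it sleeps for a roughly realistic delay so integration
# tests surface slow-dependency behavior. All names here are hypothetical.
import random
import time

def fake_charge(amount_cents: int) -> dict:
    """Stand-in for the real payment-provider call, with simulated latency."""
    time.sleep(random.uniform(0.2, 1.5))  # rough spread of the real provider's latency
    return {"status": "approved", "amount_cents": amount_cents}

# Example usage in a test (module path is hypothetical):
#   from unittest.mock import patch
#   with patch("payment_service.providers.charge", side_effect=fake_charge):
#       run_checkout_flow()
```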
Minutes 20–30: Action Items
The team defines three specific, assigned actions with deadlines:
- Action 1: Add 3-second timeout standard for all external API calls (sketched below). Owner: Sarah. Due: Feb 28.
- Action 2: Add latency circuit breaker to payment service. Owner: James. Due: Mar 7.
- Action 3: Evaluate staging integration testing options (latency simulation for mocked services). Owner: Sarah. Due: Mar 14.
The meeting ends on time. The investigation doc is already 80% complete because the timeline and impact summary were prepared in advance. The facilitator sends the final write-up within the hour.
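For illustration, Action 1 could end up looking something like the following, assuming a Python service that calls external APIs through the requests library; the wrapper names and module layout are hypothetical, and the 3-second default comes straight from the action item.

```python
# Illustrative take on Action 1 (names are hypothetical): a thin wrapper that
# enforces a default timeout on every outbound HTTP call, so no external
# dependency can hang the payment path indefinitely.
import requests

DEFAULT_TIMEOUT_SECONDS = 3.0  # the standard proposed in the action item

def external_get(url: str, *, timeout: float = DEFAULT_TIMEOUT_SECONDS, **kwargs) -> requests.Response:
    """GET with a mandatory timeout; callers may tighten it but never omit it."""
    return requests.get(url, timeout=timeout, **kwargs)

def external_post(url: str, *, timeout: float = DEFAULT_TIMEOUT_SECONDS, **kwargs) -> requests.Response:
    """POST with the same enforced timeout default."""
    return requests.post(url, timeout=timeout, **kwargs)
```

The point isn't this particular wrapper; it's that the action item is specific enough that someone can implement and verify it before the due date.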
3. Remove the Blame Incentive
If bonuses are tied to outage ownership, your post-mortems will be political theater. Full stop.
Blameless culture isn't just a nice-to-have. It's a prerequisite for honest analysis. John Allspaw and Etsy's engineering team pioneered this approach, demonstrating that psychological safety produces better investigations. If people are worried about attribution, they'll spend their energy on defense instead of prevention.
This is an organizational problem, not a tooling problem. But structured investigation helps by focusing the conversation on systems and processes rather than individuals. When the analysis is "the deployment pipeline lacked canary testing" rather than "John pushed a bad config," the discussion stays productive.
4. Score Investigation Quality, Not Incident Ownership
Instead of tracking "how many incidents did Team X cause," track "how thorough was the investigation" and "how many action items were completed." (See our guide to incident management metrics for what to measure and why.)
OutageReview's Rigor Score measures analysis depth objectively. Did the team build a complete timeline? Did they go deep on root cause? Are the action items specific and verified?
This shifts the incentive from "avoid blame" to "do good work." Teams that consistently produce high-rigor investigations get recognized for their engineering discipline, not penalized for the incidents they happened to be involved in.
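To make the idea concrete without implying anything about OutageReview's internals, here is a toy example of scoring an investigation by process signals rather than by ownership. The fields and weights are invented for illustration only; it is not the actual Rigor Score.

```python
# Toy illustration only -- not OutageReview's actual Rigor Score. It shows the
# idea: grade the investigation's process, not the team's incident count.
from dataclasses import dataclass

@dataclass
class Investigation:
    timeline_complete: bool       # every key event has a timestamp and source
    root_cause_depth: int         # how many "why" levels were documented
    action_items_specific: int    # items with an owner and a due date
    action_items_total: int

def rigor_score(inv: Investigation) -> float:
    """Return a 0-100 score from simple, checkable signals."""
    timeline = 30 if inv.timeline_complete else 0
    depth = min(inv.root_cause_depth, 5) / 5 * 40
    if inv.action_items_total:
        specificity = inv.action_items_specific / inv.action_items_total * 30
    else:
        specificity = 0
    return round(timeline + depth + specificity, 1)

# Example: complete timeline, 4 whys, 3 of 3 specific action items -> 92.0
print(rigor_score(Investigation(True, 4, 3, 3)))
```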
Make It Worth Their Time
The best post-mortems are the ones where engineers leave feeling like they learned something. Where the discussion surfaces an insight nobody had considered. Where the action items are concrete enough that people actually want to implement them.
That only happens when you respect the meeting's constraints:
- Don't make people sit through administrative work they could have reviewed async
- Don't let political dynamics hijack the technical discussion
- Don't let scope creep turn a focused review into an open forum
Thirty minutes, well-prepared, tightly facilitated. That's a meeting people will show up to.
Facilitator Tips
The facilitator makes or breaks the meeting. Here are practical techniques that keep 30-minute reviews on track:
Keep a parking lot. When someone raises a tangential issue, acknowledge it and add it to a "parking lot" list. Address it after the meeting or in a separate session. This prevents scope creep without dismissing concerns.
Use silence as a tool. After asking "why?", wait at least 5 seconds before prompting. People need time to think. Filling silence with your own theories biases the discussion.
Redirect blame language. When someone says "Team X should have..." rephrase to "What would have helped Team X catch this?" This keeps the conversation systemic without calling people out.
End with a round-robin. Ask each participant: "Is there anything we missed?" This surfaces insights from quieter team members who may not speak up unprompted.
Send the summary within 1 hour. If the write-up arrives days later, people have already moved on. Fast turnaround reinforces that post-mortems matter.
Frequently Asked Questions
How long should a post-mortem take?
A well-prepared post-mortem meeting should take 30 minutes. This requires that timeline reconstruction and evidence gathering happen asynchronously before the meeting. The meeting itself should focus exclusively on root cause analysis and action planning. If your post-mortems regularly exceed an hour, the problem is usually insufficient preparation, not insufficient meeting time.
What is outage fatigue?
Outage fatigue (also called post-mortem fatigue or incident fatigue) occurs when engineers become disengaged from the incident review process because meetings feel like a waste of time. Common causes include lengthy meetings dominated by timeline reconstruction, blame dynamics, scope creep, and action items that never get completed. The result is declining attendance, shallow investigations, and ultimately, more repeat incidents. The solution is making meetings shorter, better-prepared, and genuinely valuable.
Should every incident have a post-mortem?
No. Over-post-morteming causes the same fatigue that poor-quality post-mortems do. Reserve full post-mortem meetings for P1 incidents (service down, customer impact) and repeat incidents of any severity. P2 incidents should get a written analysis but may not need a meeting. P3 incidents need only a brief write-up. The key exception: any incident with a novel failure mode should get a thorough review regardless of severity, because these represent new risks.
What is async incident review?
Async incident review is the practice of performing parts of the post-mortem process outside of synchronous meetings. This typically includes timeline construction, evidence gathering, and initial root cause hypotheses—all done in a shared document or tool before the meeting starts. The Learning from Incidents community advocates for async-first approaches where the meeting is the shortest possible step, focused only on discussion that requires real-time collaboration.
Key Takeaways
- Outage fatigue happens when meetings waste time on clerical work
- Move timeline construction to async so meetings focus on analysis
- Blame incentives (like bonus structures tied to outages) poison the discussion
- A well-prepared 30-minute meeting beats a rambling 2-hour session