A post-mortem done well transforms an incident from a painful experience into a learning opportunity. Done poorly, it becomes a blame session that teaches nothing and damages team trust. This checklist covers everything you need to run effective, blameless post-mortems that actually prevent future incidents.
What Makes a Good Post-Mortem?
An effective post-mortem answers three questions: What happened? Why did it happen? And what will we do to prevent it from happening again? But the format matters less than the culture. As Google's SRE Book emphasizes, a post-mortem in a blame-focused organization will always be shallow because people won't share honestly.
The goal isn't to find who was at fault. It's to understand what systemic factors allowed the incident to happen and what changes will make the system more resilient.
Blameless Doesn't Mean Unaccountable
Blameless culture means we don't punish individuals for honest mistakes. It doesn't mean nobody is responsible for fixing things. Every action item should have an owner. Every improvement should have accountability. We just don't blame people for the incidents that reveal the need for those improvements. For a comprehensive framework on building this culture, see PagerDuty's post-mortem guide.
Complete Post-Mortem Checklist
Phase 1: Immediate (Within 24 Hours)
Confirm the incident is fully resolved
Don't start the post-mortem while still firefighting
Assign a post-mortem lead
Someone to gather data, schedule the meeting, and drive the process
Preserve evidence
Save logs, metrics, chat transcripts, and any temporary debugging artifacts before they expire or get cleaned up; a minimal archiving sketch follows this checklist
Document immediate facts while fresh
Ask responders to write brief notes about what they did and observed
Schedule the post-mortem meeting
Aim for 1-3 days after resolution, while details are fresh but emotions have cooled
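For the evidence-preservation step above, here is a minimal archiving sketch. It assumes you have already exported logs, metric snapshots, and chat transcripts into a local incident-evidence directory and that you archive to an S3 bucket via boto3; the incident ID, bucket name, and paths are placeholders, and any durable storage your team already trusts works just as well.

```python
# Minimal evidence-preservation sketch (assumptions: collected artifacts live
# under ./incident-evidence and are archived to an S3 bucket you control;
# the incident ID and bucket name are placeholders).
from datetime import datetime, timezone
from pathlib import Path

import boto3  # assumes AWS credentials are already configured

INCIDENT_ID = "2024-03-15-payments-outage"  # hypothetical identifier
BUCKET = "incident-evidence-archive"        # hypothetical bucket


def archive_evidence(source_dir: str = "incident-evidence") -> None:
    """Copy every collected artifact into a timestamped S3 prefix so it
    survives log rotation and chat retention policies."""
    s3 = boto3.client("s3")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    for path in Path(source_dir).rglob("*"):
        if path.is_file():
            key = f"{INCIDENT_ID}/{stamp}/{path.relative_to(source_dir)}"
            s3.upload_file(str(path), BUCKET, key)
            print(f"archived {path} -> s3://{BUCKET}/{key}")


if __name__ == "__main__":
    archive_evidence()
```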
Phase 2: Preparation (Before the Meeting)
Build the incident timeline
Gather logs, metrics, and communications into a chronological sequence
Identify key timestamps
When did the incident start? When was it detected? When was it resolved?
Calculate impact metrics
Duration, affected users/requests, error rates, and revenue impact if applicable; a minimal calculation sketch follows this checklist
Identify attendees
Include responders, relevant engineers, and affected stakeholders
Share pre-read materials
Send the timeline and basic facts so people come prepared
Review previous similar incidents
Check if this is a pattern or first occurrence
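For the impact-metrics step above, here is a minimal calculation sketch. It assumes you can export per-request records from your observability stack for the incident window; the field names and sample data are hypothetical.

```python
# Minimal impact-metric sketch (assumptions: per-request records exported
# from your observability stack; field names and sample data are illustrative).
from datetime import datetime

incident_start = datetime(2024, 3, 15, 14, 2)   # first customer-visible error
incident_end = datetime(2024, 3, 15, 15, 47)    # confirmed resolution

# Hypothetical export: one record per request during the incident window.
requests_during_incident = [
    {"user_id": "u1", "failed": True},
    {"user_id": "u2", "failed": False},
    {"user_id": "u1", "failed": True},
]

duration_minutes = (incident_end - incident_start).total_seconds() / 60
failed = [r for r in requests_during_incident if r["failed"]]
error_rate = len(failed) / len(requests_during_incident)
affected_users = {r["user_id"] for r in failed}

print(f"Duration: {duration_minutes:.0f} minutes")
print(f"Error rate during incident: {error_rate:.1%}")
print(f"Unique users affected: {len(affected_users)}")
```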
Timeline Building Tip
Building timelines from raw logs can take hours of tedious work. OutageReview lets you drag and drop events into a visual timeline, reducing hours of manual work to minutes. Whichever tool you use, verify the assembled timeline and add the context only your team knows, like who made decisions and why. A minimal sketch of merging events from several sources into one chronological sequence appears below.
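The merge itself is simple once the events are exported: collect alerts, deploys, and chat messages as timestamped records, sort them, and print one chronological sequence. The sample events and field layout below are illustrative only.

```python
# Minimal timeline-merge sketch (assumptions: events exported from alerting,
# deploy tooling, and chat as (timestamp, source, description) tuples).
from datetime import datetime

alerts = [
    (datetime(2024, 3, 15, 14, 5), "alerting", "P1 page: checkout error rate > 10%"),
]
deploys = [
    (datetime(2024, 3, 15, 13, 58), "deploys", "payments-service v2.41 rolled out"),
]
chat = [
    (datetime(2024, 3, 15, 14, 9), "chat", "on-call acknowledges page, starts investigating"),
]

# Merge every source and sort by timestamp to get one chronological sequence.
timeline = sorted(alerts + deploys + chat, key=lambda event: event[0])

for when, source, description in timeline:
    print(f"{when:%H:%M} [{source:8}] {description}")
```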
Phase 3: The Post-Mortem Meeting
Set the tone
Remind everyone this is blameless; we're here to learn, not to punish
Review the timeline together
Walk through what happened, filling in gaps and correcting errors
Conduct root cause analysis
Use the 5 Whys, fishbone diagrams, or other RCA techniques to dig into causes
Identify multiple contributing factors
Don't stop at one cause; complex incidents have multiple factors
Discuss what went well
Acknowledge effective response actions; reinforce good practices
Brainstorm preventive actions
What changes would prevent this incident or detect it faster?
Prioritize action items
Not everything can be fixed immediately; focus on highest impact
Assign owners and deadlines
Every action item needs a specific owner and due date
Phase 4: Documentation
Write a clear incident summary
One paragraph that anyone can understand
Document the complete timeline
Include timestamps, events, and decision points
Record root causes
Both immediate and underlying systemic causes
Document impact
Duration, affected services, user impact, business impact
List all action items
With owners, due dates, and priority
Include lessons learned
What should others know about this type of incident?
Link to relevant resources
Dashboards, runbooks, code changes, related incidents
Phase 5: Follow-Through
Share the post-mortem widely
Other teams can learn from your experience
Enter action items in your tracking system
Treat them like any other work item; a minimal sketch of filing them automatically appears after this checklist
Schedule follow-up reviews
Check that actions are completed on schedule
Verify action effectiveness
Did the changes actually prevent recurrence?
Update runbooks and documentation
Incorporate lessons learned into operational docs
Add to lessons learned library
Make post-mortems searchable for future reference
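For the action-item tracking step above, here is a minimal sketch of filing each item as an issue in your tracker. It assumes GitHub Issues and its standard REST endpoint for creating issues; the repository, labels, assignees, and token handling are placeholders, and the same idea applies to Jira or any other tracker your team uses.

```python
# Minimal action-item filing sketch (assumptions: GitHub Issues is the tracker;
# repository, labels, assignees, and token variable are placeholders).
import os

import requests

REPO = "example-org/payments-service"  # hypothetical repository
TOKEN = os.environ["GITHUB_TOKEN"]     # access token with permission to create issues

action_items = [
    {"title": "Add pre-migration backup check to deploy pipeline",
     "owner": "alice", "due": "2024-04-15"},
    {"title": "Alert on payments p99 latency > 500ms",
     "owner": "bob", "due": "2024-04-01"},
]

for item in action_items:
    response = requests.post(
        f"https://api.github.com/repos/{REPO}/issues",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": item["title"],
            "body": f"Post-mortem action item. Due: {item['due']}",
            "labels": ["post-mortem-action"],
            "assignees": [item["owner"]],
        },
        timeout=10,
    )
    response.raise_for_status()
    print(f"Filed issue #{response.json()['number']}: {item['title']}")
```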
Post-Mortem Document Template
Post-Mortem: [Incident Title]
Date
[Date of incident]
Author
[Post-mortem lead]
Status
[Draft / Final / Actions Complete]
Summary
[1-2 paragraph summary: what happened, impact, and key takeaway]
Impact
- Duration: [X hours/minutes]
- Users affected: [number or percentage]
- Services affected: [list]
- Business impact: [revenue, SLA, etc.]
Timeline
[Detailed chronological timeline with timestamps]
Root Causes
[List of contributing factors at technical and systemic levels]
What Went Well
[Effective response actions]
What Could Be Improved
[Areas for improvement in detection, response, or prevention]
Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| [Action description] | [Name] | [Date] | [Status] |
Lessons Learned
[Key insights for the organization]
Common Post-Mortem Anti-Patterns
The Blame Game
If your post-mortem includes phrases like "Bob should have..." or "The team failed to...", you're doing it wrong. Reframe: What systemic factors allowed this to happen? What guardrails were missing?
Example: A team's post-mortem for a database outage focused entirely on the junior engineer who ran a migration without a backup. The real cause—no migration safety checks in the deployment pipeline—went unaddressed. The same type of failure happened again two months later.
The Checkbox Exercise
Post-mortems done because policy requires them, not because the team wants to learn, produce shallow analysis and ignored action items. If you're just going through the motions, you're wasting everyone's time.
Example: An organization requires post-mortems for all P1s. The template gets copy-pasted, the "root cause" field says "configuration error," the actions say "be more careful." Three months of post-mortems, zero meaningful improvements.
The Novel
A 20-page post-mortem that nobody reads helps nobody. Be thorough but concise. Focus on what matters for preventing recurrence.
Example: A 15-page post-mortem with full log dumps and every Slack message from the incident channel. Nobody reads past page two. The critical insight on page 12 is never acted on.
The Action Item Graveyard
Post-mortems that produce action items that never get done are theater. Track actions to completion or don't bother writing them down.
Example: A team's post-mortem produces 8 action items. Six months later, only 1 has been completed. The others were deprioritized in favor of feature work. When the same failure happens again, the team writes the same action items.
When to Do a Post-Mortem (and When to Skip)
Not every incident needs a full post-mortem meeting. Over-post-morteming creates review fatigue: teams start treating every review as a chore, and the quality of analysis drops across the board. The key is matching the depth of your review to the severity and novelty of the incident.
Suggested Severity Thresholds
- P1
Service down, customer impact
Full post-mortem with meeting, always. No exceptions. Document the timeline, conduct root cause analysis, assign action items, and schedule follow-up reviews.
- P2
Degraded service
Post-mortem document required. Hold a meeting only if this is a novel failure mode the team hasn't seen before. If it's a well-understood type of failure, the written document is sufficient.
- P3
Minor issue, no customer impact
A brief write-up is enough. No meeting needed. Capture what happened and any quick fixes, but don't burn an hour of the team's time on a non-event.
- Repeat
Recurring incidents regardless of severity
Always post-mortem recurring issues, even if they're P3. Repeated incidents indicate systemic problems that won't go away on their own. If the same class of failure keeps happening, something is broken in the process, architecture, or tooling. A minimal sketch of this triage logic appears below.
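Here is a minimal sketch of that triage logic, expressed as a function you could drop into an incident bot or runbook automation. The severity labels mirror the thresholds in this section; how you detect recurrence is left as an assumption.

```python
# Minimal triage sketch (severity labels and review depths mirror the
# thresholds above; the recurrence flag is assumed to come from your own
# incident history lookup).
def review_depth(severity: str, is_recurring: bool) -> str:
    """Return the suggested post-mortem depth for an incident."""
    if is_recurring:
        return "full post-mortem with meeting"  # repeats signal systemic problems
    if severity == "P1":
        return "full post-mortem with meeting"
    if severity == "P2":
        return "written post-mortem; meeting only for novel failure modes"
    return "brief write-up, no meeting"


print(review_depth("P3", is_recurring=True))   # recurring P3s still get the full treatment
print(review_depth("P2", is_recurring=False))  # written post-mortem; meeting only if novel
```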
For lower-severity incidents that still warrant a review, consider using a lighter-weight format. Our guide to the 30-minute post-mortem provides a streamlined approach that captures the essential learnings without the overhead of a full post-mortem process.
Measuring Post-Mortem Quality
How do you know if your post-mortems are effective? Consider tracking these incident management metrics (a minimal calculation sketch follows the list):
- Action completion rate: What percentage of actions get done?
- Time to completion: How long do actions take to implement?
- Recurrence rate: Do similar incidents happen again?
- Investigation depth: Are you finding systemic causes or just immediate triggers?
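Here is a minimal sketch of computing two of these metrics from exported post-mortem records. The record shape, the failure-class tags, and the sample numbers are assumptions for illustration.

```python
# Minimal metrics sketch (assumptions: post-mortem records exported with
# action-item counts and a "failure class" tag; data is illustrative).
postmortems = [
    {"failure_class": "db-migration", "actions_total": 5, "actions_done": 4},
    {"failure_class": "config-change", "actions_total": 3, "actions_done": 1},
    {"failure_class": "db-migration", "actions_total": 4, "actions_done": 4},
]

# Action completion rate: completed action items over all action items.
total_actions = sum(p["actions_total"] for p in postmortems)
done_actions = sum(p["actions_done"] for p in postmortems)
completion_rate = done_actions / total_actions

# Recurrence: share of incidents whose failure class appears more than once.
classes = [p["failure_class"] for p in postmortems]
repeat_classes = {c for c in classes if classes.count(c) > 1}
recurrence_rate = len([c for c in classes if c in repeat_classes]) / len(classes)

print(f"Action completion rate: {completion_rate:.0%}")
print(f"Share of incidents in a repeated failure class: {recurrence_rate:.0%}")
```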
Some teams implement quality scoring for post-mortems, measuring completeness of timelines, depth of root cause analysis, specificity of actions, and follow-through on implementation. If your meetings are running too long, our guide to the 30-minute post-mortem shows how to keep reviews focused and productive.
Rigor Scoring: An Objective Quality Measurement
One of the most effective ways to improve post-mortem quality over time is to apply a rigor score: an objective measure of how thorough and actionable a post-mortem actually is. Rather than relying on subjective assessments like "this post-mortem felt good," rigor scoring evaluates specific dimensions (a minimal scoring sketch follows the list):
- Timeline completeness: Are there gaps in the timeline? A good post-mortem accounts for the full incident lifecycle from first signal to resolution, with no unexplained jumps.
- Root cause depth: Did the analysis get past the immediate trigger to the underlying systemic causes? "A bad config was deployed" is surface-level. "Our config change process lacks validation and rollback capabilities" is systemic.
- Action specificity: Are the action items measurable and verifiable? "Improve monitoring" is vague. "Add latency alerting threshold at p99 > 500ms on the payments service by March 15" is specific.
- Follow-through rate: Did the team actually complete the actions they committed to? A post-mortem with perfect analysis but 0% follow-through produces no value.
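Here is a minimal sketch of how such a score could be computed as a weighted average of the four dimensions. The 0-1 sub-scores and the weights are illustrative assumptions, not OutageReview's actual scoring model.

```python
# Minimal rigor-score sketch (the four dimensions mirror the list above;
# sub-scores and weights are illustrative assumptions only).
WEIGHTS = {
    "timeline_completeness": 0.25,
    "root_cause_depth": 0.30,
    "action_specificity": 0.25,
    "follow_through_rate": 0.20,
}


def rigor_score(scores: dict[str, float]) -> float:
    """Weighted average of the four dimensions, scaled to 0-100."""
    return 100 * sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


example = {
    "timeline_completeness": 0.9,  # few gaps between first signal and resolution
    "root_cause_depth": 0.6,       # stopped one level short of systemic causes
    "action_specificity": 0.8,     # most actions are measurable and dated
    "follow_through_rate": 0.25,   # only a quarter of actions completed
}
print(f"Rigor score: {rigor_score(example):.0f}/100")
```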
Tracking rigor scores across your post-mortems reveals trends: Are your reviews getting more thorough over time? Are certain teams consistently producing shallow analysis? Are action items actually getting done? OutageReview provides automated rigor scoring that evaluates each post-mortem against these dimensions, giving teams an objective baseline and a clear path to improvement.
Frequently Asked Questions
What is a blameless post-mortem?
A blameless post-mortem is an incident review conducted under the principle that individuals are not punished for honest mistakes. Instead of asking "who caused this?", the investigation focuses on "what systemic factors allowed this to happen?" This approach, popularized by Google's SRE practices, encourages honest information sharing and leads to more effective preventive actions because people aren't hiding information to protect themselves.
How soon after an incident should you do a post-mortem?
Ideally within 1-3 business days after the incident is resolved. This window balances two competing needs: memories are still fresh (details fade quickly after a week), but emotions have cooled enough for productive discussion. Start evidence preservation immediately—save logs, chat transcripts, and metrics before they expire—but schedule the actual meeting for 24-72 hours after resolution.
Who should attend a post-mortem meeting?
Include the incident responders (those who actively worked to resolve it), engineers who own the affected services, the on-call engineer, and a facilitator. Optionally invite affected stakeholders (product managers, customer support leads) for impact context. Keep the group under 8-10 people—larger groups make discussion difficult. People who weren't involved but want to learn can read the written post-mortem document afterward.
How long should a post-mortem meeting take?
30-60 minutes for a well-prepared post-mortem. If your meetings routinely exceed an hour, the preparation is likely insufficient—timeline reconstruction and evidence gathering should happen asynchronously before the meeting. The meeting itself should focus on analysis (why it happened) and action planning (what to do about it), not on reconstructing what happened. See our guide to the 30-minute post-mortem for tips on keeping meetings focused.
Key Takeaways
- Start within 24 hours while details are fresh
- Build a detailed timeline before analyzing causes
- Focus on systems and processes, not individual blame
- Track action items to completion, then verify they worked