
Incident Post-Mortems: A Step-by-Step Checklist

A complete checklist for running effective blameless post-mortems. From preparation to follow-up actions, ensure your team learns from every incident.

By OutageReview Team | December 10, 2025

A post-mortem done well transforms an incident from a painful experience into a learning opportunity. Done poorly, it becomes a blame session that teaches nothing and damages team trust. This checklist covers everything you need to run effective, blameless post-mortems that actually prevent future incidents.

What Makes a Good Post-Mortem?

An effective post-mortem answers three questions: What happened? Why did it happen? And what will we do to prevent it from happening again? But the format matters less than the culture. As Google's SRE Book emphasizes, a post-mortem in a blame-focused organization will always be shallow because people won't share honestly.

The goal isn't to find who was at fault. It's to understand what systemic factors allowed the incident to happen and what changes will make the system more resilient.

Blameless Doesn't Mean Unaccountable

Blameless culture means we don't punish individuals for honest mistakes. It doesn't mean nobody is responsible for fixing things. Every action item should have an owner. Every improvement should have accountability. We just don't blame people for the incidents that reveal the need for those improvements. For a comprehensive framework on building this culture, see PagerDuty's post-mortem guide.

Complete Post-Mortem Checklist

Phase 1: Immediate (Within 24 Hours)

  • Confirm the incident is fully resolved

    Don't start the post-mortem while still firefighting

  • Assign a post-mortem lead

    Someone to gather data, schedule the meeting, and drive the process

  • Preserve evidence

    Save logs, metrics, chat transcripts, and any temporary debugging artifacts before they expire or get cleaned up (a small preservation sketch follows this checklist)

  • Document immediate facts while fresh

    Ask responders to write brief notes about what they did and observed

  • Schedule the post-mortem meeting

    Aim for 1-3 days after resolution, while details are fresh but emotions have cooled
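
As a minimal sketch of the evidence-preservation step, assuming the artifacts you care about live on disk (the incident ID and paths below are hypothetical):

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def preserve_evidence(incident_id: str, artifact_paths: list[str]) -> Path:
    """Copy logs, exports, and debugging artifacts into a durable incident folder."""
    folder = Path("incidents") / f"{incident_id}-{datetime.now(timezone.utc):%Y%m%d}"
    folder.mkdir(parents=True, exist_ok=True)
    for raw_path in artifact_paths:
        path = Path(raw_path)
        if path.exists():  # skip anything that has already been rotated away
            shutil.copy(path, folder)
    return folder

# Hypothetical artifacts gathered right after resolution
preserve_evidence("inc-2041", ["/var/log/payments/app.log", "exports/incident-channel.json"])
```

Dashboards and metrics usually can't be copied as files; export a screenshot or a saved query instead and link it from the post-mortem document.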

Phase 2: Preparation (Before the Meeting)

  • Build the incident timeline

    Gather logs, metrics, and communications into a chronological sequence

  • Identify key timestamps

    When did the incident start? When was it detected? When was it resolved?

  • Calculate impact metrics

    Duration, affected users/requests, error rates, revenue impact if applicable

  • Identify attendees

    Include responders, relevant engineers, and affected stakeholders

  • Share pre-read materials

    Send the timeline and basic facts so people come prepared

  • Review previous similar incidents

    Check if this is a pattern or first occurrence

Timeline Building Tip

Building timelines from raw logs can take hours of tedious work. OutageReview lets you drag and drop events into a visual timeline, reducing hours of manual work to minutes. The key is to verify the extracted timeline and add context that only your team knows, like who made decisions and why.
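
If you are assembling the timeline by hand, the core of the work is merging events from different sources and sorting them by timestamp. A minimal sketch in Python, assuming each source can be reduced to simple timestamped records (the event data below is hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEvent:
    timestamp: datetime   # when the event happened (use UTC consistently)
    source: str           # e.g. "alerting", "deploy log", "incident channel"
    description: str      # what happened, in one line

def build_timeline(*event_sources):
    """Merge events from multiple sources into one chronological timeline."""
    merged = [event for source in event_sources for event in source]
    return sorted(merged, key=lambda e: e.timestamp)

# Hypothetical events from three sources
alerts = [TimelineEvent(datetime(2025, 12, 1, 14, 2), "alerting", "p99 latency alert fired")]
deploys = [TimelineEvent(datetime(2025, 12, 1, 13, 55), "deploy log", "payments v2.4.1 rolled out")]
chat = [TimelineEvent(datetime(2025, 12, 1, 14, 10), "incident channel", "on-call acknowledged, began rollback")]

for event in build_timeline(alerts, deploys, chat):
    print(f"{event.timestamp:%H:%M}  [{event.source}]  {event.description}")

# Key timestamps fall out of the sorted list: detection is the first alert,
# and duration is the resolution time minus the first customer-impacting event.
```

However you generate it, sanity-check the result against responders' memories; the ordering of deploys and alerts is often where the real insight hides.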

Phase 3: The Post-Mortem Meeting

  • Set the tone

    Remind everyone this is blameless; we're here to learn, not to punish

  • Review the timeline together

    Walk through what happened, filling in gaps and correcting errors

  • Conduct root cause analysis

    Use 5 Whys, Fishbone, or other RCA techniques to dig into causes (see the worked 5 Whys sketch after this checklist)

  • Identify multiple contributing factors

    Don't stop at one cause; complex incidents have multiple factors

  • Discuss what went well

    Acknowledge effective response actions; reinforce good practices

  • Brainstorm preventive actions

    What changes would prevent this incident or detect it faster?

  • Prioritize action items

    Not everything can be fixed immediately; focus on highest impact

  • Assign owners and deadlines

    Every action item needs a specific owner and due date
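
To make the root cause step concrete, here is a sketch of a 5 Whys chain, using the database-migration example from the anti-patterns section below. The answers are illustrative; the point is that each answer becomes the next question until you reach a systemic cause:

```python
# A 5 Whys chain for the database-outage example discussed later in this post.
# Each "why" digs one level deeper; the final answer is the systemic cause
# that should drive the action items.
five_whys = [
    ("Why did the database go down?",
     "A schema migration locked the primary table during peak traffic."),
    ("Why did the migration run during peak traffic?",
     "The deployment pipeline applies migrations whenever a deploy ships."),
    ("Why didn't anyone catch the risky migration?",
     "There is no automated safety check or review gate for migrations."),
    ("Why is there no safety check?",
     "Migration tooling was built quickly and never revisited."),
    ("Why was it never revisited?",
     "No team owns migration tooling, so improvements have no home."),
]

for i, (question, answer) in enumerate(five_whys, start=1):
    print(f"Why #{i}: {question}\n  -> {answer}")
```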

Phase 4: Documentation

  • Write a clear incident summary

    One paragraph that anyone can understand

  • Document the complete timeline

    Include timestamps, events, and decision points

  • Record root causes

    Both immediate and underlying systemic causes

  • Document impact

    Duration, affected services, user impact, business impact

  • List all action items

    With owners, due dates, and priority

  • Include lessons learned

    What should others know about this type of incident?

  • Link to relevant resources

    Dashboards, runbooks, code changes, related incidents

Phase 5: Follow-Through

  • Share the post-mortem widely

    Other teams can learn from your experience

  • Enter action items in your tracking system

    Treat them like any other work item

  • Schedule follow-up reviews

    Check that actions are completed on schedule (a minimal follow-through check is sketched after this checklist)

  • Verify action effectiveness

    Did the changes actually prevent recurrence?

  • Update runbooks and documentation

    Incorporate lessons learned into operational docs

  • Add to lessons learned library

    Make post-mortems searchable for future reference
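
As a sketch of what "treat them like any other work item" can look like in practice, here is a minimal follow-through check, assuming action items can be exported from your tracker with an owner, due date, and status (the field names and data are hypothetical):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due_date: date
    status: str  # e.g. "open", "in_progress", "done"

def follow_up_report(actions: list[ActionItem], today: date) -> None:
    """Flag overdue items and report completion rate for follow-up reviews."""
    done = [a for a in actions if a.status == "done"]
    overdue = [a for a in actions if a.status != "done" and a.due_date < today]

    print(f"Completion: {len(done)}/{len(actions)} action items done")
    for item in overdue:
        print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due_date})")

# Hypothetical action items from a past post-mortem
actions = [
    ActionItem("Add migration safety checks to deploy pipeline", "dana", date(2026, 1, 15), "in_progress"),
    ActionItem("Add p99 latency alert on payments service", "raj", date(2025, 12, 20), "done"),
]
follow_up_report(actions, today=date(2026, 1, 20))
```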

Post-Mortem Document Template

Post-Mortem: [Incident Title]

Date: [Date of incident]

Author: [Post-mortem lead]

Status: [Draft / Final / Actions Complete]


Summary

[1-2 paragraph summary: what happened, impact, and key takeaway]

Impact

  • Duration: [X hours/minutes]
  • Users affected: [number or percentage]
  • Services affected: [list]
  • Business impact: [revenue, SLA, etc.]

Timeline

[Detailed chronological timeline with timestamps]

Root Causes

[List of contributing factors at technical and systemic levels]

What Went Well

[Effective response actions]

What Could Be Improved

[Areas for improvement in detection, response, or prevention]

Action Items

Action | Owner | Due Date | Status
[Action description] | [Name] | [Date] | [Status]

Lessons Learned

[Key insights for the organization]

Common Post-Mortem Anti-Patterns

The Blame Game

If your post-mortem includes phrases like "Bob should have..." or "The team failed to...", you're doing it wrong. Reframe: What systemic factors allowed this to happen? What guardrails were missing?

Example: A team's post-mortem for a database outage focused entirely on the junior engineer who ran a migration without a backup. The real cause—no migration safety checks in the deployment pipeline—went unaddressed. The same type of failure happened again two months later.

The Checkbox Exercise

Post-mortems done because policy requires them, not because the team wants to learn, produce shallow analysis and ignored action items. If you're just going through the motions, you're wasting everyone's time.

Example: An organization requires post-mortems for all P1s. The template gets copy-pasted, the "root cause" field says "configuration error," the actions say "be more careful." Three months of post-mortems, zero meaningful improvements.

The Novel

A 20-page post-mortem that nobody reads helps nobody. Be thorough but concise. Focus on what matters for preventing recurrence.

Example: A 15-page post-mortem with full log dumps and every Slack message from the incident channel. Nobody reads past page two. The critical insight on page 12 is never acted on.

The Action Item Graveyard

Post-mortems that produce action items that never get done are theater. Track actions to completion or don't bother writing them down.

Example: A team's post-mortem produces 8 action items. Six months later, only 1 has been completed. The others were deprioritized in favor of feature work. When the same failure happens again, the team writes the same action items.

When to Do a Post-Mortem (and When to Skip)

Not every incident needs a full post-mortem meeting. Running a full review for every minor issue creates post-mortem fatigue: teams start treating every review as a chore, and the quality of analysis drops across the board. The key is matching the depth of your review to the severity and novelty of the incident.

Suggested Severity Thresholds

  • P1

    Service down, customer impact

    Full post-mortem with meeting, always. No exceptions. Document the timeline, conduct root cause analysis, assign action items, and schedule follow-up reviews.

  • P2

    Degraded service

    Post-mortem document required. Hold a meeting only if this is a novel failure mode the team hasn't seen before. If it's a well-understood type of failure, the written document is sufficient.

  • P3

    Minor issue, no customer impact

    A brief write-up is enough. No meeting needed. Capture what happened and any quick fixes, but don't burn an hour of the team's time on a non-event.

  • Repeat

    Recurring incidents regardless of severity

    Always post-mortem recurring issues, even if they're P3. Repeated incidents indicate systemic problems that won't go away on their own. If the same class of failure keeps happening, something is broken in the process, architecture, or tooling.

For lower-severity incidents that still warrant a review, consider using a lighter-weight format. Our guide to the 30-minute post-mortem provides a streamlined approach that captures the essential learnings without the overhead of a full post-mortem process.
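
If you would rather encode these thresholds than rely on a judgment call each time, a small policy function is enough. A sketch, assuming your severities map to the levels above and that recurrence is tracked separately:

```python
def required_review(severity: str, is_recurring: bool) -> str:
    """Map incident severity and recurrence to the depth of review required."""
    if is_recurring:
        return "full post-mortem with meeting"  # repeats signal systemic problems
    if severity == "P1":
        return "full post-mortem with meeting"
    if severity == "P2":
        return "post-mortem document; meeting only for novel failure modes"
    return "brief write-up, no meeting"  # P3 and below

print(required_review("P2", is_recurring=False))
print(required_review("P3", is_recurring=True))
```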

Measuring Post-Mortem Quality

How do you know if your post-mortems are effective? Consider tracking these incident management metrics:

  • Action completion rate: What percentage of actions get done?
  • Time to completion: How long do actions take to implement?
  • Recurrence rate: Do similar incidents happen again?
  • Investigation depth: Are you finding systemic causes or just immediate triggers?

Some teams implement quality scoring for post-mortems, measuring completeness of timelines, depth of root cause analysis, specificity of actions, and follow-through on implementation. If your meetings are running too long, our guide to the 30-minute post-mortem shows how to keep reviews focused and productive.
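
A rough sketch of how these metrics could be computed from post-mortem records, assuming each record carries its action items and a fingerprint for the class of failure (the field names and data are hypothetical):

```python
from collections import Counter

def quality_metrics(postmortems: list[dict]) -> dict:
    """Compute action completion rate and recurrence across post-mortems."""
    all_actions = [a for pm in postmortems for a in pm["actions"]]
    completed = sum(1 for a in all_actions if a["status"] == "done")

    # Recurrence: how many failure classes appear in more than one post-mortem
    failure_counts = Counter(pm["failure_class"] for pm in postmortems)
    recurring = sum(1 for count in failure_counts.values() if count > 1)

    return {
        "action_completion_rate": completed / len(all_actions) if all_actions else 0.0,
        "recurring_failure_classes": recurring,
        "total_postmortems": len(postmortems),
    }

# Hypothetical records
records = [
    {"failure_class": "db-migration-lock", "actions": [{"status": "done"}, {"status": "open"}]},
    {"failure_class": "db-migration-lock", "actions": [{"status": "done"}]},
]
print(quality_metrics(records))
```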

Rigor Scoring: An Objective Quality Measurement

One of the most effective ways to improve post-mortem quality over time is to apply a rigor score—an objective measurement of how thorough and actionable a post-mortem actually is. Rather than relying on subjective assessments like "this post-mortem felt good," rigor scoring evaluates specific dimensions:

  • Timeline completeness: Are there gaps in the timeline? A good post-mortem accounts for the full incident lifecycle from first signal to resolution, with no unexplained jumps.
  • Root cause depth: Did the analysis get past the immediate trigger to the underlying systemic causes? "A bad config was deployed" is surface-level. "Our config change process lacks validation and rollback capabilities" is systemic.
  • Action specificity: Are the action items measurable and verifiable? "Improve monitoring" is vague. "Add a p99 > 500ms latency alert on the payments service by March 15" is specific.
  • Follow-through rate: Did the team actually complete the actions they committed to? A post-mortem with perfect analysis but 0% follow-through produces no value.

Tracking rigor scores across your post-mortems reveals trends: Are your reviews getting more thorough over time? Are certain teams consistently producing shallow analysis? Are action items actually getting done? OutageReview provides automated rigor scoring that evaluates each post-mortem against these dimensions, giving teams an objective baseline and a clear path to improvement.
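
To show how the four dimensions can be combined into a single comparable number, here is a small scoring sketch. The weights and checks are illustrative assumptions, not OutageReview's actual scoring model:

```python
def rigor_score(pm: dict) -> float:
    """Combine the four rigor dimensions into a 0-100 score.

    Weights and checks are illustrative, not a reference implementation.
    """
    scores = {
        # Timeline completeness: fraction of the incident window covered by events
        "timeline": pm["timeline_coverage"],
        # Root cause depth: did analysis reach a systemic cause, not just a trigger?
        "root_cause": 1.0 if pm["has_systemic_cause"] else 0.3,
        # Action specificity: fraction of actions with an owner and a due date
        "actions": pm["actions_with_owner_and_date"] / max(pm["total_actions"], 1),
        # Follow-through: fraction of actions completed
        "follow_through": pm["actions_completed"] / max(pm["total_actions"], 1),
    }
    weights = {"timeline": 0.25, "root_cause": 0.3, "actions": 0.2, "follow_through": 0.25}
    return 100 * sum(weights[key] * scores[key] for key in weights)

print(rigor_score({
    "timeline_coverage": 0.9,
    "has_systemic_cause": True,
    "total_actions": 4,
    "actions_with_owner_and_date": 4,
    "actions_completed": 3,
}))
```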

Frequently Asked Questions

What is a blameless post-mortem?

A blameless post-mortem is an incident review conducted under the principle that individuals are not punished for honest mistakes. Instead of asking "who caused this?", the investigation focuses on "what systemic factors allowed this to happen?" This approach, popularized by Google's SRE practices, encourages honest information sharing and leads to more effective preventive actions because people aren't hiding information to protect themselves.

How soon after an incident should you do a post-mortem?

Ideally within 1-3 business days after the incident is resolved. This window balances two competing needs: memories are still fresh (details fade quickly after a week), but emotions have cooled enough for productive discussion. Start evidence preservation immediately—save logs, chat transcripts, and metrics before they expire—but schedule the actual meeting for 24-72 hours after resolution.

Who should attend a post-mortem meeting?

Include the incident responders (those who actively worked to resolve it), engineers who own the affected services, the on-call engineer, and a facilitator. Optionally invite affected stakeholders (product managers, customer support leads) for impact context. Keep the group under 8-10 people—larger groups make discussion difficult. People who weren't involved but want to learn can read the written post-mortem document afterward.

How long should a post-mortem meeting take?

30-60 minutes for a well-prepared post-mortem. If your meetings routinely exceed an hour, the preparation is likely insufficient—timeline reconstruction and evidence gathering should happen asynchronously before the meeting. The meeting itself should focus on analysis (why it happened) and action planning (what to do about it), not on reconstructing what happened. See our guide to the 30-minute post-mortem for tips on keeping meetings focused.

Key Takeaways

  • Start within 24 hours while details are fresh
  • Build a detailed timeline before analyzing causes
  • Focus on systems and processes, not individual blame
  • Track action items to completion, then verify they worked

Ready to improve your incident investigations?

OutageReview gives you 5 Whys, Fishbone diagrams, timeline tracking, and rigor scoring in one platform.

Start Your 14-Day Free Trial

No credit card required