A post-mortem done well transforms an incident from a painful experience into a learning opportunity. Done poorly, it becomes a blame session that teaches nothing and damages team trust. This checklist walks through the five phases of a blameless post-mortem, from the first 24 hours after resolution to follow-through on action items, so that each incident leads to changes that reduce the chance of a repeat.
What Makes a Good Post-Mortem?
An effective post-mortem answers three questions: What happened? Why did it happen? And what will we do to prevent it from happening again? But the format matters less than the culture. A post-mortem in a blame-focused organization will always be shallow because people won't share honestly.
The goal isn't to find who was at fault. It's to understand what systemic factors allowed the incident to happen and what changes will make the system more resilient.
Blameless Doesn't Mean Unaccountable
Blameless culture means we don't punish individuals for honest mistakes. It doesn't mean nobody is responsible for fixing things. Every action item should have an owner. Every improvement should have accountability. We just don't blame people for the incidents that reveal the need for those improvements.
Complete Post-Mortem Checklist
Phase 1: Immediate (Within 24 Hours)
Confirm the incident is fully resolved
Don't start the post-mortem while still firefighting
Assign a post-mortem lead
Someone to gather data, schedule the meeting, and drive the process
Preserve evidence
Save logs, metrics, chat transcripts, and any temporary debugging artifacts before they expire or get cleaned up
Document immediate facts while fresh
Ask responders to write brief notes about what they did and observed
Schedule the post-mortem meeting
Aim for 1-3 days after resolution, while details are fresh but emotions have cooled
Phase 2: Preparation (Before the Meeting)
Build the incident timeline
Gather logs, metrics, and communications into a chronological sequence
Identify key timestamps
When did the incident start? When was it detected? When was it resolved?
Calculate impact metrics
Duration, affected users/requests, error rates, revenue impact if applicable
Identify attendees
Include responders, relevant engineers, and affected stakeholders
Share pre-read materials
Send the timeline and basic facts so people come prepared
Review previous similar incidents
Check if this is a pattern or first occurrence
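The key timestamps and impact metrics from the checklist above reduce to simple arithmetic once the three timestamps are known. The following is a minimal sketch, not a prescribed method: the `IncidentWindow` class, the `error_rate` helper, and all example values are hypothetical, and it assumes every timestamp is recorded in the same timezone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentWindow:
    """Key timestamps for one incident (hypothetical structure, same timezone assumed)."""
    started: datetime    # when the fault began
    detected: datetime   # when an alert or a person noticed it
    resolved: datetime   # when service was confirmed healthy

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.detected

    @property
    def duration(self) -> timedelta:
        return self.resolved - self.started


def error_rate(failed_requests: int, total_requests: int) -> float:
    """Fraction of requests that failed during the incident window."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


# Made-up example values, not taken from a real incident.
window = IncidentWindow(
    started=datetime(2024, 3, 5, 14, 2),
    detected=datetime(2024, 3, 5, 14, 17),
    resolved=datetime(2024, 3, 5, 15, 32),
)
print(window.time_to_detect)   # 0:15:00
print(window.time_to_resolve)  # 1:15:00
print(window.duration)         # 1:30:00
print(f"{error_rate(1_200, 48_000):.1%}")  # 2.5%
```

Separating time to detect from time to resolve is a deliberate choice here: a long detection gap points toward monitoring improvements, while a long resolution gap points toward runbooks or tooling.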
Timeline Building Tip
Building timelines from raw logs can take hours of tedious work. Consider using tools that can parse logs and extract events automatically. The key is to verify the extracted timeline and add context that only your team knows, like who made decisions and why.
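As a rough illustration of automated extraction, the sketch below merges events from several sources into one chronological list. It assumes a simple line format of `<ISO-8601 timestamp> <LEVEL> <message>`, which real logs may not follow; the `Event` type, the function names, and the sample lines are all hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: datetime
    source: str
    message: str


# Assumed line format: "<ISO-8601 timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\S+)\s+(?:ERROR|WARN|INFO)\s+(.*)$")


def extract_events(lines, source):
    """Parse the lines that match the assumed format; silently skip the rest."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        try:
            ts = datetime.fromisoformat(match.group(1))
        except ValueError:
            continue  # first token was not a valid timestamp
        events.append(Event(ts, source, match.group(2)))
    return events


def build_timeline(*event_lists):
    """Merge events from all sources and sort them chronologically."""
    merged = [event for events in event_lists for event in events]
    return sorted(merged, key=lambda event: event.timestamp)


# Made-up sample data for illustration only.
api_log = [
    "2024-03-05T14:02:11 ERROR upstream timeout on /checkout",
    "garbage line",
    "2024-03-05T14:20:40 INFO rollback started",
]
chat_log = ["2024-03-05T14:17:03 INFO on-call paged"]

timeline = build_timeline(
    extract_events(api_log, "api"),
    extract_events(chat_log, "chat"),
)
for event in timeline:
    print(event.timestamp.isoformat(), f"[{event.source}]", event.message)
```

The output is only a draft timeline. Skipped lines and missing context, such as who made a decision and why, still need a human pass.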
Phase 3: The Post-Mortem Meeting
Set the tone
Remind everyone this is blameless; we're here to learn, not to punish
Review the timeline together
Walk through what happened, filling in gaps and correcting errors
Conduct root cause analysis
Use 5 Whys, Fishbone, or other techniques to dig into causes
Identify multiple contributing factors
Don't stop at one cause; complex incidents have multiple factors
Discuss what went well
Acknowledge effective response actions; reinforce good practices
Brainstorm preventive actions
What changes would prevent this incident or detect it faster?
Prioritize action items
Not everything can be fixed immediately; focus on highest impact
Assign owners and deadlines
Every action item needs a specific owner and due date
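The rule that every action item needs an owner and a due date is easy to check mechanically before the meeting ends. This is one possible sketch, assuming a hypothetical `ActionItem` record and a numeric priority where 1 means highest impact; none of these names come from a particular tracking tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical action item record; priority 1 = highest impact."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    priority: int = 3


def missing_fields(item):
    """Return the names of required fields that are absent or blank."""
    problems = []
    if not item.owner or not item.owner.strip():
        problems.append("owner")
    if item.due is None:
        problems.append("due date")
    return problems


# Made-up example items.
items = [
    ActionItem("Add alert on certificate expiry", owner="dana",
               due=date(2024, 3, 20), priority=1),
    ActionItem("Document rollback steps", owner="",
               due=date(2024, 4, 1), priority=2),
    ActionItem("Review retry policy", priority=1),
]

# List highest-priority items first and flag anything incomplete.
for item in sorted(items, key=lambda i: i.priority):
    gaps = missing_fields(item)
    status = "ok" if not gaps else "missing " + ", ".join(gaps)
    print(f"P{item.priority} {item.description}: {status}")
```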
Phase 4: Documentation
Write a clear incident summary
One paragraph that anyone can understand
Document the complete timeline
Include timestamps, events, and decision points
Record root causes
Both immediate and underlying systemic causes
Document impact
Duration, affected services, user impact, business impact
List all action items
With owners, due dates, and priority
Include lessons learned
What should others know about this type of incident?
Link to relevant resources
Dashboards, runbooks, code changes, related incidents
Phase 5: Follow-Through
Share the post-mortem widely
Other teams can learn from your experience
Enter action items in your tracking system
Treat them like any other work item
Schedule follow-up reviews
Check that actions are completed on schedule
Verify action effectiveness
Did the changes actually prevent recurrence?
Update runbooks and documentation
Incorporate lessons learned into operational docs
Add to lessons learned library
Make post-mortems searchable for future reference
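For follow-up reviews, a short script can surface open actions that have slipped past their due date. The sketch below assumes action items can be exported from your tracking system into a simple record; the `TrackedAction` class and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedAction:
    """Hypothetical export of one action item from a tracking system."""
    description: str
    owner: str
    due: date
    done: bool = False


def overdue_actions(actions, today):
    """Open actions whose due date has passed, most overdue first."""
    late = [a for a in actions if not a.done and a.due < today]
    return sorted(late, key=lambda a: a.due)


# Made-up example data.
today = date(2024, 4, 10)
actions = [
    TrackedAction("Add alert on certificate expiry", "dana", date(2024, 3, 20), done=True),
    TrackedAction("Document rollback steps", "lee", date(2024, 4, 1)),
    TrackedAction("Review retry policy", "sam", date(2024, 4, 30)),
    TrackedAction("Rotate stale credentials", "lee", date(2024, 3, 25)),
]

late = overdue_actions(actions, today)
for action in late:
    days = (today - action.due).days
    print(f"{action.description} ({action.owner}): {days} days overdue")
# Rotate stale credentials (lee): 16 days overdue
# Document rollback steps (lee): 9 days overdue
```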
Post-Mortem Document Template
Post-Mortem: [Incident Title]
Date
[Date of incident]
Author
[Post-mortem lead]
Status
[Draft / Final / Actions Complete]
Summary
[1-2 paragraph summary: what happened, impact, and key takeaway]
Impact
- Duration: [X hours/minutes]
- Users affected: [number or percentage]
- Services affected: [list]
- Business impact: [revenue, SLA, etc.]
Timeline
[Detailed chronological timeline with timestamps]
Root Causes
[List of contributing factors at technical and systemic levels]
What Went Well
[Effective response actions]
What Could Be Improved
[Areas for improvement in detection, response, or prevention]
Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
| [Action description] | [Name] | [Date] | [Status] |
Lessons Learned
[Key insights for the organization]
Common Post-Mortem Anti-Patterns
The Blame Game
If your post-mortem includes phrases like "Bob should have..." or "The team failed to...", you're doing it wrong. Reframe: What systemic factors allowed this to happen? What guardrails were missing?
The Checkbox Exercise
Post-mortems done because policy requires them, not because the team wants to learn, produce shallow analysis and action items that get ignored. If you're just going through the motions, you're wasting everyone's time.
The Novel
A 20-page post-mortem that nobody reads helps nobody. Be thorough but concise. Focus on what matters for preventing recurrence.
The Action Item Graveyard
Post-mortems that produce action items that never get done are theater. Track actions to completion or don't bother writing them down.
Measuring Post-Mortem Quality
How do you know if your post-mortems are effective? Consider tracking:
- Action completion rate: What percentage of actions get done?
- Time to completion: How long do actions take to implement?
- Recurrence rate: Do similar incidents happen again?
- Investigation depth: Are you finding systemic causes or just immediate triggers?
Some teams implement quality scoring for post-mortems, measuring completeness of timelines, depth of root cause analysis, specificity of actions, and follow-through on implementation.
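The first three metrics above can be computed from basic records. The following sketch is one way to do it under stated assumptions: `ActionRecord` is a hypothetical structure, and recurrence is approximated by counting incidents whose root-cause label already appeared earlier in the list, which only works if your team labels causes consistently.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Optional


@dataclass
class ActionRecord:
    """Hypothetical record of when an action was opened and closed."""
    opened: date
    closed: Optional[date] = None  # None means still open


def completion_rate(records):
    """Fraction of action items that have been closed."""
    if not records:
        return 0.0
    return sum(r.closed is not None for r in records) / len(records)


def mean_days_to_complete(records):
    """Average days from open to close, over closed items only."""
    durations = [(r.closed - r.opened).days for r in records if r.closed is not None]
    return mean(durations) if durations else None


def recurrence_rate(incident_causes):
    """Fraction of incidents whose cause label was seen in an earlier incident."""
    if not incident_causes:
        return 0.0
    seen = set()
    repeats = 0
    for cause in incident_causes:
        if cause in seen:
            repeats += 1
        seen.add(cause)
    return repeats / len(incident_causes)


# Made-up example data.
records = [
    ActionRecord(date(2024, 3, 6), date(2024, 3, 20)),
    ActionRecord(date(2024, 3, 6), date(2024, 4, 5)),
    ActionRecord(date(2024, 3, 6)),
    ActionRecord(date(2024, 3, 6), date(2024, 3, 10)),
]
causes = ["expired-cert", "bad-deploy", "expired-cert", "disk-full"]

print(f"{completion_rate(records):.0%}")        # 75%
print(f"{mean_days_to_complete(records):.1f}")  # 16.0
print(f"{recurrence_rate(causes):.0%}")         # 25%
```

Investigation depth is harder to quantify and is left out here; it usually needs a human reviewer rather than a formula.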
Key Takeaways
- Start the process within 24 hours of resolution, and hold the meeting 1-3 days after resolution while details are fresh
- Build a detailed timeline before analyzing causes
- Focus on systems and processes, not individual blame
- Track action items to completion, then verify they worked