Postmortems: What to Do After an Incident | Operations And Monitoring Field Guide

The outage is over. Systems are restored. Everyone is tired.

Now what?

Most SMBs move on. The incident is closed. Life returns to normal until the next incident.

That's the expensive approach. The smart approach is to learn from what happened.

What this solves

Prevents recurrence. The same bug that caused this outage will cause another if you don't fix it.

Improves processes. Your incident response was probably messy. Documenting what worked and what didn't makes the next one smoother.

Creates institutional knowledge. When the next incident happens, you won't have to figure everything out from scratch.

Builds accountability. Postmortems create records. People know what happened, who owns each fix, and what they committed to fixing.

Supports compliance. Some cyber insurance policies and compliance frameworks require documented incident reviews.

What can go wrong

No postmortem. Incidents get fixed and forgotten. The same problems recur.

Blame-focused postmortems. "Who did this?" instead of "What happened?" Blame makes people hide problems.

Action items that never happen. You find problems, write them down, and nothing gets fixed. Postmortems become theater.

Waiting too long. After a few weeks, people forget details. Hold the postmortem within five business days.

Not including the right people. The person whose action triggered the incident is often the best person to help prevent the next one. Include them.

How to run a useful postmortem

Timing

Schedule the postmortem within 5 business days of incident resolution, while people still remember the details and the logs are still available.
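The deadline is easy to compute when the incident is closed. The sketch below counts forward five working days from the resolution date. The function name is hypothetical, and it assumes a Monday-to-Friday work week with no public holidays, so adjust it for your own calendar.

```python
from datetime import date, timedelta


def postmortem_deadline(resolved_on: date, business_days: int = 5) -> date:
    """Return the date `business_days` working days after resolution.

    Assumption: Monday to Friday are working days; public holidays are ignored.
    """
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


# Illustrative date: an incident resolved on Friday, 8 March 2024.
print(postmortem_deadline(date(2024, 3, 8)))
```

Putting the resulting date on the calendar the day the incident closes makes it harder for the postmortem to slip.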

Attendees

Incident commander
People who responded
People who were affected
Decision makers who need to approve follow-up work
Not just managers and executives: include the people who were in the trenches

Format: blameless, factual, forward-looking

Blameless. The goal is to find system failures, not individual failures. People make mistakes. Systems should prevent or limit those mistakes.

Factual. Base discussion on evidence: logs, timestamps, documented events. Not memory or assumption.

Forward-looking. What can we do to prevent this? What can we do to respond faster? Focus on actions, not blame.

Document the timeline

When did the incident start?
When was it detected?
When was it reported?
When did response begin?
What actions were taken and when?
When was it resolved?
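Once those timestamps are written down, the gaps between them show where the time went. The following is a minimal sketch that assumes timestamps are recorded in ISO 8601 form. The function name, metric names, and example times are illustrative, not taken from a real incident, and the "reported" milestone is left out for brevity.

```python
from datetime import datetime, timedelta


def timeline_metrics(
    started: str, detected: str, response_began: str, resolved: str
) -> dict[str, timedelta]:
    """Compute the gaps between timeline milestones from ISO 8601 timestamps."""
    t_start, t_detect, t_respond, t_resolve = (
        datetime.fromisoformat(ts)
        for ts in (started, detected, response_began, resolved)
    )
    # Milestones that run backwards usually mean a transcription error.
    if not t_start <= t_detect <= t_respond <= t_resolve:
        raise ValueError("timeline milestones are out of order; check the timestamps")
    return {
        "time_to_detect": t_detect - t_start,
        "time_to_respond": t_respond - t_detect,
        "time_to_resolve": t_resolve - t_start,
    }


# Illustrative timestamps only.
metrics = timeline_metrics(
    started="2024-03-06T09:00",
    detected="2024-03-06T09:40",
    response_began="2024-03-06T09:55",
    resolved="2024-03-06T11:30",
)
for name, duration in metrics.items():
    print(f"{name}: {duration}")
```

A long time to detect points at monitoring. A long time to respond points at escalation and on-call coverage.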

Identify root cause

Why did this happen? "The server crashed" is what happened, not why. Why did the server crash?

Common root causes:

Missing monitoring: nobody knew until customers complained
Missing or incomplete backups: the restore took longer than it should have
Undocumented changes: someone changed something and didn't tell anyone
No rollback plan: the update broke things and nobody could undo it
Single point of failure: one component took down everything

Identify contributing factors

What made this worse than it needed to be?

Slow detection time
No clear escalation path
Missing documentation
Communication gaps

Create action items

For each problem identified, create an action item:

What needs to be done?
Who's responsible?
By when?
How will we verify it's complete?

Follow through

Review action items at the next monthly ops review. If they're not done, find out why. Make someone accountable.
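If action items live in a spreadsheet or a ticket system, the review can start from a list of what is overdue. Here is a minimal sketch assuming each item records the four answers above (what, who, by when, how verified). The class, field names, and sample items are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    action: str        # what needs to be done
    owner: str         # who's responsible
    due: date          # by when
    verification: str  # how we will verify it's complete
    done: bool = False


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return items that are past their due date and not yet completed."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical items for illustration only.
items = [
    ActionItem("Add disk-space alert on the database host", "Priya",
               date(2024, 3, 20), "Alert fires in a test run"),
    ActionItem("Write rollback steps for app updates", "Sam",
               date(2024, 4, 10), "Rollback rehearsed in staging"),
    ActionItem("Document the escalation path", "Lee",
               date(2024, 3, 15), "Runbook reviewed by on-call", done=True),
]

for item in overdue_items(items, today=date(2024, 4, 1)):
    print(f"OVERDUE: {item.action} (owner: {item.owner}, due {item.due})")
```

Every item has exactly one owner and one due date, so the review has a specific person to ask about each overdue item.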

Postmortem template

Incident: [Name]
Date: [Date of incident]
Duration: [How long]
Impact: [What was affected, customers, revenue, etc.]
Severity: [Critical / High / Medium / Low]

Timeline:
- HH:MM - What happened

Root Cause:
- Why did this happen?

Contributing Factors:
- What made it worse?

What Worked:
- What response was effective?

Action Items:
1. [Action] - Owner: [Name] - Due: [Date] - Verify: [How we confirm it's done]
2. [Action] - Owner: [Name] - Due: [Date] - Verify: [How we confirm it's done]

When to skip the postmortem

Minor incidents that caused no impact and were resolved in minutes don't need formal postmortems. Use judgment. A 30-second blip that nobody noticed doesn't need a 2-hour meeting.

But any incident that:

Caused customer-visible downtime
Required emergency response
Took more than 30 minutes to resolve
Cost the business money

...deserves a postmortem.
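The criteria above can be written as a simple check, which is useful if incidents are already logged in a tracker. This is a sketch only. The `Incident` fields are hypothetical names for the four criteria, and the 30-minute threshold is the one stated in the list.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Incident:
    customer_visible_downtime: bool
    required_emergency_response: bool
    time_to_resolve: timedelta
    cost_money: bool


def needs_postmortem(incident: Incident) -> bool:
    """Return True if the incident meets any one of the four criteria."""
    return (
        incident.customer_visible_downtime
        or incident.required_emergency_response
        or incident.time_to_resolve > timedelta(minutes=30)
        or incident.cost_money
    )


# A 30-second blip nobody noticed versus a 45-minute customer-facing outage.
blip = Incident(False, False, timedelta(seconds=30), False)
outage = Incident(True, True, timedelta(minutes=45), True)
print(needs_postmortem(blip))    # False
print(needs_postmortem(outage))  # True
```

Any single criterion is enough to trigger a postmortem. Judgment still applies to borderline cases the check does not capture.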

What it costs (honest ranges)

Time only. A one-hour meeting, an hour of documentation, and perhaps 2-3 hours of follow-up work.

Cost of skipping the postmortem. Future incidents that could have been prevented.

The goal isn't to assign blame. It's to make your systems and processes better. Every incident is a gift from your past self, warning you about problems that need fixing. Listen to it.