Postmortem
A blameless analysis conducted after an incident to document what happened, why, and how to prevent it from recurring. The most important output is the list of action items.
What is Postmortem?
A blameless analysis conducted after an incident to document what happened, why, and how to prevent it from recurring. The most important output is the list of action items.
Postmortem is a advanced concept that sits in the Reliability & Resilience area of system design. Engineers reach for it whenever they need to reason about real-world trade-offs in that space — not just for textbook correctness, but because real production systems at companies like Netflix, Amazon, and Google make these decisions every day.
If you want to go deeper than this definition — with diagrams, code, and a quiz to lock it in — work through the "Postmortem" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how the giants actually use it in production.
Learn Postmortem in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the Postmortem lessonRelated lessons
Lessons that touch on Postmortem as part of a larger topic.
See also
Related glossary terms you might want to look up next.
Incident Response
The structured process for detecting, containing, eradicating, and recovering from security incidents. Includes communication plans, runbooks, and post-incident reviews.
MTTR
Mean Time To Recovery: the average time from when a failure is detected to when the service is restored. A key reliability metric that drives investment in automation and runbooks.
SLO
Service Level Objective: a target value for an SLI, like '99.9% of requests under 300ms.' The internal engineering goal that drives reliability investment.