Incident Response
The structured process for detecting, containing, eradicating, and recovering from security incidents. Includes communication plans, runbooks, and post-incident reviews.
What is Incident Response?
Incident Response is an advanced concept in the Security Testing & Operations area of system design. Engineers reach for it when they need to reason about real-world trade-offs in that space, not only for textbook correctness but because production systems at companies like Netflix, Amazon, and Google face these decisions every day.
If you want to go deeper than this definition, with diagrams, code, and a quiz to lock it in, work through the "Incident Response" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how large companies use it in production.
Learn Incident Response in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the Incident Response lesson
Related lessons
Lessons that touch on Incident Response as part of a larger topic.
Incident Response
Structured response when things go wrong: roles, communication, and resolution steps
advanced · reliability resilience
Incident Response
Structured processes for detecting, containing, eradicating, and recovering from security incidents
intermediate · security architecture
MTTD (Mean Time to Detect)
How long it takes to realize something is broken: the gap between failure and awareness
intermediate · observability monitoring
MTTA (Mean Time to Acknowledge)
How long until a human says 'I'm on it': the gap between alert and action
intermediate · observability monitoring
MTTR (Mean Time to Resolve)
How long it takes to fix the problem, from acknowledgment to all-clear
intermediate · observability monitoring
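The three timing metrics above can be computed from incident timestamps. The following is a minimal sketch, not a definitive implementation: the `Incident` record, its field names, and the sample data are hypothetical, and the intervals follow the definitions given here (MTTR measured from acknowledgment to all-clear). Some teams measure MTTR from the start of the failure or from detection instead, so treat the boundaries as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for illustration.
    failed_at: datetime        # when the failure actually began
    detected_at: datetime      # when monitoring raised the alert
    acknowledged_at: datetime  # when a responder took ownership
    resolved_at: datetime      # when the all-clear was given


def mean_minutes(deltas):
    """Average an iterable of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def response_metrics(incidents):
    """Return MTTD, MTTA, and MTTR in minutes, per the definitions above."""
    return {
        # failure -> awareness
        "mttd": mean_minutes(i.detected_at - i.failed_at for i in incidents),
        # alert -> a human says "I'm on it"
        "mtta": mean_minutes(i.acknowledged_at - i.detected_at for i in incidents),
        # acknowledgment -> all-clear
        "mttr": mean_minutes(i.resolved_at - i.acknowledged_at for i in incidents),
    }


# Made-up sample data: two incidents that both start at noon.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=6),
             t0 + timedelta(minutes=36)),
    Incident(t0, t0 + timedelta(minutes=8), t0 + timedelta(minutes=14),
             t0 + timedelta(minutes=64)),
]

print(response_metrics(incidents))
# prints {'mttd': 6.0, 'mtta': 4.0, 'mttr': 40.0}
```

Keeping the three intervals separate shows where response time is actually spent: slow detection points at monitoring gaps, slow acknowledgment at on-call routing, and slow resolution at runbooks or tooling.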
See also
Related glossary terms you might want to look up next.
Alerting
Automatically notifying engineers when metrics cross predefined thresholds. Good alerts are actionable, not noisy. PagerDuty and Opsgenie route alerts to the right on-call person.
Observability
The ability to understand a system's internal state from its external outputs. Built on three pillars: metrics, logs, and traces.
Postmortem
A blameless analysis conducted after an incident to document what happened, why, and how to prevent it from recurring. The most important output is the list of action items.