MTTR
Mean Time To Recovery: the average time from when a failure is detected to when the service is restored. A key reliability metric that drives investment in automation and runbooks.
What is MTTR?
MTTR is an advanced concept in the Reliability & Resilience area of system design. Engineers reach for it whenever they need to reason about real-world trade-offs in that space, not just for textbook correctness: production systems at companies like Netflix, Amazon, and Google make these decisions every day.
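As a rough illustration of the definition, MTTR comes down to averaging the gap between detection and restoration across incidents. The sketch below assumes a hypothetical list of (detected_at, restored_at) timestamp pairs; the sample data and the function name mean_time_to_recovery are made up for illustration and do not come from any particular monitoring tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected_at, restored_at) for each failure.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),     # 30 min
    (datetime(2024, 2, 11, 22, 15), datetime(2024, 2, 11, 23, 45)),  # 90 min
    (datetime(2024, 3, 2, 4, 0), datetime(2024, 3, 2, 5, 0)),        # 60 min
]


def mean_time_to_recovery(records):
    """Average of (restored_at - detected_at) over all incidents."""
    if not records:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (restored - detected for detected, restored in records),
        timedelta(),  # start value so the durations add up as timedeltas
    )
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:00:00
```

Measuring from detection rather than from the moment the failure began keeps this number separate from detection time, which MTTD tracks.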
If you want to go deeper than this definition, with diagrams, code, and a quiz to lock it in, work through the "MTTR" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how large companies actually use it in production.
Learn MTTR in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the MTTR lesson
Related lessons
Lessons that touch on MTTR as part of a larger topic.
See also
Related glossary terms you might want to look up next.
MTTD
Mean Time To Detect: the average time between a failure occurring and being noticed. Shorter MTTD means better monitoring and alerting. You can't fix what you don't know is broken.
SLO
Service Level Objective: a target value for an SLI, like '99.9% of requests under 300ms.' The internal engineering goal that drives reliability investment.
Incident Response
The structured process for detecting, containing, eradicating, and recovering from security incidents. Includes communication plans, runbooks, and post-incident reviews.