MTTD
Mean Time To Detect: the average time between a failure occurring and being noticed. Shorter MTTD means better monitoring and alerting. You can't fix what you don't know is broken.
What is MTTD?
MTTD is an advanced concept in the Reliability & Resilience area of system design. Engineers reach for it whenever they need to reason about real-world trade-offs in that space. It matters beyond textbook correctness: production systems at companies like Netflix, Amazon, and Google face these decisions every day.
If you want to go deeper than this definition, with diagrams, code, and a quiz to lock it in, work through the "MTTD" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how the giants actually use it in production.
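As a rough illustration of the definition, MTTD can be computed by averaging the gap between when each failure began and when it was noticed. The sketch below assumes a hypothetical incident log of (occurred, detected) timestamp pairs; the function name and the sample data are invented for illustration and do not come from any particular monitoring tool.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average gap between when each failure began and when it was detected.

    `incidents` is a list of (occurred_at, detected_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    # Sum the per-incident detection delays, starting from a zero timedelta.
    total = sum(
        (detected - occurred for occurred, detected in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (failure started, failure detected)
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4)),      # 4 minutes
    (datetime(2024, 2, 11, 22, 30), datetime(2024, 2, 11, 22, 45)),  # 15 minutes
    (datetime(2024, 3, 2, 3, 15), datetime(2024, 3, 2, 3, 26)),      # 11 minutes
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:10:00
```

In practice the hard part is usually establishing when the failure actually began, which is often reconstructed after the fact from logs or metrics rather than recorded at the time.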
Learn MTTD in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the MTTD lesson
Related lessons
Lessons that touch on MTTD as part of a larger topic.
See also
Related glossary terms you might want to look up next.
MTTR
Mean Time To Recovery: the average time from when a failure is detected to when the service is restored. A key reliability metric that drives investment in automation and runbooks.
Alerting
Automatically notifying engineers when metrics cross predefined thresholds. Good alerts are actionable, not noisy. PagerDuty and Opsgenie route alerts to the right on-call person.
Observability
The ability to understand a system's internal state from its external outputs. Built on three pillars: metrics, logs, and traces.