Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all 766 current lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

7.99 US dollars for lifetime access globally, or 499 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the topics asked in system design interviews at FAANG and other top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. The material ranges from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, real production examples from Netflix, Google, Amazon, Uber, and Stripe, and lifetime access for a one-time payment of $7.99 instead of annual subscriptions costing 100 to 200 dollars per year.

What is the difference between MTTD and MTTR?

MTTD measures how long a failure stays unnoticed, starting at onset and ending at detection. MTTR measures how long it takes to fix the problem once it is known. MTTD is about visibility; MTTR is about repair speed. Total user-facing downtime includes both, so you need to lower each one separately.

What is a good MTTD value?

There is no universal number; it depends on the system. For critical user-facing services, mature teams aim to detect within seconds to a couple of minutes using symptom-based and synthetic monitoring. The honest target is detecting faster than your users do, so failures never get reported by a customer before your alerts fire.

How do you actually calculate MTTD?

For each incident, take the gap between the true failure start time (reconstructed from logs, metrics, and traces) and the timestamp of the first alert or report. Average those gaps across all incidents in a chosen window, such as the last 30 or 90 days. Incident tools record these timestamps so the average can be computed automatically.

Why can lowering MTTD too aggressively backfire?

Making alerts extremely sensitive to cut detection time generates more false positives. Too many false alarms cause alert fatigue, where on-call engineers start ignoring the pager, and a real failure ends up buried in noise. The right balance is the fastest detection you can achieve before false alarms erode trust in the alerts.

Does MTTD include the time to acknowledge an incident?

No. MTTD ends the moment the failure is detected, typically when the first alert fires. The time between detection and a human acknowledging it is MTTA (Mean Time To Acknowledge), a separate metric. Keeping them distinct shows whether your delay is in monitoring or in human response.

Advanced · Reliability & Resilience

MTTD

Mean Time To Detect: the average time between a failure occurring and being noticed. Shorter MTTD means better monitoring and alerting. You can't fix what you don't know is broken.

What is MTTD?

In short

MTTD (Mean Time To Detect) is the average time between when a failure starts and when your team or monitoring system first notices it. A lower MTTD means problems get caught faster, which shrinks the total time an incident hurts users.

What MTTD Actually Measures

MTTD is one of the four core incident metrics teams track, alongside MTTA (time to acknowledge), MTTR (time to repair or resolve), and MTBF (mean time between failures). It answers a single question: from the moment something broke, how long did it stay invisible?

The clock starts at the real onset of the problem, not when someone files a ticket, and it stops at first detection, typically when the first alert fires. If a database started returning errors at 02:14 and the first alert fired at 02:31, the detection time for that incident is 17 minutes. MTTD is the average of those detection times across many incidents over a window like 30 or 90 days.
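The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not any incident tool's API: the `incidents` list and the `mean_time_to_detect` helper are hypothetical names, the first record mirrors the 02:14 to 02:31 example, and the other two records are made up so there is something to average.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (failure onset, first detection).
# Only the first pair comes from the example in the text; the rest are invented.
incidents = [
    (datetime(2024, 5, 1, 2, 14), datetime(2024, 5, 1, 2, 31)),   # 17 minutes
    (datetime(2024, 5, 9, 13, 0), datetime(2024, 5, 9, 13, 4)),   # 4 minutes
    (datetime(2024, 5, 20, 21, 0), datetime(2024, 5, 20, 21, 9)), # 9 minutes
]


def mean_time_to_detect(records):
    """Average the onset-to-detection gap across a list of incidents."""
    if not records:
        raise ValueError("need at least one incident")
    gaps = [detected - started for started, detected in records]
    if any(gap < timedelta(0) for gap in gaps):
        raise ValueError("detection cannot precede onset")
    return sum(gaps, timedelta(0)) / len(gaps)


mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:10:00
```

In practice the hard input is the onset timestamp, which usually has to be reconstructed during the postmortem rather than read from a tool.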

It matters because detection time is pure dead time. While a failure goes unnoticed, no one is acknowledging it, no one is fixing it, and users are absorbing the damage. You cannot fix what you do not know is broken, so MTTD sets a floor on how good your incident response can ever be.

How Teams Measure and Shrink It

Calculating MTTD requires knowing the true start time of each incident, which is the hard part. Teams reconstruct it after the fact from logs, metric graphs, and traces, then compare it to the timestamp of the first alert or first human report. Tools like PagerDuty, Datadog, and Opsgenie record these timestamps automatically so the gap can be computed without guesswork.

The lever for lowering MTTD is observability: metrics, logs, and traces feeding alerts that fire on symptoms users feel, not just on raw resource numbers. An alert on a 5 percent jump in HTTP 500 rate or a p99 latency breach detects real pain in seconds. An alert that only watches CPU usage can miss an outage entirely if the broken service is sitting idle.
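As a rough sketch of what a symptom-based check looks at, the Python below evaluates one window of request data. Everything named here is an assumption for illustration: `symptom_alerts` and `p99` are hypothetical helpers, "a 5 percent jump" is interpreted as five percentage points above a known baseline, all 5xx responses are counted as errors, and the 800 ms latency limit is an arbitrary example value.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def symptom_alerts(status_codes, latencies_ms, baseline_error_rate,
                   max_error_jump=0.05, p99_limit_ms=800):
    """Return the names of the user-facing symptoms breached in this window."""
    alerts = []
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    error_rate = errors / len(status_codes)
    if error_rate - baseline_error_rate >= max_error_jump:
        alerts.append("error_rate")
    if p99(latencies_ms) > p99_limit_ms:
        alerts.append("p99_latency")
    return alerts


# One window of 100 requests: 8 server errors and two very slow responses.
codes = [200] * 92 + [500] * 8
latencies = [120] * 98 + [950] * 2
print(symptom_alerts(codes, latencies, baseline_error_rate=0.01))
# ['error_rate', 'p99_latency']
```

Note that neither check looks at CPU or memory. Both are computed from what requests actually experienced, which is why they still fire when the failing service is idle.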

Good signal-to-noise is just as important. If on-call engineers get hundreds of low-value alerts a day, they tune them out, and real failures sit in the noise for long stretches. Synthetic checks, anomaly detection, and SLO-based burn-rate alerts (popularized by Google's SRE practice) all push detection earlier without flooding the pager.
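To make the burn-rate idea concrete, here is a minimal Python sketch. The function names are hypothetical, and the numbers are assumptions: a 99.9 percent SLO and a paging threshold of 14.4, the figure commonly cited for a one-hour window on a 30-day SLO because it corresponds to spending 2 percent of the monthly error budget in one hour. Requiring a short window to agree with the long one is one common way to stop paging once the problem has already recovered.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target=0.999, threshold=14.4):
    """Page only if both the long and the short window are burning fast."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# 2% errors over the last hour and 3% over the last 5 minutes: still burning.
print(should_page(0.02, 0.03))    # True
# The hour looks bad but the last 5 minutes are healthy: already recovering.
print(should_page(0.02, 0.0005))  # False
# A steady 0.2% error rate is only about twice the budget rate: no page.
print(should_page(0.002, 0.002))  # False
```

The threshold is the sensitivity knob: lowering it detects slower burns earlier at the cost of more pages.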

The trade-off is sensitivity versus noise. Tighter, faster thresholds cut MTTD but raise false positives and alert fatigue. Looser thresholds keep the pager quiet but let slow-burning problems hide. The goal is the lowest MTTD you can hit before false alarms start eroding trust in the alerts.

When to Focus on MTTD

Look at MTTD when incident postmortems keep showing that the failure ran for a long time before anyone noticed. If repairs are fast once a problem is found but the discovery itself is slow, the fix is better monitoring, not a faster runbook.

MTTD is most valuable for user-facing services where minutes of silent failure translate directly into lost revenue or broken trust. For a payment API or a checkout flow, detecting an outage in 30 seconds versus 20 minutes is the difference between a blip and a headline.

It is less useful in isolation. A 10-second MTTD is worthless if it then takes 4 hours to resolve, so MTTD is read together with MTTA and MTTR to see where the real time is going. Chasing a lower MTTD also has diminishing returns once detection is faster than a human can reasonably act, at which point effort is better spent on automated remediation.
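One way to see where the time is going is to split a single incident into its phases. The sketch below is illustrative only: `phase_breakdown` is a hypothetical helper, the timestamps are invented to match the "10 seconds to detect, about 4 hours to resolve" scenario, and repair time is measured from acknowledgement, which is just one of several conventions teams use.

```python
from datetime import datetime, timedelta


def phase_breakdown(started, detected, acknowledged, resolved):
    """Split one incident into detect / acknowledge / repair durations."""
    return {
        "detect": detected - started,
        "acknowledge": acknowledged - detected,
        "repair": resolved - acknowledged,
    }


# Invented timeline: detected in 10 seconds, fully resolved 4 hours later.
phases = phase_breakdown(
    started=datetime(2024, 6, 3, 10, 0, 0),
    detected=datetime(2024, 6, 3, 10, 0, 10),
    acknowledged=datetime(2024, 6, 3, 10, 5, 10),
    resolved=datetime(2024, 6, 3, 14, 0, 10),
)

total = sum(phases.values(), timedelta(0))
slowest = max(phases, key=phases.get)
for name, duration in phases.items():
    print(f"{name:12s} {duration}  ({duration / total:.1%} of the incident)")
print("largest phase:", slowest)  # largest phase: repair
```

Here detection is a negligible share of the outage, so further monitoring work would not move the total; the repair phase is where the time is.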

A Concrete Example

Imagine a streaming service where a recommendation service starts timing out at 21:00 on a Friday. Without symptom-based monitoring, the team only finds out at 21:40 when support tickets pile up, giving a detection time of 40 minutes during peak traffic.

After the postmortem they add an SLO burn-rate alert on the recommendation API's error budget and a synthetic test that loads the homepage every 30 seconds. The next time the same failure happens, the synthetic check fails on the first run and the alert pages on-call within a minute. Detection time drops from 40 minutes to roughly 1 minute.
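A synthetic check like the one in this example is, at its core, a loop that probes on a fixed interval and pages on failure. The Python below is a minimal sketch under stated assumptions: `run_synthetic_check` is a hypothetical function, the `probe` and `page` callables are injected so that no real HTTP client or paging service is implied, and it pages on the very first failed run as in the scenario above.

```python
import time
from typing import Callable, Optional


def run_synthetic_check(probe: Callable[[], bool],
                        page: Callable[[str], None],
                        interval_s: float = 30.0,
                        max_runs: Optional[int] = None,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Probe repeatedly; page on the first failure.

    Returns the run number that failed, or 0 if max_runs passed cleanly.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runs += 1
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that crashes counts as a failed check
        if not healthy:
            page(f"synthetic check failed on run {runs}")
            return runs
        sleep(interval_s)
    return 0


# Demo with a fake probe: two healthy runs, then the homepage stops loading.
results = iter([True, True, False])
pages = []
failed_run = run_synthetic_check(lambda: next(results), pages.append,
                                 sleep=lambda seconds: None)
print(failed_run, pages)  # 3 ['synthetic check failed on run 3']
```

With a 30-second interval, a failure is noticed at most about one interval plus the probe's own timeout after it starts. Paging on a single failure keeps detection fast but is noisier; requiring two or three consecutive failures trades a little detection time for fewer false alarms.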

Nothing about the actual repair changed, but the total customer-facing outage shrank by nearly 39 minutes simply because the problem stopped hiding. That is the entire value of investing in detection.

Where it is used in production

PagerDuty

Timestamps incident start, acknowledgement, and resolution so teams can compute MTTD and the other incident metrics automatically.

Datadog

Detects failures through metric monitors, anomaly detection, and synthetic checks, and reports detection-time analytics across incidents.

Google SRE

Uses SLO error-budget burn-rate alerts to catch reliability problems early while keeping pager noise low, directly lowering MTTD.

Grafana

Visualizes metrics and fires alerts on symptom thresholds and anomalies, surfacing incidents before users report them.

Learn MTTD hands-on

This page explains the idea. The full lesson lets you step through interactive diagrams, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.

Lessons that touch on MTTD as part of a larger topic.

MTTD (Mean Time to Detect)
How long it takes to realize something is broken, the gap between failure and awareness
intermediate · observability monitoring