Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all current 766 lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

7.99 US dollars for lifetime access globally, or 499 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the topics commonly asked in system design interviews at FAANG and top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. The lessons cover everything from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, real production examples from Netflix, Google, Amazon, Uber, and Stripe, and lifetime access for a one-time payment of $7.99 instead of annual subscriptions costing 100 to 200 dollars per year.

What is the difference between monitoring and alerting?

Monitoring is collecting and graphing data about your system over time. Alerting is the decision layer on top: it watches that data and notifies a human only when a specific condition is bad enough to need action right now.

What makes a good alert?

It is actionable, urgent, and tied to something users feel. A good alert means a human needs to do something immediately, points to a runbook, and rarely fires without a real problem behind it. If it fires and nobody acts, it should be a dashboard or ticket, not a page.

What is alert fatigue and how do you fix it?

Alert fatigue is when so many noisy or non-actionable alerts fire that engineers start ignoring the pager and miss real incidents. Fix it by deleting alerts that never lead to action, adding a duration so single spikes do not page, grouping related alerts, and alerting on symptoms instead of every internal metric.

Should I alert on CPU usage or on user-facing errors?

Prefer user-facing symptoms like error rate and latency. High CPU is often normal and not directly actionable, while a sustained spike in failed requests almost always matters. Alert on what users experience, and use resource metrics for debugging once you are already investigating.

What is burn-rate alerting?

Burn-rate alerting fires based on how fast you are consuming your error budget rather than a fixed threshold. If you are spending a month of allowed errors in an hour, page immediately; if you are spending it slowly, open a ticket instead. It reduces both false pages and missed slow-burning outages.

Intermediate · Observability & Monitoring

Alerting

Automatically notifying engineers when metrics cross predefined thresholds. Good alerts are actionable, not noisy. PagerDuty and Opsgenie route alerts to the right on-call person.

What is Alerting?

In short

Alerting is the part of a monitoring system that automatically notifies on-call engineers when a metric crosses a defined threshold or a condition holds true for some duration, so problems get attention before users complain. A good alert fires only when a human needs to take action right now, and it routes through a tool like PagerDuty or Opsgenie to the right person.

What alerting actually is

Alerting is the layer that sits on top of your metrics and logs and decides when to wake somebody up. You collect numbers like error rate, request latency, queue depth, or disk usage. An alerting rule watches those numbers and fires a notification when a condition is met, for example error rate above 1 percent for 5 minutes.

The key word is action. An alert is a promise that a human should do something. If nobody needs to respond, it should be a dashboard or a daily report, not an alert. The classic failure mode is alerting on causes you cannot act on (CPU at 80 percent) instead of symptoms users feel (checkout requests failing).

Alerting is distinct from monitoring. Monitoring is recording and graphing what is happening. Alerting is the judgment call on top of that data: this specific thing, right now, is bad enough to interrupt a person.

How it works under the hood

A typical pipeline has three stages. First, a rule engine evaluates conditions on a schedule. In Prometheus this is an alerting rule whose expression is evaluated at a configured interval (one minute by default, often set to 15 or 30 seconds). When the condition first becomes true the alert is marked pending, and it moves to firing once the condition has held for the configured `for` duration. The duration matters: it prevents a single noisy data point from paging anyone.
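The pending-then-firing behaviour can be sketched as a small state machine. This is a simplified illustration, not Prometheus's implementation: the `AlertRule` class, its field names, and the "value above threshold" condition are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"


@dataclass
class AlertRule:
    """Hypothetical rule: the condition is `value > threshold`."""
    threshold: float
    for_seconds: float                     # how long the condition must hold
    state: State = State.INACTIVE
    pending_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> State:
        """Run one evaluation cycle at time `now` (in seconds)."""
        if value <= self.threshold:
            # Condition is false: reset, whatever state we were in.
            self.state = State.INACTIVE
            self.pending_since = None
            return self.state
        if self.pending_since is None:
            self.pending_since = now
        held_for = now - self.pending_since
        self.state = State.FIRING if held_for >= self.for_seconds else State.PENDING
        return self.state


# Error rate above 1 percent must hold for 5 minutes before anyone is paged.
rule = AlertRule(threshold=0.01, for_seconds=300)
rule.evaluate(0.05, now=0)     # pending: condition just became true
rule.evaluate(0.002, now=30)   # inactive: a single spike never pages
rule.evaluate(0.05, now=60)    # pending again, the clock restarts
rule.evaluate(0.05, now=360)   # firing: held for the full 300 seconds
```

Note that a single evaluation below the threshold resets the clock, which is why a lone noisy data point cannot page anyone.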

Second, firing alerts are sent to a router. Prometheus ships them to Alertmanager, which groups related alerts together, deduplicates copies coming from many instances, silences known issues during maintenance, and applies inhibition so that a big outage does not also fire 50 child alerts. Datadog, Grafana, and New Relic have their own equivalents built in.
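Grouping and deduplication can be sketched in a few lines. This is a toy model rather than Alertmanager's code: alerts are plain label dictionaries, the `group_alerts` function and the label names are hypothetical, and silences and inhibition are left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Alert = Dict[str, str]  # label name -> label value


def group_alerts(
    alerts: Iterable[Alert], group_by: Tuple[str, ...]
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Drop exact duplicates, then bucket alerts by the `group_by` labels."""
    seen = set()
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        fingerprint = frozenset(alert.items())
        if fingerprint in seen:
            continue  # same label set already received: deduplicate
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"alertname": "HighErrorRate", "service": "payments", "pod": "pay-1"},
    {"alertname": "HighErrorRate", "service": "payments", "pod": "pay-1"},  # duplicate
    {"alertname": "HighErrorRate", "service": "payments", "pod": "pay-2"},
    {"alertname": "DiskFull", "service": "search", "pod": "srch-7"},
]
grouped = group_alerts(incoming, group_by=("alertname", "service"))
for key, members in grouped.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Grouping by `alertname` and `service` rather than by `pod` is the design choice that turns many per-instance alerts into one notification per problem.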

Third, the router hands off to a paging tool such as PagerDuty or Opsgenie. The paging tool owns the on-call schedule, the escalation policy, and the actual phone call, SMS, push, or Slack message. If the first engineer does not acknowledge within, say, 5 minutes, it escalates to the next person, then to the team lead.
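An escalation policy is essentially an ordered list of responders with acknowledgement timeouts. The sketch below is illustrative only and does not reflect the PagerDuty or Opsgenie APIs; `EscalationStep`, `current_responder`, and the timeout values are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EscalationStep:
    responder: str
    ack_timeout_s: float  # how long this responder has to acknowledge


def current_responder(steps: Sequence[EscalationStep], seconds_since_page: float) -> str:
    """Return who should be notified now, assuming nobody has acknowledged yet."""
    if not steps:
        raise ValueError("an escalation policy needs at least one step")
    deadline = 0.0
    for step in steps:
        deadline += step.ack_timeout_s
        if seconds_since_page < deadline:
            return step.responder
    return steps[-1].responder  # policy exhausted: stay on the last level


policy = [
    EscalationStep("primary-on-call", ack_timeout_s=300),
    EscalationStep("secondary-on-call", ack_timeout_s=300),
    EscalationStep("team-lead", ack_timeout_s=600),
]
print(current_responder(policy, seconds_since_page=120))  # primary-on-call
print(current_responder(policy, seconds_since_page=400))  # secondary-on-call
```

A real paging tool would also stop escalating as soon as someone acknowledges; that state is omitted here to keep the timing logic visible.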

Two common rule styles are threshold alerts (latency p99 over 500ms) and rate-of-change or anomaly alerts (traffic dropped 40 percent versus the same hour last week). Threshold rules are simple and predictable; anomaly rules catch problems you did not anticipate but produce more false positives.
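The two rule styles differ mainly in what they compare against: a fixed limit, or a baseline taken from history. A minimal sketch, assuming hypothetical function names and using the 500 ms and 40 percent figures from the text as defaults:

```python
def latency_threshold_breached(p99_latency_ms: float, limit_ms: float = 500.0) -> bool:
    """Threshold rule: compare the current value against a fixed limit."""
    return p99_latency_ms > limit_ms


def traffic_dropped(
    current_rps: float, same_hour_last_week_rps: float, max_drop: float = 0.40
) -> bool:
    """Rate-of-change rule: compare the current value against a historical baseline."""
    if same_hour_last_week_rps <= 0:
        return False  # no usable baseline, so do not alert
    drop = (same_hour_last_week_rps - current_rps) / same_hour_last_week_rps
    return drop >= max_drop


print(latency_threshold_breached(620.0))                          # True
print(traffic_dropped(current_rps=550.0, same_hour_last_week_rps=1000.0))  # True
print(traffic_dropped(current_rps=700.0, same_hour_last_week_rps=1000.0))  # False
```

The baseline rule needs no hand-picked limit, but it inherits any oddities in last week's data (a holiday, a past outage), which is one source of its extra false positives.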

When to use it and the trade-offs

Alert on symptoms that map to your service level objectives, not on every internal metric. Google's site reliability practice recommends alerting on a fast-burning error budget: if you are burning a month of allowed errors in an hour, page; if you are burning slowly, open a ticket instead of waking someone at 3 AM.
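The page-or-ticket decision can be expressed through a burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below is an illustration under assumptions; the function names and the `page_at` and `ticket_at` cut-offs are placeholder values to tune per service, not prescribed ones.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0 so there is a budget to burn")
    return error_ratio / budget


def decide(error_ratio: float, slo: float,
           page_at: float = 14.4, ticket_at: float = 1.0) -> str:
    """Map a burn rate to an action. Cut-offs are illustrative assumptions."""
    rate = burn_rate(error_ratio, slo)
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "ok"


print(decide(error_ratio=0.02, slo=0.999))    # page: about 20x the allowed rate
print(decide(error_ratio=0.002, slo=0.999))   # ticket: about 2x the allowed rate
print(decide(error_ratio=0.0005, slo=0.999))  # ok: within budget
```

A burn rate of 1 means the budget lasts exactly the SLO period; anything above 1 will exhaust it early, and the higher the rate, the less time there is to react.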

The biggest trap is alert fatigue. When a team gets paged for things that are not actionable, people start ignoring the pager, and the one real alert gets lost in the noise. The fix is ruthless pruning: every alert that fires without anyone doing anything should be deleted, tuned, or downgraded.

Tuning is a balance between two errors. Make thresholds too tight and you get false positives that train people to ignore alerts. Make them too loose and you miss real incidents. Adding a `for` duration, alerting on multi-window burn rates, and grouping related alerts all reduce noise without missing real failures. A healthy team tracks how many alerts fire per week and what fraction led to real action.
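Tracking alert volume and the fraction that led to action needs little more than a log of pages with a reviewed outcome. A minimal sketch, assuming a hypothetical `PageRecord` that responders fill in during incident review:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class PageRecord:
    alert_name: str
    led_to_action: bool  # recorded by the responder after the fact


def weekly_alert_review(records: Iterable[PageRecord]) -> Dict[str, Tuple[int, float]]:
    """Per alert: (times fired, fraction of firings that led to real action)."""
    fired: Counter = Counter()
    acted: Counter = Counter()
    for record in records:
        fired[record.alert_name] += 1
        if record.led_to_action:
            acted[record.alert_name] += 1
    return {name: (count, acted[name] / count) for name, count in fired.items()}


week = [
    PageRecord("HighErrorRate", True),
    PageRecord("HighErrorRate", True),
    PageRecord("HighCPU", False),
    PageRecord("HighCPU", False),
    PageRecord("HighCPU", True),
    PageRecord("HighCPU", False),
]
for name, (count, fraction) in weekly_alert_review(week).items():
    print(f"{name}: fired {count}x, {fraction:.0%} actionable")
```

Alerts with a low actionable fraction are the first candidates to delete, tune, or downgrade to a ticket.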

A concrete example

Say you run a payments API with an SLO of 99.9 percent successful requests. You define a metric for failed payments and a Prometheus rule: fire if the failure rate over the last 5 minutes is high enough to burn 2 percent of the monthly error budget in one hour.
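The arithmetic behind that rule can be worked through directly. The sketch assumes a 30-day SLO period (720 hours); the function name and parameters are hypothetical.

```python
def failure_rate_threshold(
    slo: float,
    budget_fraction: float,
    alert_window_h: float,
    slo_period_h: float = 30 * 24,  # assumption: a 30-day "month"
) -> float:
    """Failure ratio that, if sustained, burns `budget_fraction` of the
    error budget within `alert_window_h` hours."""
    # Burning 2% of a 720-hour budget in 1 hour is 0.02 * 720 / 1 = 14.4x
    # the rate that would spend the budget evenly over the period.
    required_burn_rate = budget_fraction * slo_period_h / alert_window_h
    error_budget = 1.0 - slo  # 0.1% of requests may fail at a 99.9% SLO
    return required_burn_rate * error_budget


threshold = failure_rate_threshold(slo=0.999, budget_fraction=0.02, alert_window_h=1)
print(f"fire when the failure ratio exceeds {threshold:.2%}")
```

With these inputs the rule pages once roughly 1.44 percent of payment requests are failing, which is 14.4 times the 0.1 percent the SLO allows.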

At 2 AM a downstream bank gateway starts timing out. Failed payments climb, the alert goes pending, and once the condition has held for the rule's 5-minute `for` duration it fires. Alertmanager groups all the per-pod alerts into one, checks there is no active maintenance silence, and forwards a single notification to PagerDuty.

PagerDuty looks at the on-call schedule, calls the primary engineer, and posts to the team's Slack channel with a link to the runbook and the relevant dashboard. The engineer acknowledges, sees the gateway is the cause, fails over to the backup gateway, and the error rate drops. The alert auto-resolves when the failure rate returns to normal, and the incident is logged for the next morning's review.

Where it is used in production

PagerDuty

Owns the on-call schedule and escalation; turns a firing alert into a phone call, SMS, or push and escalates if nobody acknowledges in time.

Opsgenie

Atlassian's paging tool that routes alerts to the right on-call engineer with escalation policies and Slack and Jira integration.

Prometheus and Alertmanager

Prometheus evaluates alerting rules on metrics; Alertmanager groups, deduplicates, silences, and routes the firing alerts to paging tools.

Datadog

Built-in monitors fire on thresholds or anomaly detection across metrics, logs, and traces, then notify via Slack, email, or PagerDuty.

Learn Alerting hands-on

This page explains the idea. The full lesson lets you step through the animated diagrams, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.
