Alerting
Automatically notifying engineers when metrics cross predefined thresholds. Good alerts are actionable, not noisy. PagerDuty and Opsgenie route alerts to the right on-call person.
What is Alerting?
Alerting is an intermediate-level concept in the Observability & Monitoring area of system design. Engineers reach for it when they need to reason about real-world trade-offs in that space, not just for textbook correctness: production systems at companies like Netflix, Amazon, and Google make these decisions every day.
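To make the definition concrete, here is a minimal sketch of a threshold alert in Python. It is an illustration only, not how PagerDuty, Opsgenie, or any particular monitoring tool works: the `ThresholdRule` class, the `for_samples` field, and the `payments-oncall` rotation name are all hypothetical. The rule fires only after several consecutive samples breach the threshold, which is one common way to keep a single spike from becoming a noisy alert.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ThresholdRule:
    name: str
    threshold: float
    for_samples: int  # consecutive breaching samples required (assumed >= 1)
    route_to: str     # hypothetical on-call rotation name


def evaluate(rule: ThresholdRule, samples: Sequence[float]) -> bool:
    """Return True if the most recent `for_samples` samples all exceed the threshold."""
    if len(samples) < rule.for_samples:
        return False
    recent = samples[-rule.for_samples:]
    return all(value > rule.threshold for value in recent)


def notify(rule: ThresholdRule, samples: Sequence[float]) -> Optional[str]:
    """Build the message a router would deliver, or None if the rule is not firing."""
    if not evaluate(rule, samples):
        return None
    return f"[{rule.route_to}] {rule.name}: last value {samples[-1]} exceeds {rule.threshold}"


rule = ThresholdRule(
    name="HighErrorRate",
    threshold=0.05,
    for_samples=3,
    route_to="payments-oncall",
)

# A single spike does not fire the alert.
print(notify(rule, [0.01, 0.09, 0.02, 0.03]))  # None

# Three consecutive breaches do.
print(notify(rule, [0.01, 0.06, 0.07, 0.08]))
```

In a real setup the returned message would be handed to a paging service rather than printed, and the sample window would come from a metrics store.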
If you want to go deeper than this definition, with diagrams, code, and a quiz to lock it in, work through the "Alerting" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how the giants actually use it in production.
Learn Alerting in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the Alerting lesson
Related lessons
Lessons that touch on Alerting as part of a larger topic.
Design a Metrics/Monitoring System
Design a metrics collection and monitoring system - time-series databases, aggregation pipelines, alerting, and dashboards at scale
capstone · capstone
Alert Fatigue Prevention
Design alerting to maintain signal quality, because ignored alerts are worse than no alerts
advanced · reliability resilience
Real-Time Analytics
Query streaming data with sub-second latency, dashboards, alerting, and live monitoring
advanced · stream batch processing
See also
Related glossary terms you might want to look up next.
Metrics
Numerical measurements collected over time that describe system behavior: request rate, error rate, latency percentiles, CPU utilization. Prometheus is a widely used open-source collector.
Observability
The ability to understand a system's internal state from its external outputs. Built on three pillars: metrics, logs, and traces.
SLO
Service Level Objective: a target value for an SLI, like '99.9% of requests under 300ms.' The internal engineering goal that drives reliability investment.
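Because an SLO is a target value for an SLI, checking one comes down to simple arithmetic, and that result is often what an alert is built on. The sketch below works through the "99.9% of requests under 300ms" example; the function names and the sample latencies are made up for illustration.

```python
from typing import Sequence


def sli_fraction(latencies_ms: Sequence[float], limit_ms: float) -> float:
    """Fraction of requests that finished in under `limit_ms` (the SLI)."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for latency in latencies_ms if latency < limit_ms)
    return fast / len(latencies_ms)


def meets_slo(
    latencies_ms: Sequence[float],
    limit_ms: float = 300.0,
    target: float = 0.999,
) -> bool:
    """True if the measured SLI is at or above the SLO target."""
    return sli_fraction(latencies_ms, limit_ms) >= target


# Hypothetical window: 10,000 requests, 8 of them slower than 300 ms.
latencies = [120.0] * 9992 + [450.0] * 8

print(sli_fraction(latencies, 300.0))  # 0.9992
print(meets_slo(latencies))            # True
```

With 12 slow requests instead of 8, the SLI drops to 99.88% and the same check returns False, which is the kind of condition an SLO-based alert would notify on.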