Observability
The ability to understand a system's internal state from its external outputs. Built on three pillars: metrics, logs, and traces.
What is Observability?
Observability is an intermediate-level concept in the Observability & Monitoring area of system design. Engineers reach for it whenever they need to reason about real-world trade-offs in that space, not only for textbook correctness but because production systems at companies like Netflix, Amazon, and Google face these decisions every day.
If you want to go deeper than this definition, with diagrams, code, and a quiz to lock it in, work through the "Observability" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how large companies use it in production.
Related lessons
Lessons that touch on Observability as part of a larger topic.
Observability Overview
The three pillars of observability (logs, metrics, and traces) and how they work together to make complex systems understandable
intermediate · observability monitoring
OpenTelemetry
The vendor-neutral observability framework: one standard for traces, metrics, and logs across languages and platforms
intermediate · observability monitoring
Datadog
The all-in-one observability platform: metrics, logs, traces, and more in a single SaaS product
intermediate · observability monitoring
New Relic
An early APM pioneer, now a full observability platform with a generous free tier
intermediate · observability monitoring
Dynatrace
AI-driven full-stack observability: automatic discovery, baselining, and root cause analysis
intermediate · observability monitoring
See also
Related glossary terms you might want to look up next.
Distributed Tracing
Tracking a request as it flows through multiple services in a distributed system. Each service adds its own span to a shared trace, creating a full picture of the request journey.
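To make the idea concrete, here is a minimal sketch of how a trace can be stitched together. It assumes a simplified model in which every unit of work is a span that carries a shared trace ID and a pointer to its parent. The `Span` class, the `start_span` helper, and the service names are hypothetical and not taken from any tracing library.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in one request
    span_id: str               # unique to this unit of work
    parent_id: Optional[str]   # span that called us; None for the root
    service: str


def start_span(service: str, parent: Optional[Span] = None) -> Span:
    """Create a span, reusing the caller's trace_id when there is one."""
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, service)


def handle_request() -> list:
    # Three hypothetical services called in sequence:
    # gateway -> orders -> payments.
    root = start_span("gateway")
    orders = start_span("orders", parent=root)
    payments = start_span("payments", parent=orders)
    return [root, orders, payments]


spans = handle_request()
for s in spans:
    print(s.service, s.trace_id, s.span_id, s.parent_id)
```

Because all three spans share one trace ID and each records its parent, a backend can rebuild the call tree for the request. Real systems pass these IDs between services in request headers rather than in-process as shown here.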
Metrics
Numerical measurements collected over time that describe system behavior: request rate, error rate, latency percentiles, CPU utilization. Prometheus is a widely used open-source collector.
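As a rough illustration of the measurements named above, the sketch below derives request rate, error rate, and latency percentiles from a handful of raw samples. The sample data, the 10-second window, and the nearest-rank `percentile` helper are assumptions made for the example. This is not how Prometheus or any particular collector computes them.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical samples from one 10-second window: (latency in ms, HTTP status).
requests = [(12, 200), (15, 200), (11, 200), (240, 500), (14, 200),
            (13, 200), (18, 200), (95, 200), (16, 200), (17, 503)]
window_seconds = 10

latencies = [ms for ms, _ in requests]
request_rate = len(requests) / window_seconds                       # requests per second
error_rate = sum(1 for _, status in requests if status >= 500) / len(requests)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"rate={request_rate}/s errors={error_rate:.0%} p50={p50}ms p99={p99}ms")
```

Here the median latency is 15 ms while the 99th percentile is 240 ms. That gap is why percentiles are tracked rather than averages: one slow request barely moves the median but dominates the tail.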
Logging
Recording discrete events with timestamps, severity levels, and context. Structured logs (such as JSON) can be filtered and aggregated by field; unstructured plaintext logs can only be searched as raw text, which makes them harder to query at scale. Ship them to a central system.
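A minimal sketch of structured logging using only Python's standard `logging` and `json` modules follows. The `JsonFormatter` class, the `context` key passed through `extra=`, and the field names are assumptions chosen for the example. The output is written to an in-memory stream so it can be inspected; a real service would write to stdout or a log shipper.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied via extra=.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in our stream only

logger.warning("payment declined",
               extra={"context": {"order_id": "A-1001", "trace_id": "abc123"}})

line = stream.getvalue().strip()
print(line)
```

Each line is a self-describing JSON object, so a central log system can filter on `order_id` or `level` directly instead of pattern-matching free text. Including a trace ID in the context is one common way to connect a log line to the trace it belongs to.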