Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all current 766 lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

7.99 US dollars for lifetime access globally, or 499 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the exact topics asked in system design interviews at FAANG and top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. The lessons cover everything from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

This course has 766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, and real production examples from Netflix, Google, Amazon, Uber, and Stripe. Lifetime access is a one-time payment of $7.99, instead of an annual subscription costing 100 to 200 dollars per year.

What is the difference between observability and monitoring?

Monitoring watches for known problems using predefined dashboards and alerts, like CPU over 80 percent. Observability lets you investigate unknown problems after the fact using metrics, logs, and traces, so you can answer questions you did not anticipate without shipping new code. Monitoring tells you something broke; observability helps you find out why.

What are the three pillars of observability?

Metrics (aggregated numbers over time, cheap to store, good for alerts), logs (detailed timestamped event records, rich but expensive at volume), and traces (the path of a single request across services, showing where time was spent). They are most useful when correlated through a shared trace ID.

Do I need distributed tracing for a small app?

Usually not. For a single service or monolith, structured logs and a debugger cover most cases. Tracing pays off once a request crosses multiple services, because that is when you cannot tell from logs alone which service caused the slowdown or failure.

What is OpenTelemetry?

OpenTelemetry is a vendor-neutral open standard and set of libraries for generating and exporting metrics, logs, and traces. Instrumenting once with OpenTelemetry lets you send the same telemetry to many backends like Prometheus, Jaeger, or Datadog without rewriting your code.

Why is high cardinality a problem in metrics?

Each unique combination of metric labels creates a separate time series. Tagging a metric with an unbounded value like user ID or full URL can produce millions of series, which slows queries and blows up storage costs. Keep metric labels low-cardinality and push detailed per-request data into logs and traces instead.

Intermediate | Observability & Monitoring

Observability

The ability to understand a system's internal state from its external outputs. Built on three pillars: metrics, logs, and traces.

What is Observability?

In short

Observability is the practice of being able to figure out what a running system is doing internally just by looking at the data it emits, without shipping new code to ask new questions. It rests on three kinds of telemetry: metrics (numbers over time), logs (timestamped event records), and traces (the path of a single request across services).

What observability actually means

The word comes from control theory, where a system is observable if you can infer its internal state from its outputs. Applied to software, it means you can answer questions about why a service is slow, why a request failed, or why memory is climbing, using the telemetry the system already produces. The test of a truly observable system is that you can debug a brand new, unexpected problem without adding new instrumentation first.

This is the difference people draw between monitoring and observability. Monitoring answers questions you knew to ask ahead of time: is CPU above 80 percent, is the error rate over 1 percent. Observability is about the questions you did not predict, like why do checkout requests from one specific mobile app version time out only between 2pm and 3pm. Monitoring tells you that something is wrong. Observability helps you find out why.
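The contrast can be sketched in a few lines. This is a minimal illustration only: the event list and its field names (app_version, hour, timed_out) are hypothetical. The monitoring check is a fixed threshold written ahead of time, while the observability-style query slices the same raw events by whichever dimensions turn out to matter.

```python
from collections import Counter

# Hypothetical request events; field names and values are illustrative only.
events = [
    {"route": "/checkout", "app_version": "4.2.0", "hour": 14, "timed_out": True},
    {"route": "/checkout", "app_version": "4.2.0", "hour": 14, "timed_out": True},
    {"route": "/checkout", "app_version": "4.1.9", "hour": 14, "timed_out": False},
    {"route": "/checkout", "app_version": "4.2.0", "hour": 10, "timed_out": False},
    {"route": "/search", "app_version": "4.2.0", "hour": 14, "timed_out": False},
]


def error_rate_alert(events, threshold=0.01):
    """Monitoring: a question decided ahead of time (is the error rate over 1 percent?)."""
    failures = sum(1 for e in events if e["timed_out"])
    return failures / len(events) > threshold


def timeouts_by(events, *dimensions):
    """Observability: slice the same events by any dimensions, chosen after the fact."""
    counts = Counter()
    for e in events:
        if e["timed_out"]:
            counts[tuple(e[d] for d in dimensions)] += 1
    return counts


alert_fired = error_rate_alert(events)
breakdown = timeouts_by(events, "route", "app_version", "hour")
```

The alert only says that something is wrong. The breakdown points at one route, one app version, and one hour, which is the kind of question nobody wrote a dashboard for in advance.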

In practice teams talk about three pillars: metrics, logs, and traces. They overlap and reinforce each other, and modern tooling increasingly stitches them together so you can jump from a spiking metric to the exact log lines and the exact trace that caused it.

The three pillars under the hood

Metrics are numeric measurements aggregated over time, like request count, p99 latency, or queue depth. They are cheap to store because they are just numbers in time buckets, so you can keep months of them and run fast queries and alerts. The cost is that they lose detail: a counter tells you 500 errors happened, not which user or which request caused them.
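As a rough sketch of why metrics are cheap but lossy, the snippet below collapses raw observations into one row of numbers per minute. The observation tuples are made up, and keeping raw latencies in memory to compute a nearest-rank p99 is a simplification; real metrics systems typically use histograms or sketches instead.

```python
import math
from collections import defaultdict

# Hypothetical raw observations: (unix_seconds, latency_ms, http_status, user_id).
observations = [
    (0, 120, 200, "u1"),
    (10, 180, 200, "u2"),
    (59, 950, 500, "u3"),
    (61, 140, 200, "u1"),
    (119, 160, 200, "u4"),
]


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def aggregate(observations, bucket_seconds=60):
    """Collapse raw observations into one row of numbers per time bucket."""
    raw = defaultdict(list)
    for ts, latency_ms, status, _user_id in observations:
        raw[ts - ts % bucket_seconds].append((latency_ms, status))
    return {
        start: {
            "count": len(rows),
            "errors": sum(1 for _, status in rows if status >= 500),
            "p99_ms": p99([latency for latency, _ in rows]),
        }
        for start, rows in raw.items()
    }


metrics = aggregate(observations)
```

After aggregation the first bucket records that one error happened, but the user ID is gone, so the metric alone cannot say who was affected.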

Logs are discrete, timestamped records of events, ideally structured as key-value pairs (JSON) rather than free text so machines can search and group them. They carry rich context per event but are expensive at volume, which is why high-traffic systems sample or rate-limit them. A single service at Google or Netflix scale can produce terabytes of logs per day.
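A minimal sketch of structured logging with Python's standard logging module follows. The JsonFormatter class, the "fields" convention for extra data, and the sample values are assumptions of this example, not a standard API.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Passing extra={"fields": {...}} is a convention of this sketch only.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

logger.info(
    "payment failed",
    extra={"fields": {"trace_id": "abc123", "order_id": 42, "reason": "pool_exhausted"}},
)

entry = json.loads(stream.getvalue())
```

Because every line is a JSON object, a log backend can filter on trace_id or group by reason without parsing free text.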

Traces follow one request as it hops across services. Each unit of work is a span, and spans are linked by a shared trace ID and parent span IDs so the full call tree can be reconstructed. A trace shows you that a 900ms checkout spent 600ms waiting on a downstream inventory call, which a metric alone would never reveal. The open standard for emitting all three is OpenTelemetry, which most vendors now ingest.
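The span model described above can be sketched with plain data classes. The Span type, the span names, and the timings are hypothetical, chosen to match the 900ms checkout with a 600ms inventory call; a real tracer would record far more attributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


# Hypothetical spans for one request, in arbitrary arrival order.
spans = [
    Span("t1", "c", "a", "inventory.reserve", 150, 750),
    Span("t1", "a", None, "POST /checkout", 0, 900),
    Span("t1", "d", "a", "payments.charge", 760, 880),
    Span("t1", "b", "a", "cart.load", 20, 120),
]


def build_tree(spans):
    """Link spans to their parents using parent span IDs."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children


def render(span, children, depth=0):
    """Return the call tree as indented lines, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} {span.duration_ms}ms"]
    for child in sorted(children[span.span_id], key=lambda s: s.start_ms):
        lines.extend(render(child, children, depth + 1))
    return lines


root, children = build_tree(spans)
tree = render(root, children)
slowest = max(children[root.span_id], key=lambda s: s.duration_ms)
```

Printing the lines in tree gives the reconstructed call tree, and slowest identifies the inventory call as the place where most of the 900ms went.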

The glue is context propagation. A trace ID and request attributes are carried through HTTP headers and message metadata so that the metric, the log line, and the span for the same request can be correlated after the fact.
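One common way to carry that context over HTTP is the W3C Trace Context traceparent header, which packs a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags into one value. The sketch below is a simplified, hand-rolled illustration with hypothetical helper names; in practice an OpenTelemetry SDK would handle this for you.

```python
import re
import secrets

# version "00", 32 hex trace ID, 16 hex parent span ID, 2 hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = TRACEPARENT.match(header)
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are not valid
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def outgoing_headers(incoming_headers):
    """Continue the caller's trace: same trace ID, fresh span ID for this hop."""
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parsed is None:
        # No usable context: start a new trace.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _parent_id, sampled = parsed
    return {"traceparent": format_traceparent(trace_id, secrets.token_hex(8), sampled)}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
outgoing = outgoing_headers(incoming)
```

The trace ID stays constant across every hop, which is what lets a backend join the spans, and any log line stamped with the same ID, back into one request.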

When to invest and the trade-offs

Observability earns its keep once you run distributed systems where a single user action touches many services. In a single monolith, attaching a debugger and reading a stack trace often suffices. In microservices, the failure is rarely in one place, so you need to see the request flow across boundaries. The more services and the higher the request volume, the more you need traces and structured logs, not just dashboards.

The main trade-off is cost and noise. Capturing every log and every trace at full fidelity is expensive in storage and in vendor bills, which is why teams use sampling. Head-based sampling keeps a fixed percentage of traces, decided when the request starts; tail-based sampling keeps a trace only if it turns out to be slow or errored, which is smarter but needs a buffer to hold spans until the request finishes. Either way you accept that you might miss the one trace you wanted.
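The two strategies can be sketched as follows. This is an illustrative toy, not any vendor's sampler: the function and class names, the 10 percent rate, and the 1000ms threshold are all assumptions.

```python
import hashlib
from collections import defaultdict


def head_sample(trace_id, rate=0.1):
    """Head-based: decide from the trace ID alone, before any work happens."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


class TailSampler:
    """Tail-based: buffer spans per trace and decide once the trace is complete."""

    def __init__(self, slow_ms=1000):
        self.slow_ms = slow_ms
        self.buffer = defaultdict(list)  # trace_id -> [(duration_ms, error)]

    def add_span(self, trace_id, duration_ms, error=False):
        self.buffer[trace_id].append((duration_ms, error))

    def finish(self, trace_id):
        """Return the spans to keep, or an empty list if the trace is dropped."""
        spans = self.buffer.pop(trace_id, [])
        slow = any(duration >= self.slow_ms for duration, _ in spans)
        errored = any(error for _, error in spans)
        return spans if (slow or errored) else []
```

Hashing the trace ID makes the head-based decision deterministic, so every service that sees the same ID makes the same choice. The tail sampler's buffer is the cost mentioned above: memory is held for every in-flight trace until finish is called.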

There is also instrumentation overhead. Adding spans and structured logging costs developer time and a small amount of runtime CPU and memory. Cardinality is the sharpest pitfall: tagging a metric with something unbounded like user ID or request URL can explode the number of time series into the millions and overwhelm your metrics backend. The discipline is to keep high-cardinality data in logs and traces, and keep metrics low-cardinality.
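The arithmetic behind the cardinality warning is a simple product. The label names and counts below are hypothetical, and the result is an upper bound that assumes every combination of label values actually occurs.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-count metric.
bounded = {"method": 5, "status_class": 5, "route": 40}
with_user_id = {**bounded, "user_id": 1_000_000}

bounded_series = max_series(bounded)          # 5 * 5 * 40
unbounded_series = max_series(with_user_id)   # the same, times one million users
```

Adding a single unbounded label multiplies the series count by the number of distinct values it can take, which is why one careless tag can dominate the cost of the whole metrics backend.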

A concrete example

Imagine a checkout request that normally takes 200ms suddenly shows a p99 of 3 seconds on a dashboard. That spike is a metric. You click into it and the tool surfaces the traces from that window, and you see that 90 percent of the slow traces have one span in common: a call to the payments service that is taking 2.8 seconds.

You open the logs filtered by that trace ID and find repeated lines saying the database connection pool is exhausted. Now you have the chain: a metric told you something was wrong, a trace told you where, and logs told you why. That round trip from symptom to root cause, in minutes rather than hours, is exactly what observability is for.
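The same metric-to-trace-to-log walk can be sketched over in-memory data. Everything here is hypothetical: the trace IDs, span names, durations, and log messages are made up to mirror the scenario, and a real tool would run these steps as queries against its backends.

```python
from collections import Counter

# Hypothetical telemetry. Each trace lists (span_name, duration_ms), root span first.
traces = {
    "t1": [("POST /checkout", 3000), ("cart.load", 90), ("payments.charge", 2800)],
    "t2": [("POST /checkout", 3100), ("cart.load", 110), ("payments.charge", 2850)],
    "t3": [("POST /checkout", 210), ("cart.load", 80), ("payments.charge", 60)],
    "t4": [("POST /checkout", 2900), ("cart.load", 2700), ("payments.charge", 70)],
}
logs = [
    {"trace_id": "t1", "message": "database connection pool exhausted"},
    {"trace_id": "t2", "message": "database connection pool exhausted"},
    {"trace_id": "t3", "message": "payment authorized"},
    {"trace_id": "t4", "message": "cart cache miss"},
]


def slow_trace_ids(traces, threshold_ms):
    """Step 1: the metric spike narrows the search to slow requests."""
    return [tid for tid, spans in traces.items() if spans[0][1] >= threshold_ms]


def common_slow_span(traces, trace_ids, min_ms):
    """Step 2: find the child span that is slow in the most slow traces."""
    counts = Counter(
        name
        for tid in trace_ids
        for name, duration in traces[tid][1:]
        if duration >= min_ms
    )
    return counts.most_common(1)[0][0]


def messages_for(logs, trace_ids):
    """Step 3: pull the log lines that share those trace IDs."""
    wanted = set(trace_ids)
    return Counter(e["message"] for e in logs if e["trace_id"] in wanted)


slow = slow_trace_ids(traces, threshold_ms=1000)
suspect = common_slow_span(traces, slow, min_ms=1000)
suspect_traces = [
    tid for tid in slow
    if any(name == suspect and ms >= 1000 for name, ms in traces[tid][1:])
]
top_message = messages_for(logs, suspect_traces).most_common(1)[0][0]
```

The shared trace ID is what makes step 3 possible: without it, the log lines could not be tied back to the slow requests.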

This is how teams at companies running thousands of services actually operate. Uber, for example, built distributed tracing precisely because no human could reason about a request crossing dozens of microservices from logs alone.

Where it is used in production

Prometheus

The de facto open-source metrics system. It scrapes numeric time series and powers alerting; originally built at SoundCloud, it is now a CNCF project.

Grafana

Visualizes metrics, logs (via Loki), and traces (via Tempo) in one place, the dashboard layer most teams put on top of their telemetry.

Jaeger

Open-source distributed tracing system created at Uber to follow requests across hundreds of microservices and find latency bottlenecks.

Datadog

A commercial platform that unifies metrics, logs, traces, and alerting, widely used by teams that prefer a managed service over self-hosting.


Learn Observability hands-on

This page explains the idea. The full lesson lets you step through the interactive diagrams, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.
