Advanced · Reliability & Resilience

Tail Latency

The high-percentile response times (p99, p99.9) that describe the slowest requests. A system with a 10ms median but a 2s p99 is slow for 1 in every 100 requests.

What is Tail Latency?

In short

Tail latency is the response time of the slowest requests in a system, measured at high percentiles like p99 and p99.9 instead of the average. A service with a 10ms median but a 2 second p99 is fast for most requests yet painfully slow for 1 in every 100, and that slow 1 percent is what users remember and complain about.

What tail latency actually measures

When you sort every request's response time from fastest to slowest, the median sits in the middle, and the average collapses the whole list into one number that hides the worst cases. Tail latency looks at the far right end of that sorted list. The p99 is the response time that 99 percent of requests beat, so only the slowest 1 percent are worse. The p99.9 marks off the slowest 1 in 1,000, and the p99.99 the slowest 1 in 10,000.
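As a rough illustration of reading percentiles off a sorted list, here is a minimal Python sketch using the nearest-rank method. The `percentile` helper and the sample data are assumptions made up for this example, and real monitoring systems usually estimate percentiles from histograms instead of sorting raw samples.

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # 1-based rank; round() guards against floating-point noise such as 999.0000000000001
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]

# Hypothetical traffic: 985 fast requests at 10 ms and 15 slow ones at 2000 ms.
latencies = [10] * 985 + [2000] * 15

mean = sum(latencies) / len(latencies)
print("mean:", mean)
print("p50: ", percentile(latencies, 50))
print("p99: ", percentile(latencies, 99))
```

With this made-up data the mean lands near 40ms, a value no request actually experienced, while the median stays at 10ms and the p99 exposes the 2 second stalls.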

These numbers matter because real users do not experience the average. A page that makes 100 backend calls to render will, on average, include one call at or beyond your p99 latency, and about 63 percent of page loads will include at least one. So if your p99 is 2 seconds, most page loads include a 2 second stall even though your median call is 10ms. The more calls a request fans out to, the more likely it is to touch the tail.

This is why teams that care about user experience set Service Level Objectives on percentiles, not averages. Companies like Amazon and Google set internal targets on high percentiles, for example p99 under 100ms, because median latency tells you almost nothing about how the product feels for the slowest requests.

What causes the tail and how it gets fixed

The tail is made of rare events that pile up. A single request can get unlucky and hit a garbage collection pause, a cold cache miss, a slow disk seek, a noisy neighbor on shared hardware, a lock it has to wait for, a TCP retransmit, or a server that just queued too many requests at once. None of these happen often, but with enough traffic they happen constantly to someone.

Fan-out makes it worse. If one user request triggers 50 parallel backend calls and each backend has a 1 percent chance of a slow response, the odds that at least one of the 50 is slow are about 1 minus 0.99 to the 50th power, roughly 40 percent. So a backend with a great p99 still produces a terrible end-to-end p99 once you fan out.
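The fan-out arithmetic above can be checked with a few lines of Python. This is a sketch that assumes every backend call is slow independently with the same probability; `p_any_slow` is a name chosen for this example.

```python
def p_any_slow(fanout, p_slow):
    """Probability that at least one of `fanout` independent calls is slow,
    when each call is slow with probability `p_slow`."""
    return 1 - (1 - p_slow) ** fanout

# Each backend call has a 1 percent chance of landing in the tail.
for n in (1, 20, 50, 100):
    print(f"{n:>3} calls: {p_any_slow(n, 0.01):.1%} chance of at least one slow call")
```

For 20, 50, and 100 calls this comes out to roughly 18, 40, and 63 percent. Correlated slowness, such as every replica sharing one overloaded disk, breaks the independence assumption and changes these numbers.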

Common fixes target these tails directly. Hedged requests send a duplicate call to a second replica after a short delay and take whichever returns first. In "The Tail at Scale", Google's Jeff Dean and Luiz André Barroso describe a BigTable benchmark where hedging after 10ms cut p99.9 from 1,800ms to 74ms while sending about 2 percent more requests. Request timeouts with retries cap how bad a single attempt can get. Load shedding rejects excess work before queues build up. Tuning the garbage collector, adding more replicas to reduce queueing, and isolating slow tenants all chip away at the worst percentiles.

When to optimize the tail and the trade-offs

Chasing the tail is expensive, so it is not always the right call. For a batch job that runs overnight, the average throughput matters and a slow request here and there is invisible. For an interactive product, a checkout flow, or any service other services depend on, the tail is the whole game because slow requests stack up across the call graph.

The main trade-off is cost versus consistency. Hedged requests roughly double the load for the requests you hedge, so you trade compute for a tighter p99.9. Adding replicas reduces queueing but costs money and adds coordination. Aggressive timeouts trim the tail but can turn a slow success into a hard failure, so you have to pair them with retries and idempotent operations.
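To make the timeout-plus-retry pairing concrete, here is a minimal asyncio sketch. The helper `call_with_retries` and the fake backend `flaky_read` are hypothetical names for illustration, and the sketch assumes the wrapped operation is idempotent, so repeating it is safe.

```python
import asyncio

async def call_with_retries(make_call, timeout_s, attempts):
    """Run make_call() with a per-attempt timeout, retrying on timeout.

    make_call must be safe to repeat (idempotent), because a timed-out
    attempt may still have taken effect on the server.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout_s)
        except asyncio.TimeoutError:
            if attempt == attempts:
                raise  # retry budget spent: the slow success becomes a hard failure

async def demo():
    calls = 0

    async def flaky_read():
        # Hypothetical backend: the first attempt stalls, later attempts are fast.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.5 if calls == 1 else 0.01)
        return "ok"

    result = await call_with_retries(flaky_read, timeout_s=0.1, attempts=3)
    return result, calls

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo, the first attempt is cut off at 100ms and the second succeeds. A production version would normally add backoff and a cap on total retries across the fleet, so that retries do not amplify an overload.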

A practical rule: measure first, optimize the percentile your users actually feel, and stop when the SLO is met. Spending weeks to drop p99.99 from 800ms to 600ms is wasted effort if almost none of your users land in that bucket and your p99 is already healthy.

A concrete example

Imagine a product page backed by a microservice that calls a recommendations service. The recommendations service has a median latency of 8ms, a p99 of 40ms, and a p99.9 of 1.5 seconds caused by occasional JVM garbage collection pauses on its servers.

The product page itself looks fast in dashboards because the median is single-digit milliseconds. But the page calls recommendations once per product, and a typical page shows 20 products. The chance that at least one of those 20 calls hits the 1.5 second p99.9 pause is around 2 percent, so roughly 1 in 50 page loads stalls for over a second. Support tickets pile up with vague reports of the site being slow, and nobody can reproduce it because it is random.

The fix is to add a 50ms hedge: if a recommendations call has not returned in 50ms, fire a second call to another replica and use whichever answers first. The duplicate rarely fires, the GC pause on one replica almost never lines up with a pause on the other, and the p99.9 of a recommendations call drops from 1.5 seconds to around 60ms (the 50ms wait plus a typical 8ms response from the second replica). Total extra load is under 1 percent, because the p99 is 40ms and hedges only trigger past the 50ms threshold.
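A hedged call along these lines could be sketched with asyncio as follows. This is an illustrative sketch, not a production client: `hedged`, `fake_recommendations`, and the replica names "a" and "b" are hypothetical, and the demo simulates one replica stuck in a 1.5 second pause.

```python
import asyncio

async def hedged(call, replicas, hedge_delay_s):
    """Send call(replicas[0]); if it has not finished after hedge_delay_s,
    also send call(replicas[1]) and return whichever finishes first."""
    primary = asyncio.create_task(call(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # fast path: no duplicate was sent

    backup = asyncio.create_task(call(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop waiting on the slower copy
    return done.pop().result()  # re-raises if the first finisher failed

async def fake_recommendations(replica):
    # Simulated behaviour: replica "a" is paused for 1.5 s, replica "b" is healthy.
    await asyncio.sleep(1.5 if replica == "a" else 0.008)
    return f"recs from {replica}"

async def demo():
    loop = asyncio.get_running_loop()
    start = loop.time()
    result = await hedged(fake_recommendations, ["a", "b"], hedge_delay_s=0.05)
    return result, loop.time() - start

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The answer comes from replica "b" shortly after the 50ms hedge delay instead of after the full pause. Cancelling the losing task only stops the client from waiting. Saving work on the server needs cooperation from the server, which is the idea behind tied requests.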

Where it is used in production

Google

Published the modern playbook for tail latency in "The Tail at Scale", using hedged requests and tied requests to cut p99.9 across its serving systems.

Amazon

Reported that every 100ms of added latency cost about 1 percent of sales, and built DynamoDB and its services around tight tail latency SLOs.

Netflix

Uses request timeouts, retries, and load shedding to keep one slow dependency from dragging the tail of the whole request. It built Hystrix for this, and since moving Hystrix to maintenance mode it recommends resilience4j for new projects.

Cassandra

Offers speculative retry, sending a redundant read to another replica when the first is slow, directly attacking read tail latency.

Frequently asked questions

What is the difference between average latency and tail latency?

Average latency is the mean response time across all requests, a single number that hides how slow the worst requests are. Tail latency looks at the slowest requests at high percentiles like p99 and p99.9. A system can have a great average and a terrible tail, and users feel the tail.

What does p99 latency mean exactly?

p99 is the response time that 99 percent of your requests are faster than. Put another way, only the slowest 1 percent of requests take longer than the p99 value. It is the standard way to describe how bad the slow cases get.

Why does tail latency get worse with more backend calls?

If a single user request fans out to many backend calls, it only takes one of them being slow to make the whole request slow. With 100 parallel calls each having a 1 percent chance of hitting the tail, about 63 percent of requests touch at least one slow call.

How do hedged requests reduce tail latency?

After a short delay, the client sends a duplicate request to a second replica and uses whichever response comes back first. Since the slow events causing the tail are usually independent per server, both copies are rarely slow at once, so the p99.9 drops sharply for only a small increase in total load.

Should I always optimize for tail latency?

No. For batch or offline work, throughput and average matter more and rare slow requests are invisible. For interactive, user-facing, or heavily fanned-out services, the tail dominates the experience and is worth optimizing until your percentile SLO is met.

Related lessons

Lessons that touch on Tail Latency as part of a larger topic.

Model Serving and Inference APIs: Turning a Model File Into a Service
How a packaged model becomes a live API that survives thousands of requests per second: online vs batch serving, the request path, REST vs gRPC, dynamic batching, autoscaling, tail latency, and safe rollout
ml-foundation · core
