Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all current 766 lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

7.99 US dollars for lifetime access globally, or 499 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the exact topics asked in system design interviews at FAANG and top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. Topics range from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, real production examples from Netflix, Google, Amazon, Uber, and Stripe, and lifetime access for a one-time payment of $7.99 instead of annual subscriptions costing 100 to 200 dollars per year.

What is the difference between load shedding and rate limiting?

Rate limiting enforces a fixed quota per client (for example 100 requests per minute) no matter how busy the server is. Load shedding reacts to the server's real-time health and only rejects requests when the system is actually near overload. They solve different problems and are commonly used together.

What HTTP status code should a shed request return?

Usually 503 Service Unavailable when the server is overloaded, or 429 Too Many Requests when a client-level limit is hit. Both should include a Retry-After header so well-behaved clients back off instead of retrying immediately and making the overload worse.

Why drop requests instead of just queuing them?

Unbounded queuing converts an overload into high latency and eventually memory exhaustion. Queued requests often time out before they are served, so the work was wasted and the client got nothing anyway. Shedding fast returns capacity to requests that can actually complete in time.

How do you decide which requests to shed first?

Assign priority or criticality to traffic. Shed background, retryable, and non-essential requests first (analytics, prefetching) and protect critical, user-facing, and revenue-related requests (payments, logins, health checks). Many systems pass a criticality tag with each request to drive this decision.

Does load shedding work without client retries?

It works better with them, but retries must use exponential backoff and jitter. Naive immediate retries from every shed client create a retry storm that hammers the shedder with the same traffic it just dropped, which can amplify the overload instead of relieving it.

Advanced · Reliability & Resilience

Load Shedding

Deliberately dropping low-priority requests during overload to protect the system's ability to serve high-priority traffic. It is better to serve some requests than to crash and serve none.

What is Load Shedding?

In short

Load shedding is the practice of deliberately rejecting or dropping a portion of incoming requests when a system is overloaded, so the requests it does accept get served correctly instead of every request slowing to a crawl or the whole service crashing. The system sheds the least important traffic first to protect its ability to keep serving the most important traffic.

What load shedding actually is

Every server has a finite amount of CPU, memory, connections, and thread or goroutine capacity. When traffic arrives faster than the server can process it, work piles up in queues. If nothing intervenes, latency climbs, memory fills with queued requests, and eventually the server falls over and serves nobody. Load shedding is the decision to say no to some requests on purpose, early and cheaply, so the rest can be served well.

The core idea is simple: it is better to serve 80 percent of requests correctly than to attempt 100 percent and serve 0 percent because the server crashed. A shed request usually gets an immediate HTTP 503 (Service Unavailable) or 429 (Too Many Requests), often with a Retry-After header, instead of being accepted and then timing out 30 seconds later.
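As a minimal sketch of that rejection path, the helper below picks the status code and attaches Retry-After. The function name, its parameters, and the two-second default are assumptions made for illustration; they do not come from any particular framework.

```python
from typing import Dict, Optional, Tuple


def rejection(server_overloaded: bool,
              client_over_quota: bool,
              retry_after_s: int = 2) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) for a rejected request, or None to admit it."""
    if server_overloaded:
        status = 503  # Service Unavailable: the server itself is near its limit
    elif client_over_quota:
        status = 429  # Too Many Requests: this client exceeded its own quota
    else:
        return None   # admit the request
    # Retry-After in delay-seconds form tells well-behaved clients when to return.
    return status, {"Retry-After": str(retry_after_s)}
```

The point of the sketch is that the decision is made from two cheap booleans before any request body is parsed or any downstream call is made.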

Load shedding is different from rate limiting. Rate limiting enforces a fixed quota per client regardless of how busy the server is. Load shedding reacts to the server's actual health right now and only kicks in when the system is genuinely close to its limit. The two are often used together.
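The contrast can be sketched in a few lines. Both the fixed-window limiter and the headroom check below are simplified, hypothetical illustrations (the class, function, and parameter names are assumptions): the first looks only at who is calling, the second looks only at how busy the server is.

```python
from collections import defaultdict


class FixedWindowRateLimiter:
    """Per-client quota. It never looks at server health."""

    def __init__(self, quota_per_window: int) -> None:
        self.quota = quota_per_window
        self.counts = defaultdict(int)  # client_id -> requests seen this window

    def allow(self, client_id: str) -> bool:
        if self.counts[client_id] >= self.quota:
            return False
        self.counts[client_id] += 1
        return True

    def new_window(self) -> None:
        """Call once per window (for example every minute) to reset the quotas."""
        self.counts.clear()


def server_has_headroom(in_flight: int, max_in_flight: int) -> bool:
    """Load-shedding check. It never looks at who the client is."""
    return in_flight < max_in_flight
```

Used together, a request would be admitted only if both checks pass: the client is within its quota and the server still has headroom.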

How it works under the hood

A shedder needs a signal that the system is overloaded and a policy for what to drop. The signal can be a concurrency limit (number of in-flight requests), queue depth, CPU utilization, request latency creeping past a threshold, or a measured drop in goodput (successfully completed work per second). When the signal crosses a line, the server starts rejecting new requests at the front door, before they consume expensive resources.
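A minimal sketch of the simplest signal, a count of in-flight requests checked at the front door, might look like this. The class and function names are hypothetical, and a real server would wire the check into its request middleware.

```python
import threading
from typing import Any, Callable, Optional, Tuple


class ConcurrencyLimiter:
    """Tracks in-flight requests and refuses new ones past a fixed limit."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self._limit:
                return False  # over the line: shed
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(limiter: ConcurrencyLimiter,
           work: Callable[[], Any]) -> Tuple[int, Optional[Any]]:
    """Admit or shed before any expensive work starts."""
    if not limiter.try_acquire():
        return 503, None
    try:
        return 200, work()
    finally:
        limiter.release()  # always free the slot, even if work() raises
```

The rejection path touches one lock and one integer, which keeps a shed request far cheaper than a served one.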

What to drop matters as much as when. Good systems assign priority to traffic and shed the least important first. A payment confirmation or a health check from a load balancer should survive while a background analytics call or a non-critical recommendation refresh gets dropped. Netflix's Hystrix, the independent Resilience4j library that Netflix recommended as an alternative once Hystrix entered maintenance mode, and Envoy's adaptive concurrency filter all implement variations of rejecting work under load, though they mainly cap concurrency rather than rank requests by priority.
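One way to express priority-based shedding is to give each criticality class its own utilization threshold, so the least important traffic is refused first as load rises. The class names and threshold values below are illustrative assumptions, not taken from any specific system.

```python
from enum import IntEnum


class Criticality(IntEnum):
    SHEDDABLE = 0   # analytics, prefetching, background refreshes
    DEFAULT = 1     # ordinary user-facing reads
    CRITICAL = 2    # payments, logins, load balancer health checks


# Hypothetical thresholds: shed a class once utilization reaches its value.
SHED_AT = {
    Criticality.SHEDDABLE: 0.70,
    Criticality.DEFAULT: 0.85,
    Criticality.CRITICAL: 1.00,
}


def admit(criticality: Criticality, in_flight: int, capacity: int) -> bool:
    """Admit the request only if utilization is below its class threshold."""
    utilization = in_flight / capacity
    return utilization < SHED_AT[criticality]
```

At 75 percent utilization this sketch refuses sheddable traffic but still admits default and critical requests; critical requests are refused only when the server is completely full.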

Two common algorithms show up in practice. The first is a static concurrency limit: reject anything beyond N concurrent requests. The second is adaptive, modeled on TCP congestion control, where the server probes for the concurrency level that maximizes goodput and minimizes latency, then sheds beyond it. Netflix's concurrency-limits library works this way. CoDel (Controlled Delay) queue management is a related technique that watches how long requests wait in the queue rather than how many are in flight, and drops work once that delay stays above a target. The key property is that shedding must be cheap. If rejecting a request costs almost as much as serving it, the shedder makes overload worse.
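The adaptive idea can be sketched with additive increase and multiplicative decrease (AIMD), the same family of rule used by TCP congestion control. This is a simplified illustration and does not reproduce the algorithm of any particular library; the class name, latency target, and backoff ratio are assumptions.

```python
class AimdLimit:
    """Grow the concurrency limit slowly while healthy, cut it quickly when not."""

    def __init__(self, initial: int = 20, min_limit: int = 1,
                 max_limit: int = 1000, latency_target_s: float = 0.2,
                 backoff: float = 0.9) -> None:
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.latency_target_s = latency_target_s
        self.backoff = backoff

    def on_sample(self, latency_s: float, dropped: bool = False) -> int:
        """Feed one completed request's latency and return the updated limit."""
        if dropped or latency_s > self.latency_target_s:
            # Multiplicative decrease: back off fast when latency degrades.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Additive increase: probe for one more slot of concurrency.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit
```

The returned limit would replace the fixed N of a static limiter, so requests arriving beyond the current value are shed.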

When to use it and the trade-offs

Use load shedding on any service that faces unpredictable spikes: API gateways, checkout services on sale days, anything fronting a database that can be saturated. It is a last line of defense that keeps a service available during the exact moments it matters most, like a traffic surge, a retry storm, or a downstream dependency slowing down.

The main trade-off is that shedding always means refusing real users. Some legitimate requests get rejected, which is a worse experience than a fast response but far better than a timeout or an outage for everyone. Tuning the threshold is the hard part: set it too low and you reject traffic you could have served; set it too high and the protection triggers too late to help.

Load shedding also interacts badly with naive retries. If clients immediately retry every 503, the shedder gets hammered by the very requests it just dropped, and the retry traffic can be larger than the original load. This is why shedding should be paired with exponential backoff, jitter, and a circuit breaker on the client side. Without those, shedding can amplify an overload instead of relieving it.
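On the client side, exponential backoff with jitter can be sketched as below. The function name, base delay, and cap are illustrative assumptions; the randomization shown is the "full jitter" variant, in which each client waits a random time between zero and the current exponential ceiling.

```python
import random
from typing import Callable


def backoff_delay(attempt: int,
                  base_s: float = 0.1,
                  cap_s: float = 10.0,
                  retry_after_s: float = 0.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles on every attempt up to cap_s, and the actual wait is
    a random fraction of it so shed clients do not all retry at the same time.
    Any server-supplied Retry-After is honoured as a minimum wait.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return retry_after_s + rng() * ceiling
```

A client that received a 503 with Retry-After: 2 on its second retry would call backoff_delay(1, retry_after_s=2.0) and sleep for the result before trying again.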

A concrete example

Imagine a checkout service that can comfortably handle 2,000 concurrent requests and starts degrading past that. During a flash sale, traffic jumps to 8,000 concurrent requests in a few seconds. Without shedding, all 8,000 requests get accepted, the thread pool is exhausted, queue latency climbs to tens of seconds, downstream timeouts fire, and the service effectively serves nobody.

With load shedding, the server admits roughly 2,000 in-flight requests and immediately returns 503 with Retry-After to the rest. Those 2,000 complete in normal time, around 80 to 150 milliseconds. The shed clients back off and retry a moment later, and as the surge passes they get through. The service stays up the entire time and processes far more successful checkouts than it would have if it had tried to accept everything.
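The numbers in this example can be checked with Little's Law (throughput equals concurrency divided by latency). The arithmetic below is a back-of-the-envelope sketch that assumes the service sustains exactly 2,000 in-flight requests at the stated latencies and that no new traffic arrives while the shed requests are retried.

```python
def throughput_rps(concurrency: int, latency_s: float) -> float:
    """Little's Law rearranged: requests completed per second."""
    return concurrency / latency_s


ADMITTED = 2_000          # in-flight requests the service admits
SHED = 8_000 - ADMITTED   # requests turned away with 503 during the spike

slow = throughput_rps(ADMITTED, 0.150)  # roughly 13,333 requests per second
fast = throughput_rps(ADMITTED, 0.080)  # roughly 25,000 requests per second

# Time to work through the 6,000 shed requests once they retry.
drain_slow_s = SHED / slow  # about 0.45 seconds
drain_fast_s = SHED / fast  # about 0.24 seconds
```

Under these assumptions the shed clients wait well under a second longer than the admitted ones, compared with tens of seconds of queue latency when everything is accepted.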

AWS describes this pattern in its Builders' Library: shed work at the edge, prioritize by request type, and measure goodput rather than raw request count. The lesson from production is consistent across companies. A service that gracefully refuses excess load survives spikes that would otherwise turn into multi-hour outages.

Where it is used in production

Amazon / AWS

The AWS Builders' Library documents load shedding as a core resilience pattern; services shed low-priority work at the edge and optimize for goodput during overload.

Netflix

Open-sourced the concurrency-limits library that adaptively finds the right in-flight limit and sheds requests beyond it, prioritizing critical traffic over background calls.

Envoy / Istio

Envoy's adaptive concurrency filter and overload manager shed requests automatically in service meshes: the filter reacts to rising request latency, and the overload manager reacts to resource pressure such as heap memory crossing configured thresholds.

Google

Google's SRE practice uses criticality-based load shedding and CoDel-style queue management so that overloaded backends drop the least important RPCs first.

Learn Load Shedding hands-on

This page explains the idea. The full lesson lets you step through the animated diagrams, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.
