In the last lesson, you learned about latency: how long it takes to handle one request. But here's a question latency alone can't answer:
What happens when a million people show up at the same time?
Think about a restaurant again. A restaurant might serve one customer in 10 minutes (that's latency). But can it serve 500 customers in an hour? That depends on something entirely different: how many orders it can process in parallel, how big the kitchen is, and how many chefs are working.
That ability to handle volume? That's throughput.
Throughput matters because real systems don't serve one user at a time. Consider these numbers:
If these systems had great latency (fast for one request) but poor throughput (couldn't handle many requests), they would collapse under load. Throughput is what separates a prototype that works on your laptop from a production system that works for millions of users.
Throughput is the amount of work a system can complete in a given period of time.
It's usually measured as:
Let's make this concrete:
| Restaurant Metric | System Equivalent |
|---|---|
| Customers served per hour | Requests per second |
| Number of tables | Available connections |
| Number of chefs in the kitchen | CPU cores / worker threads |
| Kitchen capacity (stoves, ovens) | Memory and processing power |
| How fast one dish is prepared | Latency (single request time) |
A restaurant with 1 chef might make 10 meals per hour. Add 3 more chefs? Now it can make 40 meals per hour. The time to make one meal (latency) didn't change, but the number of meals completed per hour (throughput) quadrupled.
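The chef arithmetic above can be sketched in a few lines. This is a minimal, idealized model: it assumes every chef is always busy and never waits on a shared stove, and the 6 minutes per meal is simply what "10 meals per hour" from one chef implies. The function name `throughput_per_hour` is made up for illustration.

```python
def throughput_per_hour(workers: int, minutes_per_item: float) -> float:
    """Ideal throughput when every worker is always busy and works independently."""
    items_per_worker_per_hour = 60 / minutes_per_item
    return workers * items_per_worker_per_hour


# Latency stays fixed at 6 minutes per meal; only the number of chefs changes.
one_chef = throughput_per_hour(workers=1, minutes_per_item=6)
four_chefs = throughput_per_hour(workers=4, minutes_per_item=6)

print(f"1 chef:  {one_chef:.0f} meals/hour")
print(f"4 chefs: {four_chefs:.0f} meals/hour")
```

Real kitchens (and real servers) rarely scale this cleanly, because workers end up sharing resources, but the model shows why adding workers raises throughput without touching latency.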
This is an important distinction:
You can have low latency but low throughput: a single blazing-fast server that can only handle one request at a time. Or you can have higher latency but high throughput: a system that takes a bit longer per request but can handle thousands simultaneously.
The goal is usually to optimize both, but in practice there are trade-offs (more on this later).
One of the best ways to understand throughput is the highway analogy. Let's build this up step by step.
Engineers measure throughput using load testing tools. The basic approach is:
Common metrics to watch:
| Metric | What It Tells You |
|---|---|
| Requests Per Second (RPS) | How many complete requests the system handles each second |
| P50 / P99 Latency | The latency experienced by the median user (P50) vs. the slowest 1% (P99). When throughput is near capacity, P99 spikes dramatically. |
| Error Rate | Percentage of requests that fail. High error rates mean you've exceeded your throughput ceiling. |
| CPU / Memory Usage | How much of your resources are consumed. If CPU hits 100%, throughput can't increase. |
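To see how the metrics in this table are computed, here is a toy load test. It is only a sketch: `fake_request` is a hypothetical stand-in that sleeps for a few milliseconds instead of calling a real server, the 1% failure rate is an assumption, and real load testing tools do far more than this.

```python
import math
import random
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request() -> tuple[float, bool]:
    """Hypothetical stand-in for a real request: returns (latency_seconds, succeeded)."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # simulated work
    succeeded = random.random() > 0.01  # assume roughly 1% of requests fail
    return time.perf_counter() - start, succeeded


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted list."""
    index = max(0, math.ceil(pct / 100 * len(sorted_values)) - 1)
    return sorted_values[index]


def run_load_test(total_requests: int, concurrency: int) -> dict[str, float]:
    """Fire requests from a pool of threads and summarize the results."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: fake_request(), range(total_requests)))
    elapsed = time.perf_counter() - started

    latencies = sorted(latency for latency, _ in results)
    failures = sum(1 for _, succeeded in results if not succeeded)
    return {
        "rps": (total_requests - failures) / elapsed,  # successful requests per second
        "p50_ms": percentile(latencies, 50) * 1000,
        "p99_ms": percentile(latencies, 99) * 1000,
        "error_rate": failures / total_requests,
    }


if __name__ == "__main__":
    print(run_load_test(total_requests=200, concurrency=20))
```

Try raising `concurrency` and watch how `rps` changes. Against a real server you would also watch CPU and memory on the server side, which this sketch does not measure.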
Here's a critical concept many beginners miss:
As throughput approaches maximum capacity, latency spikes.
Think about it: when the highway is empty, you cruise at full speed (low latency). As more cars fill the road, everyone slows down (latency increases). At some point, adding more cars doesn't increase throughput at all; it just makes everything slower.
This is why production systems aim to stay at about 60-70% of maximum throughput. Going above that means any small spike in traffic causes latency to skyrocket.
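A simple queueing formula shows why latency climbs so sharply near capacity. The sketch below assumes the textbook M/M/1 model (one server, random arrivals, random service times), where average time in the system equals the service time divided by `1 - utilization`. Real systems won't match these numbers exactly, and the 10 ms service time is an arbitrary choice, but the shape of the curve is the point.

```python
def average_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in an M/M/1 queue: service time divided by (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_ms / (1 - utilization)


# Assume each request needs 10 ms of work when the system is idle.
for utilization in (0.10, 0.50, 0.70, 0.90, 0.99):
    latency = average_latency_ms(10, utilization)
    print(f"{utilization:.0%} busy -> about {latency:.0f} ms per request")
```

Under this model, going from 50% to 70% utilization adds a modest amount of latency, while going from 90% to 99% multiplies it tenfold. That is the headroom argument in numbers.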
| Strategy | How It Works | Highway Analogy |
|---|---|---|
| Vertical Scaling | Get a bigger, faster server | Widen each lane so cars can go faster |
| Horizontal Scaling | Add more servers | Build more highways |
| Caching | Store frequent results so you skip processing | Create express lanes for regular commuters |
| Async Processing | Handle non-urgent work in the background | Let delivery trucks use the road at night |
| Database Optimization | Speed up the slowest component | Remove the bottleneck intersection |
| Load Balancing | Distribute traffic across servers | Traffic signs directing cars to emptier routes |
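As one concrete example from the table, here is a minimal caching sketch using Python's built-in `functools.lru_cache`. The `get_display_name` function is hypothetical, and the 50 ms sleep stands in for a slow database query.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=1024)
def get_display_name(user_id: int) -> str:
    """Hypothetical slow lookup; the sleep stands in for a database query."""
    time.sleep(0.05)
    return f"user-{user_id}"


start = time.perf_counter()
get_display_name(42)  # cache miss: does the slow work
miss_seconds = time.perf_counter() - start

start = time.perf_counter()
get_display_name(42)  # cache hit: answered from memory
hit_seconds = time.perf_counter() - start

print(f"miss: {miss_seconds * 1000:.1f} ms, hit: {hit_seconds * 1000:.3f} ms")
print(get_display_name.cache_info())
```

Every cache hit is work the server no longer has to do, so the same hardware can answer more requests per second. The catch, not shown here, is deciding when cached results are stale and need to be refreshed.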
One of the most important lessons in system design is that latency and throughput are connected, and sometimes improving one hurts the other.
Sometimes you can improve both. Caching is a great example:
But often there are real trade-offs:
Batching increases throughput but also increases latency. Imagine a bus vs. a taxi:
Many real systems use batching. When you send a message on WhatsApp, it doesn't immediately sync to every server. Messages are batched together and synced periodically: slightly higher latency, but vastly better throughput.
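The batching trade-off can be put into rough numbers. The sketch below is a simplified model with made-up values: each batch pays a fixed overhead (think of one network round trip) plus a small cost per item, and items arrive at a steady interval. The function name `batch_stats` and every number in it are assumptions for illustration.

```python
def batch_stats(batch_size: int, arrival_interval_ms: float,
                overhead_ms: float, per_item_ms: float) -> tuple[float, float]:
    """Return (peak_items_per_second, worst_case_latency_ms) for a batch size."""
    processing_ms = overhead_ms + batch_size * per_item_ms
    peak_items_per_second = batch_size / processing_ms * 1000
    # The first item in a batch waits for the rest to arrive before work starts.
    wait_ms = (batch_size - 1) * arrival_interval_ms
    return peak_items_per_second, wait_ms + processing_ms


# Batch size 1 is the taxi; batch size 50 is the bus.
for size in (1, 10, 50):
    rate, latency = batch_stats(size, arrival_interval_ms=2.0,
                                overhead_ms=10.0, per_item_ms=1.0)
    print(f"batch of {size:>2}: ~{rate:.0f} items/s, up to {latency:.0f} ms latency")
```

Larger batches spread the fixed overhead across more items, so throughput rises, but the earliest item in each batch waits longer before anything happens to it.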
Processing overhead is another example. Adding encryption to every request makes it more secure, but the CPU time spent encrypting and decrypting increases latency and reduces throughput.
For most applications:
Understanding when to prioritize which metric is one of the key skills that separates junior engineers from senior ones.
What is throughput?
A server can handle 100 requests per second with an average latency of 50ms. During a traffic spike, the server receives 200 requests per second. What is most likely to happen?
A ride-sharing company batches nearby ride requests together to assign them more efficiently. What is the trade-off?