Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all current 766 lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

7.99 US dollars for lifetime access globally, or 499 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the exact topics asked in system design interviews at FAANG and top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. The material covers everything from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, real production examples from Netflix, Google, Amazon, Uber, and Stripe, and lifetime access for a one-time payment of $7.99 instead of annual subscriptions costing 100 to 200 dollars per year.

Advanced · Stream & Batch Processing

MapReduce

A programming model for processing massive datasets in parallel across a cluster. Map turns input records into key-value pairs; Reduce aggregates the values for each key. Pioneered by Google, implemented by Hadoop.

What is MapReduce?

In short

MapReduce is a programming model for processing very large datasets in parallel across a cluster of commodity machines. Work is split into two phases: a Map phase that turns input records into intermediate key-value pairs, and a Reduce phase that groups those pairs by key and aggregates each group into a final result. It was introduced by Google in 2004 and popularized by the open-source Hadoop implementation.

What MapReduce actually is

MapReduce is two functions you write plus a framework that runs them at scale. The map function takes one input record and emits zero or more key-value pairs. The reduce function takes one key and the list of all values that were emitted for it, then produces the output. You write only those two functions; the framework handles splitting the data, scheduling tasks, moving data between machines, and recovering from failures.

The classic example is counting words across billions of documents. Map reads a chunk of text and emits the pair (word, 1) for every word it sees. The framework groups all those pairs so that every 1 emitted for the word "database" arrives at the same reducer. Reduce then sums the list of 1s and emits (word, total). The same shape (emit keyed pairs, then aggregate per key) covers building search indexes, computing log statistics, and joining datasets.
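The word-count example can be sketched in a few lines of Python. This is a minimal single-process illustration, not a real framework: run_job is a hypothetical stand-in for everything the framework does (splitting, grouping by key, calling the reducer), and the sample documents are made up.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit (word, 1) for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: sum every 1 that was emitted for a single word."""
    return word, sum(counts)

def run_job(records, mapper, reducer):
    """Toy driver standing in for the framework (hypothetical helper)."""
    grouped = defaultdict(list)
    for record in records:                      # each record could be its own map task
        for key, value in mapper(record):
            grouped[key].append(value)          # group all values by key
    # One reducer call per distinct key, in sorted key order.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))

documents = ["the database is fast", "the cache is faster", "database database"]
word_totals = run_job(documents, map_words, reduce_counts)
print(word_totals["database"])  # 3
```

Only map_words and reduce_counts are the user's code here; swapping them for other functions reuses the same driver unchanged, which is the appeal of the model.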

The point of the model is that map tasks are independent of each other and reduce tasks are independent of each other. That independence is what lets the framework run thousands of tasks at once on cheap hardware instead of one expensive machine.

How it works under the hood

The framework first splits the input into pieces, typically 64 MB or 128 MB blocks living in a distributed file system like GFS or HDFS. It launches a map task per split, and it tries to run that task on the same machine that already holds the data so it does not have to ship gigabytes over the network. This is called data locality and it is a major reason the model scales.
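To make splitting and data locality concrete, here is a rough sketch under simplifying assumptions. The function names, node names, and the greedy "one pass, first free local worker wins" policy are all illustrative; real schedulers are more sophisticated and typically hand out tasks as workers ask for them.

```python
MB = 1024 * 1024

def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover the whole file."""
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]

def assign_map_tasks(block_locations, free_slots):
    """Greedy placement: prefer a worker that already holds the split's data.

    block_locations maps split id -> workers holding a replica.
    free_slots maps worker -> number of free task slots.
    Returns split id -> (worker, ran_locally).
    """
    slots = dict(free_slots)
    assignment = {}
    for split_id, replicas in block_locations.items():
        local = [w for w in replicas if slots.get(w, 0) > 0]
        if local:
            worker, is_local = local[0], True
        else:
            candidates = [w for w, free in slots.items() if free > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            worker, is_local = candidates[0], False   # data must cross the network
        slots[worker] -= 1
        assignment[split_id] = (worker, is_local)
    return assignment

splits = compute_splits(300 * MB, 128 * MB)   # two full splits plus a 44 MB tail
placement = assign_map_tasks(
    {0: ["node-a", "node-b"], 1: ["node-b", "node-c"], 2: ["node-c", "node-e"]},
    {"node-a": 1, "node-b": 1, "node-d": 1},
)
print(placement)  # splits 0 and 1 run locally; split 2 falls back to node-d
```

In this toy run only the third split has to be read over the network, because neither machine holding its data has a free slot.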

Each map task writes its key-value output to local disk, partitioned by key, usually with a hash of the key modulo the number of reducers. Between map and reduce sits the shuffle: reducers pull their partition from every map task and sort the incoming pairs so all values for one key sit together. The shuffle is the expensive part, because it moves data across the whole cluster and sorts it.
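The partition-then-shuffle step can be sketched as follows. This is an in-memory illustration with hypothetical helper names; a real framework writes the partitions to local disk and transfers them over the network. CRC32 is used here only because it is a stable hash available in the standard library, not because any particular framework uses it.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    """Pick a reducer: stable hash of the key modulo the number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def write_map_output(pairs, num_reducers):
    """One map task: bucket its output into one partition per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

def shuffle(map_outputs, reducer_id):
    """One reducer: pull its partition from every map task, sort, group by key."""
    pulled = [pair for output in map_outputs for pair in output[reducer_id]]
    pulled.sort(key=itemgetter(0))              # sorting puts equal keys side by side
    return [(key, [value for _, value in group])
            for key, group in groupby(pulled, key=itemgetter(0))]

NUM_REDUCERS = 3
map_outputs = [
    write_map_output([("a", 1), ("b", 1), ("a", 1)], NUM_REDUCERS),
    write_map_output([("b", 1), ("c", 1), ("a", 1)], NUM_REDUCERS),
]
reducer_inputs = [shuffle(map_outputs, r) for r in range(NUM_REDUCERS)]
```

Because every map task applies the same partition function, all values for a given key land at exactly one reducer, no matter which map task emitted them.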

A single master node tracks the state of every task. If a worker dies or runs slow, the master simply re-runs that task somewhere else; this is safe because map and reduce functions are expected to be deterministic and free of side effects, so a re-run produces the same output. To fight slow machines (stragglers), the framework launches backup copies of the last few in-progress tasks near the end of a job, a trick called speculative execution, and keeps whichever copy finishes first. An optional combiner runs a mini-reduce on each map task's output before the shuffle to cut down network traffic.
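The effect of a combiner is easy to see on a small input. The sketch below uses made-up text and hypothetical function names; it assumes the reduce operation is a sum, which is associative and commutative, so pre-aggregating on the map side does not change the final result.

```python
from collections import Counter

def map_task(lines):
    """One map task: emit (word, 1) for every word in its split."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combiner: a mini-reduce over a single map task's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

raw = map_task(["to be or not to be", "to do is to be"])
combined = combine(raw)

print(len(raw), len(combined))  # 11 6
```

Without the combiner this map task would send 11 pairs into the shuffle; with it, only 6. On real text, where a few words dominate, the saving is usually much larger.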

When to use it and the trade-offs

MapReduce fits large batch jobs where you scan most of a huge dataset once and do not need an answer in seconds: nightly ETL, log aggregation, training-data preparation, and rebuilding search or recommendation indexes. It shines when the work is embarrassingly parallel and the input is far too big for one machine.

The cost is latency and rigidity. Every job writes intermediate results to disk and reads them back, so even simple jobs take minutes, and a multi-step pipeline pays that disk and shuffle cost at every stage. It is a poor fit for low-latency queries, for iterative algorithms like many machine-learning trainers that loop over the same data hundreds of times, and for anything interactive.
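The per-stage disk cost can be illustrated with a two-job pipeline in which every stage reads its input from a file and writes its output to another file. This is a simplified sketch: run_stage and the file names are hypothetical, and it models only the write between jobs, not the map-side spill and shuffle that a real job also pays.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def run_stage(input_path, output_path, mapper, reducer):
    """Run one map-reduce stage: read from disk, process, write back to disk."""
    grouped = defaultdict(list)
    with input_path.open() as source:
        for line in source:
            for key, value in mapper(json.loads(line)):
                grouped[key].append(value)
    with output_path.open("w") as sink:
        for key in sorted(grouped):
            sink.write(json.dumps(reducer(key, grouped[key])) + "\n")

# Stage 1: count words. Stage 2: group words by how often they occur.
def map_words(text):
    return [(word, 1) for word in text.split()]

def reduce_sum(word, ones):
    return [word, sum(ones)]

def map_by_count(record):
    word, count = record
    return [(count, word)]

def reduce_collect(count, words):
    return [count, sorted(words)]

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    source_file = work / "input.jsonl"
    source_file.write_text("".join(json.dumps(t) + "\n" for t in ["a b a", "b c a"]))
    run_stage(source_file, work / "stage1.jsonl", map_words, reduce_sum)
    run_stage(work / "stage1.jsonl", work / "stage2.jsonl", map_by_count, reduce_collect)
    lines = (work / "stage2.jsonl").read_text().splitlines()
    words_by_count = [json.loads(line) for line in lines]

print(words_by_count)  # [[1, ['c']], [2, ['b']], [3, ['a']]]
```

The second job cannot start until stage1.jsonl has been fully written and is then read back in full. An iterative algorithm that loops a hundred times repeats that round trip a hundred times.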

Those weaknesses are why Apache Spark largely replaced raw MapReduce for new work. Spark keeps intermediate data in memory and expresses pipelines as a graph instead of forcing a disk write between every map and reduce, which makes iterative and interactive jobs much faster. The MapReduce model is still worth understanding because its core ideas (keyed parallelism, the shuffle, and fault tolerance through re-execution) live on inside Spark, Flink, and most batch engines.

A concrete real-world example

Google built MapReduce to regenerate its web search index. Crawlers store raw pages in GFS, and a chain of MapReduce jobs turns them into an index: the map phase parses each page and emits an entry for every word found, and the reduce phase collects all the pages that contain a given word into that word's posting list. Running this across thousands of machines turned an index rebuild from an unmanageable single-machine task into a routine batch job.
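The core of that indexing job, an inverted index, fits the same map and reduce shape. The sketch below is a toy illustration with made-up page URLs and text and a hypothetical build_index driver; it is not Google's pipeline, which involved many chained jobs and far richer parsing.

```python
from collections import defaultdict

def map_page(page):
    """Map: emit (word, url) once for each distinct word on a page."""
    url, text = page
    for word in sorted(set(text.lower().split())):
        yield word, url

def reduce_postings(word, urls):
    """Reduce: collect every page containing the word into its posting list."""
    return word, sorted(set(urls))

def build_index(pages):
    """Toy driver standing in for the framework (hypothetical helper)."""
    grouped = defaultdict(list)
    for page in pages:
        for word, url in map_page(page):
            grouped[word].append(url)
    return dict(reduce_postings(word, urls) for word, urls in grouped.items())

pages = [
    ("a.example/1", "distributed systems scale"),
    ("b.example/2", "systems fail"),
    ("c.example/3", "distributed consensus"),
]
index = build_index(pages)
print(index["systems"])  # ['a.example/1', 'b.example/2']
```

Looking up a search term then becomes a dictionary read on the reducer output rather than a scan over every page.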

On the open-source side, Yahoo ran Hadoop MapReduce on clusters of thousands of nodes to power web search, and in 2008 it won the terabyte sort benchmark by sorting a terabyte of data in 209 seconds on a cluster of roughly 900 commodity servers. Many companies then used the same Hadoop stack for clickstream analysis and reporting before in-memory engines took over the hot paths.

A simple way to picture it: you have a year of server logs spread across 500 files. Map reads each line and emits (status_code, 1). Reduce receives all the 1s for code 500 and sums them, so you learn exactly how many errors happened that year without ever loading all 500 files into one machine.
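Scaled down to two small files, the log example might look like this. The log line layout (status code as the last field) and the file names are assumptions made for the illustration; real access logs need a proper parser.

```python
from collections import defaultdict

def map_line(line):
    """Map: emit (status_code, 1); the code is assumed to be the last field."""
    fields = line.split()
    if fields:
        yield fields[-1], 1

def reduce_sum(status_code, ones):
    """Reduce: total the 1s emitted for one status code."""
    return status_code, sum(ones)

log_files = {
    "2024-01.log": ["GET /home 200", "GET /cart 500"],
    "2024-02.log": ["POST /pay 500", "GET /home 200", "GET /missing 404"],
}

grouped = defaultdict(list)
for lines in log_files.values():        # on a cluster, each file is its own map task
    for line in lines:
        for code, one in map_line(line):
            grouped[code].append(one)

hits_by_code = dict(reduce_sum(code, ones) for code, ones in grouped.items())
print(hits_by_code["500"])  # 2
```

No step ever needs all files in memory at once: each map task sees one file, and each reducer call sees only the values for one status code.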

Where it is used in production

Google

Introduced MapReduce in 2004 to build and refresh its web search index across thousands of machines on top of GFS.

Apache Hadoop

The open-source implementation of MapReduce plus HDFS that made the model available to everyone outside Google.

Yahoo

Ran Hadoop MapReduce on clusters of thousands of nodes for web search and large-scale log processing, and drove early Hadoop development.

Apache Spark

Succeeded raw MapReduce by keeping intermediate data in memory, but reuses its core ideas of keyed parallelism and shuffle.

Frequently asked questions

What is the difference between the map and reduce phases?

Map runs once per input record and emits intermediate key-value pairs, working on each record independently. Reduce runs once per distinct key and receives the full list of values emitted for that key, then aggregates them into the final output. Between them the framework shuffles and sorts so all values for a key reach the same reducer.

Is MapReduce the same as Hadoop?

No. MapReduce is the programming model and execution idea published by Google. Hadoop is an open-source software project that includes an implementation of MapReduce along with the HDFS file system and the YARN resource manager. You can have the MapReduce model without Hadoop, and modern Hadoop clusters often run Spark instead of MapReduce.

Why has Spark mostly replaced MapReduce?

MapReduce writes intermediate results to disk between every map and reduce step, which makes multi-stage and iterative jobs slow. Spark keeps data in memory and models a whole pipeline as one graph, so iterative algorithms and interactive queries run far faster. Spark still relies on the same keyed-parallelism and shuffle concepts under the hood.

What is the shuffle in MapReduce and why is it expensive?

The shuffle is the step that moves map output to reducers, grouping and sorting all values by key so each reducer gets every value for its keys. It is expensive because it transfers large amounts of data across the network and sorts it, which is usually the slowest part of a job. A combiner can pre-aggregate map output to reduce how much data the shuffle moves.

What kinds of problems is MapReduce a bad fit for?

Anything needing low latency or interactivity, like serving a query in milliseconds, and iterative algorithms that loop over the same data many times, such as gradient descent in machine learning. The repeated disk reads and writes between stages make those workloads slow, which is why in-memory engines like Spark are preferred there.

Learn MapReduce hands-on

This page explains the idea. The full lesson lets you step through a MapReduce job phase by phase, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.

Lessons that touch on MapReduce as part of a larger topic.

Fan-Out/Fan-In
Split work across many workers, then combine the results into a single output
intermediate · messaging event systems
