Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all current 766 lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

7.99 US dollars for lifetime access globally, or 499 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the exact topics asked in system design interviews at FAANG and top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. The course covers everything from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, real production examples from Netflix, Google, Amazon, Uber, and Stripe, and lifetime access for a one-time payment of $7.99 instead of annual subscriptions costing 100 to 200 dollars per year.

Is Spark the same as Hadoop?

No. Hadoop is an older ecosystem whose compute engine, MapReduce, wrote intermediate data to disk on every step. Spark is a faster replacement for that compute layer because it keeps data in memory. Spark can run on top of Hadoop's YARN scheduler and read from HDFS, but it does not need Hadoop at all and commonly runs on Kubernetes reading from S3.

Why is Spark faster than MapReduce?

Two reasons. It keeps intermediate results in memory instead of writing them to disk between steps, which matters a lot for iterative jobs that read the same data many times. And its Catalyst optimizer rewrites your query into a more efficient plan, for example pushing filters down so less data is read, before execution.

Should I use RDDs or DataFrames?

Use DataFrames or the Dataset API for almost everything. They go through the Catalyst optimizer and the Tungsten execution engine, so they are usually faster and use less memory than hand written RDD code. Drop to raw RDDs only for low level control that the higher level APIs do not expose, which is rare in modern code.

What is a shuffle and why does it slow Spark down?

A shuffle is when Spark has to move data across the network to regroup it, which happens during joins, group by, and repartition operations. It involves writing data, sending it between machines, and reading it back, so it is far more expensive than work that stays local. Reducing shuffles, for example by broadcasting a small table in a join, is the core of Spark performance tuning.
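
To make the cost concrete, here is a minimal sketch in plain Python, not Spark code. The record layout, the toy `hash_partition` function, and the three partition setup are assumptions made up for illustration. It regroups records by key and counts how many would have to leave the partition they started in.

```python
def hash_partition(key, num_partitions):
    """Toy deterministic partitioner (hypothetical; Spark uses its own hashing)."""
    return sum(key.encode()) % num_partitions

# Records as (country, watch_seconds), spread over 3 input partitions in arrival order.
input_partitions = [
    [("IN", 10), ("US", 20), ("IN", 30)],
    [("DE", 40), ("US", 50)],
    [("IN", 60), ("DE", 70)],
]

def shuffle(partitions, num_out):
    """Regroup records so that all records with the same key share one partition."""
    out = [[] for _ in range(num_out)]
    moved = 0
    for src_id, part in enumerate(partitions):
        for key, value in part:
            dst_id = hash_partition(key, num_out)
            out[dst_id].append((key, value))
            if dst_id != src_id:
                moved += 1  # in a real cluster this record would cross the network
    return out, moved

shuffled, records_moved = shuffle(input_partitions, 3)
print(records_moved, "of 7 records changed partition")
```

In this toy run five of the seven records change partition. On a cluster each of those moves means serializing the record, writing it out, sending it to another machine, and reading it back, which is why avoiding the regrouping altogether pays off.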

Can Spark handle real time streaming?

Yes, through Structured Streaming, but it processes data in small micro batches rather than one event at a time, so latency is typically in the seconds. That is fine for most analytics. If you need single digit millisecond per event latency, an event at a time engine like Apache Flink is a better fit.

Advanced · Stream & Batch Processing

Apache Spark

A unified analytics engine for large-scale data processing with built-in modules for SQL, streaming, ML, and graph computation. Processes data in-memory for speed.

What is Apache Spark?

In short

Apache Spark is an open source engine for processing large datasets across a cluster of machines. It splits a job into tasks, spreads them over many worker nodes, and keeps intermediate data in memory between steps, which makes it much faster than disk based systems like the original Hadoop MapReduce for most workloads.

What Apache Spark Is

Spark is a distributed compute engine. You write a program that describes a transformation over a dataset, for example read 2 billion log lines, filter the errors, group by hour, and count. Spark figures out how to break that work into pieces and run those pieces in parallel across a cluster, which might be 5 machines or 5000.

It started at UC Berkeley's AMPLab in 2009, was donated to the Apache Software Foundation in 2013, and became a top-level Apache project in 2014. The thing that made it take off was speed. The dominant tool at the time, Hadoop MapReduce, wrote intermediate results to disk after every step. Spark keeps that data in RAM where it can, so a multi step job that reads the same data repeatedly does not pay the disk cost every round.

Spark is not a database and not a storage system. It reads from somewhere (S3, HDFS, Kafka, Postgres, a Parquet file) and writes to somewhere. It is the engine in the middle that does the heavy lifting. It ships with four libraries that share the same engine: Spark SQL for tabular queries, Structured Streaming for real time data, MLlib for machine learning, and GraphX for graph computation.

How It Works Under The Hood

A Spark application has one driver and many executors. The driver is the process running your code; it builds the plan and hands out work. Executors are JVM processes on the worker machines that actually run the tasks and hold data in memory. A cluster manager like YARN, Kubernetes, or Spark standalone decides which machines the executors land on.

Spark is lazy. When you call transformations like filter, map, or join, nothing runs yet. Spark just records them as a graph of steps, called the DAG (directed acyclic graph). Only when you call an action, like count, collect, or write, does Spark look at the whole DAG, optimize it, split it into stages, and launch tasks. Within a stage, one task processes one partition of the data, so a stage over a dataset split into 200 partitions runs 200 tasks.
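
The difference between transformations and actions can be sketched with a toy class in plain Python. `LazyDataset` and its methods are hypothetical names that only mimic the shape of the idea. They are not Spark's API, and the real engine also optimizes the plan and splits it into stages.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, partitioned dataset (not Spark's API)."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # list of lists, one inner list per partition
        self.steps = tuple(steps)     # recorded transformations; nothing has run yet

    # Transformations: return a new dataset with one more recorded step.
    def filter(self, predicate):
        return LazyDataset(self.partitions, self.steps + (("filter", predicate),))

    def map(self, fn):
        return LazyDataset(self.partitions, self.steps + (("map", fn),))

    def _run_task(self, partition):
        """One task: apply every recorded step to a single partition."""
        rows = partition
        for kind, fn in self.steps:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    # Action: this is the point where work actually happens.
    def count(self):
        total = 0
        tasks_run = 0
        for partition in self.partitions:
            total += len(self._run_task(partition))
            tasks_run += 1
        return total, tasks_run


calls = []  # records every time the predicate is actually evaluated

def is_even(n):
    calls.append(n)
    return n % 2 == 0

data = LazyDataset([[1, 2, 3], [4, 5, 6], [7, 8]])
plan = data.filter(is_even).map(lambda n: n * 10)
calls_before_action = len(calls)  # still 0: only the plan exists so far

total, tasks_run = plan.count()   # the action triggers one task per partition
print(calls_before_action, total, tasks_run)
```

The predicate is never called while the plan is being built. It only runs once `count()` is called, and the three partitions produce three tasks.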

The core abstraction is the partitioned dataset. Older code uses the RDD (resilient distributed dataset). Modern code uses DataFrames and the Dataset API, which go through the Catalyst optimizer. Catalyst rewrites your query into a faster plan, for example pushing a filter down so less data is read, before it ever runs. This is why a DataFrame is usually faster than hand written RDD code.
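
As a rough illustration of filter pushdown, the sketch below runs the same query two ways in plain Python. The table, the `scan` and `project` helpers, and the row counter are all made up for this example. Catalyst itself rewrites query plans and is far more general than this.

```python
# Hypothetical table: (user_id, country, watch_seconds)
ROWS = [
    (1, "IN", 120), (2, "US", 300), (3, "IN", 45),
    (4, "DE", 600), (5, "US", 30), (6, "IN", 900),
]

def is_india(row):
    return row[1] == "IN"

def scan(rows, pushed_filter=None):
    """Read from the source; a pushed-down filter drops rows before they leave it."""
    out = [r for r in rows if pushed_filter is None or pushed_filter(r)]
    return out, len(out)  # second value: rows handed to the rest of the plan

def project(rows):
    """Keep only the columns the query asks for: (country, watch_seconds)."""
    return [(country, seconds) for _, country, seconds in rows]

def run_as_written(rows):
    scanned, rows_read = scan(rows)  # every row leaves the source
    return project([r for r in scanned if is_india(r)]), rows_read

def run_optimized(rows):
    scanned, rows_read = scan(rows, pushed_filter=is_india)  # filter at the source
    return project(scanned), rows_read

naive_result, naive_rows_read = run_as_written(ROWS)
fast_result, fast_rows_read = run_optimized(ROWS)
print(naive_rows_read, fast_rows_read)
```

Both plans return the same three rows, but the optimized one hands three rows to the rest of the plan instead of six. With a columnar source such as Parquet, that difference can mean whole chunks of a file are never read.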

Fault tolerance comes from lineage. Spark remembers the recipe that produced each partition. If a machine dies mid job, Spark recomputes only the lost partitions from their inputs instead of restarting the whole thing. The expensive step is a shuffle, where data has to be repartitioned across the network, for example during a join or a group by; minimizing shuffles is most of Spark performance tuning.
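
Lineage based recovery can be mimicked in a few lines of plain Python. This is a toy model under stated assumptions: the `source` dict plays the durable input, the `cache` dict plays executor memory, and `compute_partition` is a hypothetical helper, not anything from Spark.

```python
# Durable input, keyed by partition id.
source = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}

# Lineage: the ordered recipe that turns an input partition into an output partition.
lineage = [
    lambda rows: [r * r for r in rows],       # step 1: square each value
    lambda rows: [r for r in rows if r % 2],  # step 2: keep the odd values
]

compute_calls = []  # which partition ids were (re)computed, in order

def compute_partition(pid):
    """Rebuild one output partition from its input by replaying the lineage."""
    compute_calls.append(pid)
    rows = source[pid]
    for step in lineage:
        rows = step(rows)
    return rows

# Normal run: every partition is computed once and held in "executor memory".
cache = {pid: compute_partition(pid) for pid in source}

# Simulate losing the machine that held partition 1.
del cache[1]

# Recovery: recompute only what is missing.
for pid in source:
    if pid not in cache:
        cache[pid] = compute_partition(pid)

print(compute_calls)
```

Only partition 1 is computed a second time. Partitions 0 and 2 are left alone, which is the whole point of tracking lineage per partition.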

When To Use It And The Trade Offs

Reach for Spark when the data is too big to fit on one machine and you need to transform or aggregate it. Typical jobs are nightly ETL pipelines, building features for machine learning, joining huge tables, and processing event streams. If your data fits comfortably in memory on a single box, a tool like pandas, DuckDB, or Polars will be simpler and often faster because you skip all the cluster coordination.

The main trade off is overhead and operational weight. Spreading work over a cluster, serializing data, and shuffling across the network all cost time, so for small jobs Spark can be slower than a plain script while being much harder to run. A cluster also costs real money, and a badly tuned job that triggers excess shuffles or spills to disk can be slow and expensive at the same time.

Spark is a batch first engine even for streaming. In its default mode, Structured Streaming processes data in small micro batches, often a few hundred milliseconds to a few seconds apart, so end to end latency is typically seconds. If you need single digit millisecond per event latency, a true event at a time engine like Apache Flink fits better. For most analytics work where seconds are fine, Spark is the safer default because of its huge ecosystem and SQL support.
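
The latency floor that micro batching creates can be shown with simple arithmetic. The sketch below is plain Python with made up event timestamps and hypothetical helper names. It ignores processing time and only models how long an event waits for its batch to close.

```python
def micro_batches(events, interval_ms):
    """Group (arrival_ms, payload) events into fixed-interval batches.

    A batch covering [start, start + interval) is processed at its end,
    so an event waits until the batch boundary before it is handled.
    """
    batches = {}
    for arrival_ms, payload in events:
        start = (arrival_ms // interval_ms) * interval_ms
        batches.setdefault(start, []).append((arrival_ms, payload))
    return batches

def added_latency_ms(arrival_ms, interval_ms):
    """Time an event waits for its batch to close (processing time not included)."""
    batch_end = (arrival_ms // interval_ms + 1) * interval_ms
    return batch_end - arrival_ms

events = [(100, "a"), (900, "b"), (1000, "c"), (1999, "d"), (2500, "e")]
batches = micro_batches(events, interval_ms=1000)
waits = [added_latency_ms(t, 1000) for t, _ in events]
print(waits)
```

With a 1000 ms interval, the event that arrives at 1000 ms waits a full second while the one at 1999 ms waits 1 ms. An event at a time engine has no such boundary to wait for.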

A Concrete Example

Say you run a streaming service and want a daily report of watch time per country from 500 GB of raw event logs in S3. A single laptop cannot hold 500 GB in memory, and a script reading it serially would take hours.

With Spark you point a DataFrame at the S3 path, which Spark reads as many partitions in parallel. You filter to play events, join against a small users table to get each user's country, group by country, and sum watch seconds. Spark broadcasts the small users table to every executor so the join needs no shuffle, runs the filter and aggregation across, say, 100 executors at once, and writes the result back to S3 as Parquet.
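
The shape of that job can be sketched on one machine in plain Python. This is not Spark code: the event tuples, the `users` dict, and the `task` function are assumptions for illustration, a thread pool stands in for the executors, and a real job would read from S3 and write Parquet.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Small table, copied ("broadcast") to every task: user_id -> country.
users = {"u1": "IN", "u2": "US", "u3": "IN"}

# Large event log, split into partitions: (user_id, event_type, watch_seconds).
event_partitions = [
    [("u1", "play", 100), ("u2", "pause", 0), ("u2", "play", 50)],
    [("u3", "play", 200), ("u1", "play", 25)],
    [("u2", "play", 75), ("u3", "seek", 0)],
]

def task(events, users_broadcast):
    """Filter, join, and partially aggregate one partition without moving its rows."""
    partial = Counter()
    for user_id, event_type, seconds in events:
        if event_type != "play":
            continue                        # filter to play events
        country = users_broadcast[user_id]  # join is a local lookup, no shuffle
        partial[country] += seconds         # partial sum for this partition
    return partial

# The pool plays the role of the executors: partitions are processed independently.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(lambda part: task(part, users), event_partitions))

# Only the small per-partition totals are combined at the end.
watch_seconds_by_country = sum(partials, Counter())
print(dict(watch_seconds_by_country))
```

The large event rows never leave their partition. Only the small users table and the tiny per partition totals move, which is why the broadcast join is so much cheaper than regrouping the full log by user.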

What took hours serially finishes in minutes because the read, filter, and partial aggregation all happen in parallel and the data stays in memory between steps. This exact pattern, big logs in object storage turned into a small aggregated table, is the most common Spark workload in production.

Where it is used in production

Databricks

Founded by Spark's original creators; its entire platform is a managed, optimized Spark runtime sold as a cloud service.

Netflix

Runs large scale Spark ETL and machine learning pipelines on AWS to process viewing and operational data for recommendations and reporting.

Uber

Uses Spark across its data platform for batch ETL and feature generation feeding pricing, ETA, and fraud models.

Amazon EMR

AWS's managed cluster service that provisions Spark on demand so teams run jobs without operating their own cluster.

Learn Apache Spark hands-on

This page explains the idea. The full lesson lets you work through it interactively, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.

Related lessons

Lessons that touch on Apache Spark as part of a larger topic.

Data Lakes, Warehouses, and Lakehouses for ML
Data warehouse vs data lake vs lakehouse, explained for ML. Learn schema-on-write vs schema-on-read, open table formats like Delta and Iceberg, time travel, and why ML needs both cheap raw storage and fast structured access.
ml-intermediate · data engineering for ml
