Checkpointing

Periodically saving the state of a stream processing job so it can recover from failures without reprocessing everything from the beginning. Flink and Spark use distributed checkpoints.

What is Checkpointing?

In short

Checkpointing is the technique of periodically saving the full internal state of a long-running job, such as a stream processing pipeline, to durable storage so that after a crash the job restarts from the last saved point instead of replaying everything from the beginning. Frameworks like Apache Flink and Spark Structured Streaming take consistent distributed checkpoints that capture both operator state and the exact position in the input stream.

What checkpointing actually is

A streaming job runs for days or weeks and builds up state as it processes data: running counts, windowed aggregates, join buffers, machine learning model weights, deduplication sets. If the machine running that job dies, all of that in-memory state is gone. Without a way to recover it, you would have to reprocess every event from the start of time, which is often impossible or takes hours.

A checkpoint is a consistent snapshot of that state plus a marker of how far the job has read its input. The framework writes it to durable storage, such as Amazon S3 or a distributed filesystem like HDFS, on a fixed interval, for example every 30 seconds or every minute. When the job restarts, it loads the most recent checkpoint and resumes reading the input from the recorded position.

The key word is consistent. In a distributed job with dozens of parallel operators, you cannot just freeze each one at a random moment, because their states would reflect different points in the stream. A correct checkpoint captures a state that corresponds to one exact cut across all inputs.

How it works under the hood

Apache Flink uses an algorithm based on Chandy-Lamport distributed snapshots. The job manager's checkpoint coordinator triggers each checkpoint, and the source operators inject special records called checkpoint barriers into their streams. A barrier flows downstream with the data. When an operator has received the barrier on all of its input channels, it snapshots its own state and forwards the barrier. Because the barrier marks the boundary, every operator snapshots the state corresponding to the same logical point in the stream, so the combined snapshot is consistent.
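The barrier alignment step can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Flink code: the `SummingOperator` class, the `BARRIER` marker, and the arrival order are all hypothetical, and forwarding the barrier downstream is left out. It shows an operator with two input channels holding back records that arrive behind a barrier until the barrier has been seen on every channel.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker standing in for a checkpoint barrier


class SummingOperator:
    """Toy operator with several input channels that keeps a running sum."""

    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.total = 0
        self.blocked = set()     # channels whose barrier has already arrived
        self.buffered = deque()  # records held back while waiting for alignment
        self.snapshots = []      # state captured each time alignment completes

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_channels:
                # Barrier seen on every input: the state reflects exactly the
                # records that preceded the barrier on each channel.
                self.snapshots.append(self.total)
                self.blocked.clear()
                # Held-back records belong to the next checkpoint epoch.
                while self.buffered:
                    self.total += self.buffered.popleft()
        elif channel in self.blocked:
            self.buffered.append(item)  # arrived behind a barrier: hold it back
        else:
            self.total += item


op = SummingOperator(num_channels=2)
# Channel 0 delivers its barrier early; channel 1 still has pre-barrier records.
arrivals = [(0, 1), (0, BARRIER), (0, 100), (1, 2), (1, 3), (1, BARRIER), (1, 200)]
for channel, item in arrivals:
    op.receive(channel, item)

print(op.snapshots)  # [6]  (only the pre-barrier records 1, 2 and 3)
print(op.total)      # 306  (every record is still processed after alignment)
```

The record `100` reaches the operator before the snapshot is taken, yet it is excluded from the snapshot because it sits behind the barrier on its channel. That exclusion is what keeps the captured state aligned to one cut across both inputs.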

Spark Structured Streaming takes a different approach tied to its micro-batch model. It records, for each batch, the input offsets it consumed and the state changes it produced, writing both to a checkpoint directory. On restart it reads the last committed batch and continues. The unit of progress is the batch rather than a barrier flowing through individual records.
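The batch-as-unit-of-progress idea can be sketched as follows. This is a simplified model in plain Python, not Spark's API: `MicroBatchJob` is a hypothetical class, and a single dict entry stands in for the checkpoint directory, folding the offsets and the state into one committed record. The point it illustrates is that a batch that crashes before committing leaves no trace, so the restarted job reruns it from the same offsets.

```python
class MicroBatchJob:
    """Toy word-count job that makes progress one micro-batch at a time."""

    def __init__(self, checkpoint):
        # `checkpoint` is a plain dict standing in for the checkpoint directory.
        self.checkpoint = checkpoint
        committed = checkpoint.get("committed")
        if committed is None:
            self.batch_id, self.offset, self.counts = -1, 0, {}
        else:
            # Resume from the last batch that was fully committed.
            self.batch_id = committed["batch_id"]
            self.offset = committed["end_offset"]
            self.counts = dict(committed["state"])

    def run_batch(self, source, batch_size, crash_before_commit=False):
        end = min(self.offset + batch_size, len(source))
        counts = dict(self.counts)  # work on a copy until the batch commits
        for word in source[self.offset:end]:
            counts[word] = counts.get(word, 0) + 1
        if crash_before_commit:
            raise RuntimeError("simulated crash")
        # Offsets and state are committed together as one unit of progress.
        self.checkpoint["committed"] = {
            "batch_id": self.batch_id + 1,
            "end_offset": end,
            "state": dict(counts),
        }
        self.batch_id, self.offset, self.counts = self.batch_id + 1, end, counts


source = ["a", "b", "a", "c", "a", "b"]
checkpoint_dir = {}

job = MicroBatchJob(checkpoint_dir)
job.run_batch(source, batch_size=2)  # batch 0 commits
try:
    job.run_batch(source, batch_size=2, crash_before_commit=True)
except RuntimeError:
    pass  # batch 1 never committed

job = MicroBatchJob(checkpoint_dir)  # restart: resumes after batch 0
job.run_batch(source, batch_size=2)  # reruns batch 1 from the same offsets
job.run_batch(source, batch_size=2)
print(job.counts)
```

After the restart the final counts match a run with no crash, because the failed batch contributed nothing to the committed state.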

Two settings matter the most. The checkpoint interval controls how often state is saved: shorter intervals mean less reprocessing after a crash but more overhead during normal running. The state backend controls where state lives and how it is snapshotted; Flink can keep state in memory or in an embedded RocksDB instance on local disk, then copy it to durable storage at checkpoint time.
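To put rough numbers on the interval setting, here is a back-of-the-envelope sketch. It rests on assumptions that are not from any framework: each checkpoint costs a fixed amount of work (`checkpoint_cost_s`, a hypothetical figure), that cost is charged entirely against normal processing, and crashes are equally likely at any moment within an interval.

```python
def interval_tradeoff(interval_s, checkpoint_cost_s):
    """Return (overhead fraction, average replay seconds, worst-case replay seconds)."""
    overhead = checkpoint_cost_s / interval_s  # share of time spent checkpointing
    average_replay = interval_s / 2            # crash lands mid-interval on average
    worst_replay = interval_s                  # crash just before the next checkpoint
    return overhead, average_replay, worst_replay


for interval in (10, 60, 300):
    overhead, average_replay, worst_replay = interval_tradeoff(interval, checkpoint_cost_s=2)
    print(
        f"{interval}s interval: {overhead:.1%} overhead, "
        f"about {average_replay:.0f}s of input to replay on average, "
        f"{worst_replay}s in the worst case"
    )
```

Under these assumptions, moving from a 10-second to a 300-second interval cuts overhead by a factor of 30 and raises the amount of input to replay by the same factor. Real jobs behave less linearly, since asynchronous and incremental snapshots reduce the cost per checkpoint.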

When to use it and the trade-offs

Use checkpointing whenever a job holds state that would be expensive or impossible to rebuild, and whenever you need exactly-once or at-least-once delivery guarantees. A fraud detection pipeline counting transactions per card over a sliding hour, or a job maintaining the current inventory per product, cannot afford to lose that state.

The cost is overhead. Writing large state to remote storage takes time and IO, and during the checkpoint the job may briefly slow down. If state is hundreds of gigabytes, a naive full snapshot every minute is too expensive, so Flink supports incremental checkpoints with RocksDB that only write the changed parts.
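The idea behind incremental checkpoints can be sketched with a toy key-value store. This is an analogy in plain Python rather than a description of how RocksDB does it: RocksDB-based incremental checkpoints work with its on-disk files, whereas the hypothetical `IncrementalStore` below tracks changed keys directly and ignores deletions.

```python
class IncrementalStore:
    """Toy keyed state that tracks which keys changed since the last checkpoint."""

    def __init__(self):
        self.state = {}
        self.dirty = set()

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        """Return only the entries modified since the previous checkpoint."""
        delta = {key: self.state[key] for key in self.dirty}
        self.dirty.clear()
        return delta


def restore(deltas):
    """Rebuild the full state by applying deltas from oldest to newest."""
    state = {}
    for delta in deltas:
        state.update(delta)
    return state


store = IncrementalStore()
durable = []  # stands in for remote storage such as S3

for user in range(1000):
    store.put(f"user-{user}", 0)
durable.append(store.checkpoint())  # first checkpoint carries all 1000 keys

store.put("user-7", 42)
store.put("user-9", 13)
durable.append(store.checkpoint())  # second checkpoint carries only 2 keys

print([len(delta) for delta in durable])  # [1000, 2]
print(restore(durable) == store.state)    # True
```

The second checkpoint transfers two entries instead of a thousand, and recovery still reconstructs the full state by layering the deltas on top of the first snapshot.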

There is a recovery-versus-throughput tension. Checkpoint too often and you pay constant overhead. Checkpoint too rarely and a crash forces you to reprocess a large window of input, increasing recovery time and the latency of catching back up. A checkpoint is also different from a savepoint: checkpoints are automatic and owned by the framework for fault recovery, while savepoints are manual snapshots you trigger to upgrade code, rescale, or migrate a job.

A concrete real-world example

Imagine a job at a payments company that maintains a running total of spend per user over the last 24 hours to flag anomalies. It has been running for a week and holds state for 50 million users. The Kafka topic it reads from has retained only the last 3 days of events.

If the job crashes at hour 168 with no checkpointing, recovery is broken. Kafka cannot replay the full week because it only keeps 3 days, so the 24-hour totals would be wrong, and even replaying 3 days would take a long time and produce duplicate alerts.

With checkpointing every minute to S3, recovery is simple. The job restarts, loads the snapshot from say 40 seconds before the crash, restores all 50 million user totals, and rewinds its Kafka consumer to the offset stored in that checkpoint. It reprocesses only those last 40 seconds of events and is back to live within seconds, with state intact and no double counting.
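The recovery path in this example can be sketched in a few lines. This is a minimal simulation in plain Python under stated assumptions: a list stands in for the Kafka topic, list indices stand in for offsets, the `run` function and its parameters are hypothetical, and the 24-hour window is left out so that only the restore-and-rewind mechanics remain.

```python
def run(events, totals, start_offset, checkpoint_every, crash_at=None):
    """Process events from start_offset; return (totals, last checkpoint taken)."""
    checkpoint = None
    for offset in range(start_offset, len(events)):
        if crash_at is not None and offset == crash_at:
            return None, checkpoint  # crash: in-memory totals are lost
        user, amount = events[offset]
        totals[user] = totals.get(user, 0) + amount
        if (offset + 1) % checkpoint_every == 0:
            # Snapshot the state together with the position to resume from.
            checkpoint = {"totals": dict(totals), "next_offset": offset + 1}
    return totals, checkpoint


events = [
    ("alice", 10), ("bob", 5), ("alice", 7), ("bob", 1),
    ("alice", 3), ("bob", 2), ("alice", 4),
]

# Reference run with no failure.
expected, _ = run(events, {}, start_offset=0, checkpoint_every=3)

# Run that crashes at offset 5; only the checkpoint taken after offset 2 survives.
_, checkpoint = run(events, {}, start_offset=0, checkpoint_every=3, crash_at=5)

# Recovery: restore the totals and rewind the input to the recorded offset.
recovered, _ = run(
    events,
    dict(checkpoint["totals"]),
    start_offset=checkpoint["next_offset"],
    checkpoint_every=3,
)

print(recovered == expected)  # True
```

Offsets 3 and 4 are processed twice overall, once before the crash and once after recovery, but the first pass only touched state that was thrown away. Restoring the totals and the offset from the same checkpoint is what prevents double counting.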

Where it is used in production

Apache Flink

Uses checkpoint barriers and an algorithm based on Chandy-Lamport distributed snapshots to take consistent distributed checkpoints, with RocksDB incremental checkpoints for large state.

Apache Spark

Spark Structured Streaming writes input offsets and state changes per micro-batch to a checkpoint directory on HDFS or S3 for fault recovery.

Apache Kafka Streams

Backs up local state stores to compacted Kafka changelog topics so a restarted or rebalanced instance can restore state without reprocessing all input.

Netflix

Runs large Flink jobs for real-time monitoring and personalization, relying on checkpoints to Amazon S3 so multi-hour stateful pipelines survive node failures.

Frequently asked questions

What is the difference between a checkpoint and a savepoint in Flink?

A checkpoint is automatic and managed by the framework for fault recovery; it is taken on a timer and often cleaned up after the next one succeeds. A savepoint is a manual snapshot you trigger to stop a job cleanly, upgrade its code, change parallelism, or migrate it, and it is retained until you delete it.

Does checkpointing guarantee exactly-once processing?

It enables it but does not guarantee it alone. Exactly-once also requires the sources to support replay from a recorded offset and the sinks to be idempotent or transactional. Flink with Kafka sources and a transactional sink can give end-to-end exactly-once; with a plain sink you typically get at-least-once.

How often should I checkpoint?

There is no single answer. Common intervals range from a few seconds to a few minutes. Shorter intervals reduce how much input you must reprocess after a crash but add overhead during normal running. Tune it against your state size, acceptable recovery time, and how much throughput you can spare.

What is incremental checkpointing?

Instead of writing the entire state every time, an incremental checkpoint only writes the parts that changed since the last one. Flink supports this with the RocksDB state backend, which makes checkpoints of very large state, hundreds of gigabytes, practical because each one transfers only the deltas.

Where is checkpoint state stored?

In durable, shared storage so any node can read it during recovery. Common choices are Amazon S3, HDFS, Google Cloud Storage, or Azure Blob Storage. Flink keeps working state locally in memory or RocksDB for speed and copies it to that durable backend when a checkpoint is taken.
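The exactly-once answer above turns on what the sink does when events are replayed after recovery. The following sketch, in plain Python with hypothetical `AppendSink` and `IdempotentSink` classes, sends the same replayed events to a plain append-only sink and to a sink keyed by input offset. It illustrates the idempotent case only; transactional sinks reach the same result by a different mechanism.

```python
class AppendSink:
    """Plain sink: every write adds a row, so replays create duplicates."""

    def __init__(self):
        self.rows = []

    def write(self, offset, value):
        self.rows.append(value)


class IdempotentSink:
    """Sink keyed by input offset: writing the same offset twice is harmless."""

    def __init__(self):
        self.rows = {}

    def write(self, offset, value):
        self.rows[offset] = value


def deliver_with_replay(events, sink, replay_from):
    """Write every event, then re-send those after the last checkpoint."""
    for offset, value in enumerate(events):
        sink.write(offset, value)
    # Simulated recovery: the job rewinds to `replay_from` and emits again.
    for offset in range(replay_from, len(events)):
        sink.write(offset, events[offset])


events = ["a", "b", "c", "d"]

plain = AppendSink()
deliver_with_replay(events, plain, replay_from=2)

keyed = IdempotentSink()
deliver_with_replay(events, keyed, replay_from=2)

print(len(plain.rows))  # 6  (at-least-once: "c" and "d" appear twice)
print(len(keyed.rows))  # 4  (replayed writes overwrite themselves)
```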
