Checkpointing
Periodically saving the state of a stream processing job to durable storage so that, after a failure, it can resume from the last saved state instead of reprocessing everything from the beginning. Apache Flink and Apache Spark both provide distributed checkpointing.
What is Checkpointing?
Checkpointing is an advanced concept in the Stream & Batch Processing area of system design. Engineers reach for it when they need to reason about real-world trade-offs in that space, not just for textbook correctness but because production systems at companies like Netflix, Amazon, and Google make these decisions every day.
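To make the idea concrete, here is a minimal single-process sketch in Python. It is an illustration only, not how Flink or Spark implement distributed checkpoints. The function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the JSON file format, and the `fail_at` crash switch are all assumptions made for this example. The job counts events by key, saves its input offset together with its state every few events, and after a simulated crash resumes from the last saved offset instead of from the beginning.

```python
import json
import os
import tempfile


def save_checkpoint(path, offset, counts):
    """Atomically persist the input offset together with the job state."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset, "counts": counts}, f)
    # Rename over the old file so a reader never sees a half-written snapshot.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (offset, counts) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        snapshot = json.load(f)
    return snapshot["offset"], snapshot["counts"]


def run_job(events, path, interval=3, fail_at=None):
    """Count events by key, checkpointing every `interval` events.

    `fail_at` (hypothetical, for the demo) simulates a crash just before
    the event at that offset is processed.
    """
    offset, counts = load_checkpoint(path)
    while offset < len(events):
        if fail_at is not None and offset == fail_at:
            raise RuntimeError(f"simulated crash at offset {offset}")
        key = events[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        if offset % interval == 0:
            save_checkpoint(path, offset, counts)
    save_checkpoint(path, offset, counts)  # final snapshot at end of input
    return counts


if __name__ == "__main__":
    events = ["a", "b", "a", "c", "a", "b", "c", "a"]
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    try:
        run_job(events, path, interval=3, fail_at=5)
    except RuntimeError as err:
        print(err)
    # Work done after the last checkpoint (offsets 3 and 4) is lost and redone.
    print(load_checkpoint(path))
    print(run_job(events, path, interval=3))
```

The offset and the counts are written in the same snapshot so that they always agree with each other. If they were saved separately, a crash between the two writes could cause events to be counted twice or skipped on recovery. The `interval` parameter reflects the usual trade-off: checkpointing more often costs more I/O during normal operation but leaves less work to redo after a failure.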
See also
Related glossary terms you might want to look up next.
Exactly-Once Processing
A processing guarantee where each message affects the results exactly once, even in the face of failures. Typically achieved through idempotent consumers and transactional producers.
Stream Processing
Processing data continuously as it arrives, rather than in batches. Powers real-time analytics, fraud detection, and live dashboards.
Apache Flink
A distributed stream processing framework that handles both real-time streams and batch data with exactly-once guarantees. Used by Alibaba, Netflix, and Uber at massive scale.