What is the difference between Apache Flink and Apache Spark?

Spark started as a batch engine and does streaming by splitting data into small micro-batches, which adds latency. Flink is streaming first: it processes each event as it arrives, giving lower latency and true event-at-a-time semantics, and it treats batch as a stream that ends. Spark is often simpler for pure batch reports; Flink wins for low-latency stateful streaming.

What does exactly-once mean in Flink?

It means each input event affects the job's state and results once, even after a crash. Flink achieves this by snapshotting operator state and source positions together in a checkpoint, then on recovery restoring both and replaying only the events after that checkpoint. With supported sinks like Kafka it can extend the guarantee end to end.

What is the difference between event time and processing time?

Processing time is the clock on the machine when an event is handled. Event time is the timestamp inside the event itself, set when it actually happened. Flink prefers event time and uses watermarks to handle late or out-of-order events, so a transaction delayed by the network still falls in the correct time window.

How does Flink store its state?

State is kept local to each operator. Flink's default backend holds state on the JVM heap; for large state you switch to the embedded RocksDB backend, which holds state on local disk and lets a single job manage state into the terabytes. Periodic checkpoints copy that state to durable storage such as S3 or HDFS so it survives failures.

Is Apache Flink hard to operate?

It is heavier than a plain library. You run a cluster of JobManager and TaskManager processes, tune memory and checkpointing, and plan for large state and backpressure. Managed offerings like Amazon Managed Service for Apache Flink or Ververica remove much of that burden if you do not want to run it yourself.

Advanced · Stream & Batch Processing

Apache Flink

A distributed stream processing framework that handles both real-time streams and batch data with exactly-once guarantees. Used by Alibaba, Netflix, and Uber at massive scale.

What is Apache Flink?

In short

Apache Flink is an open source distributed engine for stateful processing of data streams. It treats batch data as a bounded stream, processes records one at a time as they arrive with millisecond latency, and keeps large amounts of application state with exactly-once correctness even when machines crash.

What Apache Flink Is

Apache Flink is a framework for running computations over continuous data streams. A stream is just an unbounded sequence of events: clicks, payments, sensor readings, log lines. Flink ingests those events as they arrive and runs your logic on each one, instead of waiting to collect a big file and process it later.

The key word is stateful. Most useful stream jobs need to remember things: a running count per user, the last 5 minutes of orders, a join between two streams. Flink stores that memory, called state, inside the job itself and manages it for you. State can grow into terabytes and Flink keeps it consistent.
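As a rough illustration of keyed state, the sketch below keeps a running count per user in a plain Python dictionary. It is not Flink code: `running_counts` and the sample `clicks` are hypothetical names, and the dictionary only stands in for the state that Flink would manage, checkpoint, and restore for you.

```python
from collections import defaultdict


def running_counts(events):
    """Yield (user, count_so_far) for each event, keeping one counter per key."""
    state = defaultdict(int)  # stands in for Flink-managed keyed state
    for user, _payload in events:
        state[user] += 1
        yield user, state[user]


clicks = [("alice", "home"), ("bob", "cart"), ("alice", "checkout")]
results = list(running_counts(clicks))
print(results)  # [('alice', 1), ('bob', 1), ('alice', 2)]
```

The job never sees the whole dataset at once; it only needs the current event and the remembered count for that key.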

Flink also handles batch data, but it does so by modeling a finite dataset as a stream that happens to have an end. This is the opposite of older systems like Spark, which started as batch and bolted streaming on top by chopping data into tiny micro-batches. Flink is streaming first.

How It Works Under The Hood

You write a job as a graph of operators: sources read events, transform operators map and filter and aggregate, and sinks write results out. Flink turns that graph into parallel tasks and spreads them across worker processes called TaskManagers, coordinated by a JobManager. Each operator instance handles a slice of the data, partitioned by a key such as user ID.
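Partitioning by key can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: Flink's real routing goes through key groups and its own hash function, while `subtask_for` and `partition` are hypothetical helpers that use CRC32 only so the result is stable across runs.

```python
import zlib


def subtask_for(key: str, parallelism: int) -> int:
    """Map a key to one of `parallelism` parallel operator instances."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def partition(events, parallelism):
    """Route (key, value) events so every event for a key lands in one bucket."""
    buckets = [[] for _ in range(parallelism)]
    for key, value in events:
        buckets[subtask_for(key, parallelism)].append((key, value))
    return buckets


events = [("u1", 1), ("u2", 2), ("u1", 3), ("u3", 4), ("u2", 5)]
buckets = partition(events, parallelism=4)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Because all events for one key go to the same instance, that instance can keep the key's state locally without coordinating with the others.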

State lives next to each operator: on the JVM heap by default, or on local disk through an embedded RocksDB store when state is large, so reads and writes stay fast. To survive failures, Flink uses asynchronous barrier snapshotting, a variant of the Chandy-Lamport algorithm: it injects markers called barriers into the stream at the sources, and each operator snapshots its state to durable storage like S3 or HDFS when the barrier reaches it. Once every operator has acknowledged the same barrier, those snapshots together form one consistent checkpoint. Checkpoints run at an interval you configure, often every few seconds.
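The barrier idea can be shown with a toy operator in plain Python. This sketch assumes a single input channel, so it skips the barrier alignment that a real multi-input operator needs, and it copies state synchronously into a dictionary instead of writing asynchronously to durable storage. `Barrier` and `CountOperator` are hypothetical names, not Flink classes.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    """Marker injected into the stream to trigger one checkpoint."""
    checkpoint_id: int


class CountOperator:
    """Counts events per key and snapshots its state when a barrier arrives."""

    def __init__(self):
        self.counts = {}
        self.snapshots = {}  # checkpoint_id -> state copy (stands in for S3/HDFS)

    def process(self, item):
        if isinstance(item, Barrier):
            # The snapshot holds every event before the barrier and none after it.
            self.snapshots[item.checkpoint_id] = copy.deepcopy(self.counts)
            return item  # forward the barrier to the next operator
        self.counts[item] = self.counts.get(item, 0) + 1
        return (item, self.counts[item])


op = CountOperator()
stream = ["a", "b", Barrier(1), "a"]
outputs = [op.process(item) for item in stream]
print(op.snapshots[1])  # {'a': 1, 'b': 1}
print(op.counts)        # {'a': 2, 'b': 1}
```

The snapshot for checkpoint 1 is unaffected by the later event, which is what makes it a consistent cut of the stream.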

When a machine dies, Flink rolls every operator back to the last completed checkpoint and replays the events that came after it. Because the source position and the state are restored together, each event affects the result exactly once. This is the exactly-once guarantee, and it is the main reason teams pick Flink over simpler tools.
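A small simulation shows why restoring the source position and the state together matters. This is plain Python under simplifying assumptions: one operator that sums integers, a list as the replayable source, and a single simulated crash. `run_sum` and its parameters are hypothetical.

```python
def run_sum(events, checkpoint_every, crash_at=None, restore_state=True):
    """Sum events, optionally crashing once at offset `crash_at` and recovering."""
    state = 0
    offset = 0
    checkpoint = (0, 0)  # (source offset, operator state), saved together
    crashed = False
    while offset < len(events):
        if crash_at == offset and not crashed:
            crashed = True
            saved_offset, saved_state = checkpoint
            offset = saved_offset      # rewind the source
            if restore_state:
                state = saved_state    # and roll the state back with it
            continue
        state += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, state)
    return state


events = [1, 2, 3, 4, 5, 6, 7]
no_failure = run_sum(events, checkpoint_every=3)
recovered = run_sum(events, checkpoint_every=3, crash_at=5)
offset_only = run_sum(events, checkpoint_every=3, crash_at=5, restore_state=False)
print(no_failure, recovered, offset_only)  # 28 28 37
```

Rewinding only the source replays events 4 and 5 into state that already contains them, so the total is inflated. Rolling both back gives the same answer as a run with no failure.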

Flink also understands event time, the timestamp baked into each event, separate from when it was processed. Watermarks tell Flink how far event time has progressed so it can correctly close time windows even when events arrive late or out of order, which is normal in real networks.
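The interaction between watermarks and windows can be sketched as follows. This is a simplified model, not Flink's API: it assumes the watermark trails the largest event time seen by a fixed `max_out_of_orderness`, uses tumbling windows with non-negative integer timestamps, and drops events whose window has already fired. The function name and parameters are hypothetical.

```python
def tumbling_window_sums(events, window_size, max_out_of_orderness):
    """Sum values per event-time window, firing a window once the watermark passes its end."""
    open_windows = {}  # window start -> running sum
    fired = []         # (window start, sum) in firing order
    dropped = []       # events that arrived after their window fired
    max_ts = None
    watermark = float("-inf")
    for ts, value in events:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            dropped.append((ts, value))
        else:
            open_windows[start] = open_windows.get(start, 0) + value
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - max_out_of_orderness
        for window_start in sorted(open_windows):
            if window_start + window_size <= watermark:
                fired.append((window_start, open_windows.pop(window_start)))
    return fired, open_windows, dropped


# (event_time, value) pairs in arrival order; 8 and 3 arrive out of order.
arrivals = [(1, 1), (12, 1), (8, 1), (16, 1), (3, 1), (27, 1)]
fired, still_open, dropped = tumbling_window_sums(arrivals, window_size=10, max_out_of_orderness=5)
print(fired)       # [(0, 2), (10, 2)]
print(still_open)  # {20: 1}
print(dropped)     # [(3, 1)]
```

The event at time 8 arrives after the event at time 12 but still lands in window 0, because the watermark had only reached 7. The event at time 3 arrives after the watermark passed 10, so its window is already closed.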

When To Use It And The Trade-offs

Reach for Flink when you need low latency on continuous data and the logic depends on accumulated state: real-time fraud scoring, live dashboards, alerting on metrics, joining a stream of orders with a stream of payments, or continuous ETL into a warehouse. Sub-second results and correct math under failure are where it earns its keep.

The cost is operational weight. Flink is a distributed stateful system, so you run a cluster, tune memory and RocksDB, size checkpoint storage, and plan for state that can reach terabytes. Restoring a huge job from a checkpoint takes time, and a badly tuned job can stall under backpressure when a slow sink pushes the slowness upstream.
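Backpressure can be pictured as a bounded buffer between a fast producer and a slow sink. The tick-based simulation below is a hypothetical model, not how Flink's network stack is implemented: the producer emits one record per tick unless the buffer is full, and the sink drains one record every `sink_period` ticks.

```python
from collections import deque


def simulate(ticks, buffer_capacity, sink_period):
    """Return (produced, consumed, blocked_ticks) for a producer feeding a slow sink."""
    buffer = deque()
    produced = consumed = blocked_ticks = 0
    for tick in range(ticks):
        if tick % sink_period == 0 and buffer:
            buffer.popleft()  # the slow sink takes one record
            consumed += 1
        if len(buffer) < buffer_capacity:
            buffer.append(produced)  # the producer emits one record
            produced += 1
        else:
            blocked_ticks += 1  # buffer full: the producer has to wait
    return produced, consumed, blocked_ticks


produced, consumed, blocked_ticks = simulate(ticks=12, buffer_capacity=3, sink_period=3)
print(produced, consumed, blocked_ticks)  # 6 3 6
```

Once the buffer fills, the producer is throttled to the sink's pace, which is how a slow sink ends up limiting the throughput of everything upstream.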

If your problem is a once-a-day report over files, plain batch with Spark or even SQL is simpler and cheaper. If you just need to move events between systems with light transforms, Kafka Streams or a managed service may be enough. Flink shines when correctness, latency, and large state all matter at the same time.

A Concrete Example

Imagine a payments company that wants to flag a card the moment it is used in two countries within 60 seconds. A Flink job reads the transaction stream from Kafka keyed by card number. For each card it keeps state: the last country and the timestamp of the last swipe.

When a new transaction arrives, the operator compares it against that state. If the country differs and the two events fall inside a 60 second event-time window, it emits an alert to another Kafka topic and the fraud team reacts in real time. The window logic uses watermarks so a transaction delayed by the network still lands in the right window.
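The per-card comparison can be sketched in plain Python. This is a minimal sketch that assumes events for a card arrive in event-time order, so it leaves out the watermark handling described above. The dictionary stands in for Flink keyed state, `detect_fraud` is a hypothetical name, and "within 60 seconds" is taken here to include exactly 60.

```python
def detect_fraud(transactions, window_seconds=60):
    """Return alerts for cards used in two different countries within the window."""
    last_seen = {}  # card -> (country, event_time), one entry per key
    alerts = []
    for card, country, ts in transactions:
        previous = last_seen.get(card)
        if previous is not None:
            prev_country, prev_ts = previous
            if prev_country != country and ts - prev_ts <= window_seconds:
                alerts.append((card, prev_country, country, ts))
        last_seen[card] = (country, ts)
    return alerts


# (card, country, event_time_seconds)
transactions = [
    ("c1", "US", 0),
    ("c2", "IN", 5),
    ("c1", "FR", 40),   # different country 40 s later: alert
    ("c2", "IN", 30),   # same country: no alert
    ("c1", "US", 200),  # different country but 160 s later: no alert
]
alerts = detect_fraud(transactions)
print(alerts)  # [('c1', 'US', 'FR', 40)]
```

In a real job the list of alerts would be a sink writing to the alert topic, and the state would be checkpointed along with the Kafka offsets.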

If a TaskManager crashes mid-stream, Flink restores every card's state from the last checkpoint and replays the Kafka offsets that were saved in that same checkpoint. No alert is missed, and with a transactional Kafka sink none is duplicated downstream. That combination of per-key state, event-time windows, and exactly-once recovery is exactly what Flink was built for.

Where it is used in production

Alibaba

Runs Flink at massive scale for real-time search ranking and recommendations during Singles Day; its Blink fork was merged back into Flink.

Netflix

Uses Flink for real-time event processing across thousands of streaming jobs feeding personalization and operational metrics.

Uber

Runs surge pricing, fraud detection, and its AthenaX streaming SQL platform on Flink to process trip and rider events in real time.

Apache Kafka

Flink's most common source and sink; jobs read events from Kafka topics, store the consumed offsets in checkpoints, and write results back to Kafka.

Learn Apache Flink hands-on

This page explains the idea. The full lesson lets you step through the topic interactively, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.

Related lessons

Lessons that touch on Apache Flink as part of a larger topic.

Checkpointing
Save system state at known-good points to enable fast recovery after failures
advanced · consistency models
