Is this a video course?

No. This is an interactive, slide-based learning platform. Each lesson has rich text, animated diagrams, live code editors, and quizzes. You learn by reading, interacting, and doing, not by watching videos passively.

How long do I have access?

Forever. Both pricing tiers are one-time payments with lifetime access. This includes all current 766 lessons and any future content we add.

What level of experience do I need?

None. We start from absolute basics like 'What is latency?' and build up to distributed consensus protocols. The Foundation level assumes zero prior knowledge of system design.

How much does the system design course cost?

7.99 US dollars for lifetime access globally, or 499 Indian rupees for lifetime access in India. One-time payment, no subscription, no hidden fees. 11 lessons are free with no signup required.

What technologies are covered?

Everything from DNS and load balancers to Kubernetes, Kafka, distributed databases, consensus protocols, stream processing, security architecture, and observability. We cover principles and real-world implementations used at Netflix, Google, Amazon, Uber, Stripe, and more.

Is this useful for system design interview preparation?

Yes. The lessons are structured around the topics commonly asked in system design interviews at FAANG and top-tier companies. Interactive diagrams help you practice whiteboard-style explanations. The course covers everything from URL shortener design to distributed payment systems.

How is this different from ByteByteGo or Educative?

The course has 766 interactive lessons (4x more than most competitors), 16 different diagram types that build step by step, and real production examples from Netflix, Google, Amazon, Uber, and Stripe. It offers lifetime access for a one-time payment of $7.99 instead of an annual subscription costing 100 to 200 dollars per year.

What is the difference between a data lake and a data warehouse?

A warehouse stores cleaned, structured data and enforces a schema before loading, optimized for fast SQL analytics. A lake stores raw data of any format and applies a schema only at read time. Warehouses cost more per gigabyte but give predictable performance; lakes are cheaper and more flexible but need governance to stay useful.

What is a data swamp?

A data swamp is a data lake that lost its discipline. Without a metadata catalog, naming conventions, and quality checks, the lake fills with undocumented and duplicated files that nobody can find or trust. The data is technically there but practically unusable, which defeats the purpose.

What is a lakehouse and how is it different from a data lake?

A lakehouse keeps the cheap open object storage of a lake but adds warehouse features like ACID transactions, schema enforcement, and fast SQL through table formats such as Apache Iceberg or Delta Lake. It aims to give you one system instead of copying data between a separate lake and warehouse.

What file formats are used in a data lake?

Raw data can be any format, but analytical data is usually stored in columnar formats like Apache Parquet or ORC because they compress well and let query engines read only the needed columns. Table formats like Iceberg, Delta Lake, and Hudi sit above these files to add transactions and schema evolution.

Why store data on object storage instead of a database?

Object storage like S3 costs around 2 to 3 cents per gigabyte per month and scales to exabytes with no capacity planning. It also separates storage from compute, so you can run large query engines on demand and pay only for compute while it runs. A database couples the two and is far more expensive at petabyte scale.


Data Lake

A centralized repository that stores raw data at any scale in its native format. Unlike a data warehouse, data doesn't need to be structured or cleaned before loading.

What is a Data Lake?

In short

A data lake is a centralized storage repository that holds raw data of any type at massive scale in its native format, whether structured tables, JSON logs, images, or video. Unlike a data warehouse, you load data first and decide its structure later when you read it, which is called schema-on-read.

What a data lake actually is

A data lake is one big pool of storage where you dump data exactly as it arrives, with no transformation step required up front. Clickstream JSON, database table exports, Parquet files, PDFs, sensor readings, and raw video can all sit side by side in the same place.

The defining idea is schema-on-write versus schema-on-read. A traditional data warehouse forces you to define columns and clean the data before it can land, which is schema-on-write. A data lake flips that: you write raw bytes immediately and only impose a schema when a query or job reads the file, which is schema-on-read.
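As a rough illustration of schema-on-read, the sketch below keeps raw JSON lines untouched and applies a schema only when they are read. The sample records, the SCHEMA mapping, and the read_with_schema helper are all hypothetical names invented for this example, not part of any particular lake product.

```python
import json

# Raw events exactly as they arrived; fields and types are inconsistent on purpose.
RAW_LINES = [
    '{"user_id": "42", "event": "click", "ts": 1700000000}',
    '{"user_id": 7, "event": "view"}',
    '{"user_id": "oops", "event": "click", "ts": "1700000100"}',
]

# Hypothetical read-time schema: column name -> (type caster, default if missing).
SCHEMA = {
    "user_id": (int, None),
    "event": (str, "unknown"),
    "ts": (int, 0),
}


def read_with_schema(lines, schema):
    """Parse raw lines and coerce each record to the schema at read time."""
    rows, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            row = {
                col: cast(record[col]) if col in record else default
                for col, (cast, default) in schema.items()
            }
        except (TypeError, ValueError):
            # The bad record was stored without complaint; the reader pays for it.
            rejected.append(line)
            continue
        rows.append(row)
    return rows, rejected


rows, rejected = read_with_schema(RAW_LINES, SCHEMA)
print(len(rows), "rows parsed,", len(rejected), "rejected")
```

Nothing stopped the malformed third record from being written. It only surfaces as a problem when a reader applies the schema, which is the trade-off a schema-on-write system avoids by rejecting it at load time.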

In practice the storage layer is almost always cheap object storage. Amazon S3, Azure Data Lake Storage Gen2, and Google Cloud Storage are the common backends. Object storage costs roughly 2 to 3 cents per gigabyte per month, which is why teams keep petabytes around instead of throwing data away.
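To put the per-gigabyte figure in perspective, here is a back-of-the-envelope calculation for one petabyte. It assumes decimal units (1 PB = 1,000,000 GB) and counts storage only, ignoring request, retrieval, and data transfer charges, so treat the result as an order-of-magnitude estimate rather than a quote.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB


def monthly_storage_cost_usd(petabytes, usd_per_gb_month):
    """Storage-only estimate; request and transfer fees are not included."""
    return petabytes * GB_PER_PB * usd_per_gb_month


low = monthly_storage_cost_usd(1, 0.02)   # 2 cents per GB per month
high = monthly_storage_cost_usd(1, 0.03)  # 3 cents per GB per month
print(f"1 PB costs roughly ${low:,.0f} to ${high:,.0f} per month")
```

At these rates a petabyte lands in the range of 20,000 to 30,000 dollars per month for storage alone.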

How it works under the hood

A data lake is not a single product; it is a layered design on top of object storage. The bottom layer is the raw files in S3 or an equivalent store. On top sits a table format such as Apache Iceberg, Delta Lake, or Apache Hudi that tracks which files belong to which table and adds ACID transactions, schema evolution, and time travel.
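The idea of a table format tracking which files belong to a table can be sketched with a toy snapshot log. This is a deliberately simplified model, not the actual metadata layout of Iceberg, Delta Lake, or Hudi. The ToyTable and Snapshot classes and the s3://lake/... paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # object-storage paths that make up the table at this point


@dataclass
class ToyTable:
    # Snapshot 0 is the empty table; snapshot ids double as list indexes.
    snapshots: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    def commit(self, add=(), remove=()):
        """Publish a new immutable snapshot derived from the current one.

        Real table formats make this step atomic; here it is just a list append.
        """
        current = self.snapshots[-1]
        files = (current.files - frozenset(remove)) | frozenset(add)
        snap = Snapshot(current.snapshot_id + 1, files)
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files_at(self, snapshot_id=None):
        """File list for the latest snapshot, or an older one (time travel)."""
        if snapshot_id is None:
            return self.snapshots[-1].files
        return self.snapshots[snapshot_id].files


table = ToyTable()
v1 = table.commit(add=["s3://lake/events/part-0001.parquet"])
v2 = table.commit(add=["s3://lake/events/part-0002.parquet"])
# A compaction-style commit: swap one file out and another in.
v3 = table.commit(
    add=["s3://lake/events/part-0003.parquet"],
    remove=["s3://lake/events/part-0001.parquet"],
)

print(sorted(table.files_at()))    # current view of the table
print(sorted(table.files_at(v1)))  # the table as it looked after the first commit
```

Because old snapshots are never mutated, a reader that started on an earlier snapshot keeps a consistent view while writers commit new ones.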

A separate metadata catalog records table names, partitions, and file locations so query engines can find data without scanning everything. AWS Glue Data Catalog and Apache Hive Metastore are typical choices. Without a catalog, a lake degrades into a pile of files nobody can query efficiently, often called a data swamp.
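A minimal sketch of why the catalog matters: if partitions and file locations are recorded up front, a query for one day can skip every other file without listing the bucket. The CATALOG dictionary, the date= partition naming, and the paths below are hypothetical and stand in for what a service like Glue Data Catalog or Hive Metastore would store.

```python
# Hypothetical catalog: table name -> partition key -> file locations.
CATALOG = {
    "events": {
        "date=2024-01-01": ["s3://lake/events/date=2024-01-01/part-0.parquet"],
        "date=2024-01-02": [
            "s3://lake/events/date=2024-01-02/part-0.parquet",
            "s3://lake/events/date=2024-01-02/part-1.parquet",
        ],
        "date=2024-01-03": ["s3://lake/events/date=2024-01-03/part-0.parquet"],
    }
}


def files_for(table, wanted_dates):
    """Return only the files in partitions whose date is in wanted_dates."""
    selected = []
    for partition_key, paths in CATALOG[table].items():
        _, value = partition_key.split("=", 1)
        if value in wanted_dates:
            selected.extend(paths)
    return selected


def all_files(table):
    """What an engine would have to scan with no partition information."""
    return [path for paths in CATALOG[table].values() for path in paths]


pruned = files_for("events", {"2024-01-02"})
print(f"scan {len(pruned)} of {len(all_files('events'))} files")
```

The same lookup without a catalog would mean listing and opening every object under the table prefix, which is the slow path the paragraph above describes.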

Query engines like Trino, Apache Spark, Amazon Athena, and Presto read the files directly. Because compute is separate from storage, you can spin up a hundred Spark workers for a nightly job and shut them down afterward, paying only for the minutes you used. Columnar file formats like Parquet and ORC make these scans fast by reading only the columns a query needs.
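The benefit of columnar layout can be shown with a toy comparison that counts how many values each layout touches to sum one column. This is a simplified in-memory model, not how Parquet or ORC are actually encoded on disk, and the sample data and function names are invented for the example.

```python
ROWS = [
    {"user_id": 1, "country": "IN", "watch_seconds": 120},
    {"user_id": 2, "country": "US", "watch_seconds": 300},
    {"user_id": 3, "country": "IN", "watch_seconds": 45},
]


def to_columns(rows):
    """Pivot row records into one list per column (a simplified columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


COLUMNS = to_columns(ROWS)


def total_row_layout(rows, column):
    """Row layout: every field of every row is read to reach one column."""
    values_read = sum(len(row) for row in rows)
    return sum(row[column] for row in rows), values_read


def total_column_layout(columns, column):
    """Column layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


row_total, row_reads = total_row_layout(ROWS, "watch_seconds")
col_total, col_reads = total_column_layout(COLUMNS, "watch_seconds")
print(f"row layout read {row_reads} values, column layout read {col_reads}")
```

Both layouts give the same answer, but the gap in values read grows with the number of columns in the table, which is why wide analytical tables favor columnar files.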

When to use it and the trade-offs

Reach for a data lake when you have high volumes of varied or semi-structured data, when you do not yet know every question you will ask, or when machine learning teams need raw features that a cleaned warehouse would have thrown away. It is the natural home for logs, events, and large media.

The cost is governance and discipline. Schema-on-read means bad or inconsistent data flows in freely, and the burden of making sense of it lands on every reader. Lakes need a catalog, naming conventions, partition strategy, and data quality checks, otherwise they rot into swamps that nobody trusts.
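One inexpensive form of discipline is checking object paths against a naming convention before they land. The convention below (a raw or curated zone, a snake_case dataset name, a date= partition, and a Parquet file) is a hypothetical example, not a standard, and the paths are made up.

```python
import re

# Hypothetical convention: <zone>/<dataset>/date=YYYY-MM-DD/<file>.parquet
PATH_PATTERN = re.compile(
    r"^(raw|curated)/[a-z][a-z0-9_]*/date=\d{4}-\d{2}-\d{2}/[^/]+\.parquet$"
)


def nonconforming(paths):
    """Return the paths that break the naming convention."""
    return [path for path in paths if not PATH_PATTERN.match(path)]


paths = [
    "raw/playback_events/date=2024-01-01/part-0000.parquet",
    "curated/trips/date=2024-01-02/part-0001.parquet",
    "tmp/final_v2_REAL.csv",
    "raw/PlaybackEvents/2024-01-01/data.parquet",
]

bad = nonconforming(paths)
for path in bad:
    print("rejects:", path)
```

A check like this could run in an ingestion job or a periodic audit. It only validates names, so it complements rather than replaces a catalog and content-level quality checks.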

Many teams now run a lakehouse, which keeps the cheap open storage of a lake but layers on warehouse features like ACID transactions, fine-grained access control, and SQL performance through Iceberg or Delta Lake. This blurs the old lake versus warehouse split and is the dominant pattern as of 2026.

Where it is used in production

Netflix

Stores petabytes of viewing and playback events on Amazon S3 and queries them with Trino and Apache Iceberg, a table format Netflix originally created.

Uber

Built Apache Hudi to manage incremental upserts on its data lake, handling trip and pricing data at the scale of hundreds of petabytes.

Databricks

Sells the lakehouse pattern commercially through Delta Lake, adding ACID transactions and SQL performance on top of raw cloud object storage.

Amazon Web Services

Provides the common stack: S3 for storage, Glue Data Catalog for metadata, and Athena for serverless SQL queries directly over lake files.


Learn Data Lake hands-on

This page explains the idea. The full lesson lets you step through animated diagrams, read the implementation, and check yourself with a quiz. It is one of 760+ lessons in the System Design Masterclass, from your first API call to distributed consensus. Eleven Foundation lessons are free, no signup. Lifetime access is ₹499 in India or $7.99 worldwide, one payment, no subscription.
