Data Lake
A centralized repository that stores raw data at any scale in its native format. Unlike a data warehouse, data doesn't need to be structured or cleaned before loading.
What is a Data Lake?
A data lake is an advanced concept in the Stream & Batch Processing area of system design. Engineers reach for it when they need to reason about real-world trade-offs in that space, not only for textbook correctness: production systems at companies like Netflix, Amazon, and Google make these decisions every day.
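To make the contrast with a warehouse concrete, here is a minimal sketch of the "store raw, apply structure on read" idea. It assumes a local temporary directory standing in for the lake's storage; the function names `land_raw` and `read_clicks`, the folder layout, and the event fields are hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake: Path, source: str, name: str, payload: bytes) -> Path:
    """Write bytes exactly as received; nothing is parsed or cleaned at load time."""
    target = lake / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_clicks(lake: Path) -> list:
    """Schema-on-read: structure is imposed only when the data is queried."""
    rows = []
    for path in sorted((lake / "raw" / "clicks").glob("*.jsonl")):
        for line in path.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # malformed lines stay in the lake but are skipped by this reader
            if isinstance(event, dict) and "user" in event and "page" in event:
                rows.append({"user": str(event["user"]), "page": str(event["page"])})
    return rows


# Hypothetical usage: two unrelated formats land side by side, untouched.
lake = Path(tempfile.mkdtemp())
land_raw(
    lake, "clicks", "2024-01-01.jsonl",
    b'{"user": 1, "page": "/home"}\nnot json\n{"user": 2}\n'
    b'{"user": 3, "page": "/cart", "extra": true}\n',
)
land_raw(lake, "logs", "app.log", b"12:00 server started\n")

clicks = read_clicks(lake)
print(clicks)
```

The loader never rejects anything, so incomplete or malformed records still end up in storage. Each reader decides which fields it needs and what to do with records that do not fit, which is the trade-off a warehouse avoids by cleaning data before loading.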
See also
Related glossary terms you might want to look up next.
Data Warehouse
A central repository of structured, cleaned data optimized for analytical queries. Snowflake, BigQuery, and Redshift are purpose-built data warehouses.
ETL
Extract, Transform, Load: a pipeline that extracts data from sources, transforms it into the desired format, and loads it into a destination like a data warehouse.
Object Storage
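A storage architecture that manages data as objects (data + metadata + unique ID) rather than as blocks or as files in a directory hierarchy. Amazon S3 is the best-known example. Object stores are designed to scale to very large data volumes at low cost per gigabyte with high durability.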
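As a rough illustration of the "data + metadata + ID" model described under Object Storage, the following toy in-memory store keeps each object under a generated ID in a flat key space. The class `ToyObjectStore` and its `put`/`get` methods are hypothetical and are not the API of S3 or any real service.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    object_id: str   # unique ID used to address the object
    data: bytes      # the payload, stored as an opaque blob
    metadata: dict   # descriptive key/value pairs kept alongside the data


class ToyObjectStore:
    """Flat key space: objects are addressed by ID, not by block offset or directory path."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, data: bytes, **metadata: str) -> str:
        object_id = uuid.uuid4().hex
        self._objects[object_id] = StoredObject(object_id, data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]


# Hypothetical usage: store a blob with metadata, then fetch it back by ID.
store = ToyObjectStore()
oid = store.put(b"\x89PNG...", content_type="image/png", owner="analytics")
obj = store.get(oid)
print(obj.object_id == oid, obj.metadata["content_type"])
```

A real object store adds replication, access control, and a network API on top of this shape, but the addressing model stays the same: one ID in, one whole object out.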