Database Types and Storage
Picture the day your single Postgres instance stops being enough. Search queries crawl, your analytics dashboards time out, the recommendation feature your team promised needs vector similarity you do not have, and your storage bill keeps climbing because every byte you have ever written still sits on fast disk. None of those problems is solved by tuning one database harder. They are solved by picking the right kind of storage for each job. That decision, repeated across a system, is what separates an architecture that scales from one that quietly falls over at the worst possible moment.
This category covers the full landscape of how data is stored and retrieved: the physical storage layers (file, block, object, blob), the database families built on top (key-value, document, columnar, column-family, time-series, graph, in-memory, vector), the indexes that make reads fast (inverted, forward, bitmap, geospatial), the data structures storage engines run on (LSM trees, skip lists, bloom filters, merkle trees), the analytics platforms (data warehousing, data lakes), the lifecycle tiers that control cost (hot-warm-cold, tiered, cold, archive, hybrid), and the patterns for combining many stores in one system (polyglot persistence, federation, multi-model, NewSQL, distributed SQL, global tables). The goal is to know what each one is good at, what it is bad at, and which one to reach for under real pressure.
From Raw Storage to Database Engines
Every database sits on top of a storage primitive, and the primitive shapes what the database can do. File storage gives you a hierarchy of named files and directories, which is why shared file systems back content pipelines and home directories. Block storage hands you raw fixed-size blocks with no notion of files at all, which is what database engines actually want when they manage their own pages and write patterns. Object storage drops the hierarchy entirely and stores immutable blobs addressed by a key, with metadata attached, which is why it underpins almost every modern data lake and backup system. Blob storage is essentially the same idea under a cloud vendor's name, as in Azure Blob Storage.
The trade-off line runs along latency, mutability, and scale. Block storage offers the lowest latency and supports in-place updates at the block level, but it is the hardest to scale and usually the most expensive per gigabyte. Object storage scales to effectively unlimited capacity at low cost but has higher latency and treats objects as write-once: to change an object you replace it whole rather than edit it in place. File storage sits in the middle and is the most familiar to applications but the least elastic. A system design answer that says "store the user uploads in object storage and put the hot operational data on block storage" is already showing it understands the layers underneath the database.
On top of these primitives, the storage engine decides how data is laid out on disk. Row-oriented storage keeps a whole record together, which is ideal when you read and write entire rows in transactional workloads. Columnar storage keeps each column together, which is what makes analytical scans over billions of rows fast: you read only the columns you need, and values of the same type stored side by side compress well. Knowing whether a workload is row-shaped or column-shaped is the first fork in almost every storage decision.
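The row versus column distinction can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular engine lays out pages on disk; the sample records and the `to_columns` helper are hypothetical.

```python
# Hypothetical sample data: three order records with three fields each.
rows = [
    {"order_id": 1, "customer": "ada", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "ada", "amount": 7.5},
]


def to_columns(records):
    """Pivot a row-oriented list of dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Transactional access: fetch one whole record. The row layout keeps it together.
order_2 = rows[1]

# Analytical access: aggregate one field. The column layout touches only that field.
total_amount = sum(columns["amount"])
```

In the row layout, summing `amount` would mean walking every record and skipping the fields you do not need. In the column layout the aggregate reads a single contiguous list, which is the property analytical engines exploit.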
The Database Families and What Each Is Built For
There is no single best database, only databases that are good at specific access patterns. Key-value stores give you the fastest possible lookup by a single key and almost nothing else, which is exactly right for sessions, feature flags, and caches. Document stores let each record carry a flexible nested structure, which suits product catalogs and user profiles where the shape varies. Column-family stores spread wide, sparse rows across many machines and are built for very high write throughput at massive scale. Time-series databases optimize for append-heavy, timestamp-ordered data like metrics and sensor readings, where you almost always query recent windows.
Graph databases store relationships as first-class objects, so traversing connections like "friends of friends who liked this" stays fast even many hops deep, where the equivalent chain of relational joins grows expensive with every hop. In-memory databases keep the working set in RAM to push latency into the microsecond range, trading durability guarantees and cost for raw speed. Vector databases store high-dimensional embeddings and answer similarity-search queries, which is the storage layer that makes semantic search and retrieval-augmented generation possible. Embedding storage, similarity search, and approximate nearest neighbor (ANN) search are the mechanics that make a vector database useful at scale: exact nearest-neighbor search compares the query against every stored vector, which is often too slow over millions of vectors, so ANN trades a little accuracy for a large speedup.
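To make the similarity-search idea concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search using cosine similarity. It is the slow baseline that ANN indexes approximate, not an ANN algorithm itself. The three-dimensional vectors and document names are made up for illustration; real embeddings usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Exact search: score every stored vector, then keep the top k keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical embeddings keyed by document name.
embeddings = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}

top = nearest([1.0, 0.0, 0.0], embeddings, k=2)
```

The cost of `nearest` grows linearly with the number of stored vectors, which is why a vector database at scale builds an index that inspects only a fraction of them per query.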
For search itself, dedicated search engines like Elasticsearch and Solr exist because general-purpose databases are rarely as good at matching and ranking free text. They are powered by an inverted index, which maps each term to the documents that contain it, the inverse of a forward index that maps each document to its terms. Bitmap indexes accelerate filtering on low-cardinality columns, and geospatial indexing makes "what is near me" queries fast by partitioning two-dimensional space.
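The relationship between a forward index and an inverted index can be sketched as follows. This is a toy version that assumes whitespace tokenization and lowercasing only; real search engines add analysis steps such as stemming, and they rank results rather than just matching them. The documents and function names are hypothetical.

```python
from collections import defaultdict

# Forward index: document id -> text (hypothetical sample documents).
forward_index = {
    1: "fast object storage",
    2: "fast key value store",
    3: "object storage for backups",
}


def build_inverted_index(docs):
    """Invert the mapping: term -> set of document ids containing it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)
    return inverted


def search_all(inverted, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result


inverted_index = build_inverted_index(forward_index)
matches = search_all(inverted_index, "object storage")
```

A query never scans document text. It looks up each term's posting set and intersects them, so the work depends on how many documents contain the terms rather than on the size of the whole collection.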
Choosing, Combining, and the Trade-offs That Matter
The honest answer to "which database should I use" is usually "more than one." Polyglot persistence is the deliberate practice of using a different store for each part of the system: a relational database for orders, a key-value store for sessions, a search engine for the catalog, a time-series database for metrics. Database federation puts a query layer in front of several stores so they look like one, and multi-model databases try to support several data models in a single engine to reduce operational sprawl. Each of these is a trade between operational simplicity and using the best tool for each job.
The SQL world has been catching up to the scale that NoSQL once owned. NewSQL databases keep the relational model and ACID transactions while scaling horizontally, and distributed SQL extends that across many nodes and regions. Global tables replicate a single logical table across regions so users everywhere read locally, accepting the consistency trade-offs that come with geo-replication. When you need transactions and scale together, these are the families to study, because the old assumption that you must give up SQL to scale is no longer true.
Underneath all of this sit the data structures that make storage engines work, and understanding them explains why each database behaves the way it does. LSM trees buffer writes in memory and flush sorted runs to disk, which is why write-heavy stores like Cassandra are fast on writes but pay for it on reads, which may have to check several runs, and in background compaction. Bloom filters let an engine skip reading a file that definitely does not contain a key, cutting wasted disk reads. Skip lists give in-memory stores ordered data with simple, fast inserts, and merkle trees let replicas detect and repair differences efficiently. These topics also come up often in senior interviews.
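The interplay between an LSM tree's write path and its bloom filters can be sketched as below. This is a deliberately tiny in-memory model under stated assumptions: "disk" runs are plain dicts, there is no write-ahead log, no deletes, and no compaction. `BloomFilter` and `TinyLSM` are hypothetical names, not any database's API.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may give false positives, never false negatives."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}  # in-memory write buffer
        self.memtable_limit = memtable_limit
        self.runs = []  # flushed runs, newest first: (bloom, data)
        self.runs_skipped = 0  # run lookups avoided thanks to bloom filters

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a key-sorted run with its own bloom filter.
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.runs.insert(0, (bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, data in self.runs:  # newest run wins
            if not bloom.might_contain(key):
                self.runs_skipped += 1  # definitely absent: skip the read
                continue
            if key in data:
                return data[key]
        return None


store = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.put(key, value)
```

Writes only ever touch the memtable, which is the source of the write speed. A read for `"b"` has to fall through the memtable and the newer run before finding the value in the older one, which is the read cost the text describes. The bloom filters reduce that cost by ruling runs out without reading them.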
How Real Companies Put This Together
Look inside any large system and you find a fleet of storage systems, not one database. Netflix stores user-facing data in Cassandra (a column-family store on LSM trees), keeps its viewing history and metrics in time-series and search systems, and pushes its enormous video catalog into object storage on S3, which also backs a data lake for analytics. The point is not the brand names. It is that each workload landed on the storage type that fits its access pattern.
Cost discipline is the other half of the story, and it is handled by storage lifecycle. Hot-warm-cold architecture and tiered storage move data to cheaper, slower media as it ages, so the metrics you queried today live on fast storage while last year's logs drift down to cold storage and eventually to archive storage, which costs very little but can take hours to retrieve from. Hybrid storage blends on-premises and cloud to balance control and elasticity. Companies that get this right pay for performance only where they need it.
Spotify, Uber, and most data-heavy platforms run some version of this tiered, polyglot setup: a graph or document store for the social and catalog data, a search engine over an inverted index for discovery, a vector database for recommendations, a data warehouse and data lake for analytics, and aggressive tiering underneath to keep the bill sane. Learning these as one connected map, rather than as isolated buzzwords, is what lets you design storage that holds up in production.
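The hot-warm-cold lifecycle described above comes down to a policy that maps data age to a tier. A minimal sketch follows. The thresholds of 7, 90, and 365 days are illustrative assumptions, not recommendations, and `pick_tier` is a hypothetical helper rather than any cloud provider's lifecycle API.

```python
from datetime import date, timedelta

# Hypothetical age thresholds, checked in order. Real policies depend on
# access patterns and on the pricing of each storage class.
TIER_RULES = [
    (timedelta(days=7), "hot"),
    (timedelta(days=90), "warm"),
    (timedelta(days=365), "cold"),
]


def pick_tier(last_accessed, today):
    """Return the cheapest tier that is still appropriate for the data's age."""
    age = today - last_accessed
    for max_age, tier in TIER_RULES:
        if age <= max_age:
            return tier
    return "archive"  # older than every threshold


today = date(2024, 6, 1)
recent_metrics_tier = pick_tier(date(2024, 5, 30), today)
old_logs_tier = pick_tier(date(2022, 1, 1), today)
```

In practice a rule like this usually lives in a managed lifecycle configuration rather than in application code, but the decision it encodes is the same: age or access recency in, storage class out.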