Data Replication and Distribution
Every system that matters keeps more than one copy of its data. If your only database lives on a single machine and that machine dies, your business stops. If that machine is in Virginia and your users are in Singapore, every page load pays for a round trip across the planet. Replication and distribution are how you solve both problems at once: keep copies of data on multiple machines, in multiple regions, so that a failure takes out a copy instead of the whole service, and so that reads happen close to the people asking for them.
This is also where most production outages and data-loss incidents are born. The moment you have two copies, you have to answer hard questions. Which copy is the truth? How fast does a write on one copy reach the others? What happens when two regions accept conflicting writes at the same time? This hub walks through the full toolkit, from the replication modes that decide whether you trade durability for speed, through the topologies that decide how copies talk to each other, to the partitioning and geo-distribution techniques that let a single dataset span continents.
What Replication and Distribution Actually Mean
Replication is keeping the same data on more than one node and keeping those copies in agreement. Distribution is spreading data and traffic across many nodes, often across regions, so no single machine holds everything or serves everyone. The two ideas work together. Replication gives you redundancy and read scale; distribution gives you geographic reach and write scale. A real system usually does both at the same time.
The building block is the replication log. When a primary node accepts a write, it records that change and ships it to its replicas. How it records and ships that change is what separates the techniques. Physical replication copies the raw on-disk byte changes, so a replica is an exact block-level twin of the primary. Logical replication ships the change as a row-level event ("user 42 changed email to X"), which lets replicas run different storage engines or even different schema versions. Write-ahead log replication and streaming replication feed the replica directly from the primary's WAL as it is written, which is why they are fast and lossy-resistant. Log shipping batches up log segments and sends them on a schedule, which is simpler but lags more.
Change data capture sits slightly to the side of all this. Instead of replicating a database to another identical database, CDC reads the same replication stream (often the MySQL binlog or the Postgres WAL) and turns it into an event feed that other systems consume. That is how a transactional database keeps a search index, a cache, an analytics warehouse, and a downstream microservice in sync without anyone writing brittle dual-write code.
Synchronous, Asynchronous, and Semi-Synchronous Replication
The single most important decision is when the primary tells the client "your write succeeded." Synchronous replication waits until at least one replica has also stored the write before acknowledging. You lose nothing if the primary dies the next second, because a replica already has the data. The price is latency: every write now pays for a round trip to the replica, and if that replica is slow or unreachable, writes stall. This is the right choice for money movement and anything where losing a committed write is unacceptable.
Asynchronous replication acknowledges the write as soon as the primary has it, then ships to replicas in the background. Writes are fast and a slow replica never blocks the primary. The catch is replication lag: replicas trail the primary by milliseconds to seconds, so a read against a replica can return stale data, and a primary crash can lose the most recent writes that had not yet shipped. Most read-heavy web systems run async and design around the lag rather than paying for synchronous everywhere.
Semi-synchronous replication is the middle ground. The primary waits for a replica to acknowledge receipt of the write (it is in the replica's memory or relay log) but not for the replica to fully apply it. You get most of the durability of synchronous replication with much of the speed of asynchronous, which is why it is a common default for serious production clusters. Understanding replication lag is not optional here, because it determines whether "read your own writes" works for your users and how much data you can lose in a failover.
Topologies and Configurations: How Copies Talk
Once you have more than two nodes, you have to decide how changes flow between them. The simplest is point-to-point, one primary streaming to one replica. Scale that up and you reach for a topology. In a star or hub-and-spoke layout, a central node fans changes out to many leaf nodes, which is easy to reason about but makes the hub a bottleneck and a single point of failure. A ring passes changes node to node around a loop, which spreads load but adds latency with each hop and breaks awkwardly when a node in the ring dies. A mesh or many-to-many topology lets every node talk to every other node, giving the best resilience at the cost of much harder conflict handling.
The topology question is tightly linked to write configuration. Active-passive (active-standby) keeps one node taking writes while others stand ready to take over on failure; it is simple and conflict-free but wastes the standby capacity. Active-active lets multiple nodes accept writes at once, which gives you write scale and lets each region serve its own users with local latency. The hard part is that active-active and multi-master designs invite conflicts when two nodes change the same record concurrently, so you need conflict resolution rules, and bidirectional replication between two masters has to guard against changes echoing back and forth forever.
There is no universally correct topology. A small service is well served by a primary with async read replicas. A global product that must accept writes in every region needs active-active multi-master across a mesh, and must accept the conflict-resolution complexity that comes with it.
Distributing Data Across Nodes and Regions
Replication answers "how do I keep copies in sync." Distribution answers "where does each piece of data live in the first place." Data partitioning (sharding) splits a dataset so each node owns a slice, which is the only way to scale writes and storage past what one machine can hold. The challenge is deciding which node owns which key, and doing it so that adding or removing a node does not reshuffle everything.
Consistent hashing is the standard answer. It maps both keys and nodes onto a ring so that adding a node only moves the keys near it, instead of remapping the entire dataset. This is what lets systems like Dynamo, Cassandra, and large caching tiers grow and shrink without massive rebalancing. It pairs naturally with replication: each key is stored on the next few nodes around the ring, giving you partitioning and redundancy from one mechanism.
Geographic distribution layers region awareness on top. Data locality means placing data near the users and compute that need it, and data locality optimization is the ongoing work of measuring access patterns and moving or caching data to cut cross-region traffic. Geo-replication, cross-region replication, and multi-region deployment keep full or partial copies in several regions so reads are local and the service survives an entire region going dark. These are the techniques behind a single account that loads instantly whether you sign in from Mumbai, Frankfurt, or California.
How Real Companies Use This
Netflix runs Cassandra in an active-active, multi-region setup so that a failed AWS region does not interrupt streaming; consistent hashing spreads data across the ring and replication keeps copies in every region. Their teams routinely practice failing out of a region to prove the distribution actually works under stress.
Uber and large e-commerce platforms lean on change data capture from their transactional databases to keep search, analytics, fraud detection, and caches fresh without dual writes. The same binlog or WAL stream that feeds a replica also feeds a CDC pipeline into Kafka, which fans the changes out to dozens of downstream consumers.
Postgres and MySQL shops typically run a primary with semi-synchronous or asynchronous streaming replicas for read scaling and fast failover, then add cross-region replicas for disaster recovery. Globally distributed databases such as Spanner and CockroachDB push the idea further, using synchronous replication across regions with consensus so they can offer strong consistency worldwide, accepting the higher write latency that comes with coordinating across the planet. The pattern across all of them is the same: pick a replication mode for your durability needs, pick a topology for your write pattern, and pick a distribution strategy for your geography.