Disaster Recovery
A plan and set of procedures for restoring systems after a catastrophic failure. Defined by RPO (how much data you can lose) and RTO (how long you can be down).
What is Disaster Recovery?
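A plan and set of procedures for restoring systems after a catastrophic failure. It is defined by two targets: the recovery point objective (RPO), the maximum amount of data you can afford to lose, and the recovery time objective (RTO), the maximum time you can afford to be down.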
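To make RPO and RTO concrete, here is a minimal sketch that checks one incident (or recovery drill) against both targets. The names `RecoveryObjectives` and `evaluate_recovery`, the 15-minute and 1-hour targets, and the timestamps are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable downtime


def evaluate_recovery(objectives, last_good_copy, failure_time, service_restored):
    """Compare one incident (real or drilled) against the objectives."""
    # Data written after the last good copy and before the failure is lost.
    data_loss_window = failure_time - last_good_copy
    # Downtime runs from the failure until service is restored.
    downtime = service_restored - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }


# Hypothetical drill: targets and timestamps are illustrative only.
objectives = RecoveryObjectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
result = evaluate_recovery(
    objectives,
    last_good_copy=datetime(2024, 1, 1, 11, 50),
    failure_time=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 13, 30),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this made-up drill the last good copy is 10 minutes old, so the RPO is met, but restoring service takes 90 minutes, so the RTO is missed. The two targets are independent: frequent backups help the first, while rehearsed failover procedures help the second.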
Disaster Recovery is an advanced concept in the Reliability & Resilience area of system design. Engineers reach for it when they need to reason about real-world trade-offs in that space, such as how much data loss and downtime the business can tolerate compared with the cost of keeping standby capacity ready. Production systems at companies like Netflix, Amazon, and Google face these decisions every day.
If you want to go deeper than this definition — with diagrams, code, and a quiz to lock it in — work through the "Disaster Recovery" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how the giants actually use it in production.
Learn Disaster Recovery in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the Disaster Recovery lesson
Related lessons
Lessons that touch on Disaster Recovery as part of a larger topic.
Cross-Region Replication
Replicate data across cloud regions for disaster recovery and global read performance
intermediate · data replication distribution
Failover Routing
Automatically redirect traffic to a standby backend when the primary fails
foundation · load balancing proxies
Global Server Load Balancing (GSLB)
Distribute traffic across geographically dispersed data centers using DNS and health-aware routing
foundation · load balancing proxies
Log Shipping
Transfer complete log files from primary to standby, the oldest and simplest replication method
intermediate · data replication distribution
Active-Passive Configuration
One node serves traffic while the other stands by, the most common HA pattern
intermediate · data replication distribution
See also
Related glossary terms you might want to look up next.
Availability
The percentage of time a system is operational and accessible. Measured in 'nines' — 99.99% availability means about 52 minutes of downtime per year.
Redundancy
Duplicating critical components or functions so that if one fails, a backup takes over. The reason planes have two engines and databases have replicas.
Replication
Keeping copies of the same data on multiple servers. Improves read performance and provides fault tolerance if one server goes down.
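The "nines" arithmetic in the Availability entry above can be checked with a short calculation. This is a sketch that assumes a 365-day year; the function name `downtime_minutes_per_year` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Return the yearly downtime allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")
```

For 99.99% this gives roughly 52.6 minutes per year, in line with the figure quoted in the entry. Each extra nine cuts the allowed downtime by a factor of ten.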