What is the difference between RPO and RTO?

RPO is how much data you can lose, measured in time before the disaster. An RPO of 15 minutes means recovery may lose up to 15 minutes of writes. RTO is how long you can be down, measured forward from the disaster. An RTO of 1 hour means service must be restored within an hour. RPO is about your backup or replication frequency; RTO is about how fast your failover process runs.

Is disaster recovery the same as high availability?

No. High availability handles small, expected failures like one server crashing, usually within the same datacenter or region, and keeps running with no human action. Disaster recovery handles large, rare failures like losing an entire region or corrupting your whole database, and often involves a deliberate failover to a separate location. Most serious systems need both.

Is a backup enough for disaster recovery?

A backup is necessary but not sufficient. A backup you have never restored is unproven, and a backup stored in the same region as production can be lost in the same disaster. Real DR requires backups in a separate location, a tested restore procedure, and a known RTO. Teams should run restore drills on a schedule, not just trust that backups exist.

What are the main disaster recovery strategies?

Four tiers, from cheapest and slowest to most expensive and fastest: backup and restore (rebuild from backups, RTO in hours to days), pilot light (keep a minimal core running, scale up on failover), warm standby (a smaller always-on copy of the full stack), and active-active (full production in multiple regions at once, near-instant failover).

How often should you test a disaster recovery plan?

At least quarterly for critical systems, and after any major architecture change. Testing means actually failing over to the standby and measuring whether you hit your RTO and RPO, not just reading the runbook. Companies like Netflix run regular game days that deliberately take down infrastructure to keep recovery muscle memory sharp.

Advanced | Reliability & Resilience

Disaster Recovery

A plan and set of procedures for restoring systems after a catastrophic failure. Defined by RPO (how much data you can lose) and RTO (how long you can be down).

What is Disaster Recovery?

In short

Disaster recovery is the set of plans, tools, and procedures for restoring your systems and data after a catastrophic failure like a region outage, ransomware attack, or accidental mass deletion. It is measured by two numbers: RPO, the maximum amount of data you can afford to lose (often expressed in minutes or hours), and RTO, the maximum time you can be down before recovery completes.

What disaster recovery actually means

Disaster recovery, often shortened to DR, is what you do when an entire system goes down hard and normal failover is not enough. A single crashed server is handled by redundancy and health checks. A disaster is bigger: an AWS region becomes unreachable, a database gets corrupted and replicates the corruption everywhere, a fire takes out a datacenter, or someone runs DROP TABLE on production. DR is the deliberate plan to bring service back from a known-good copy.

Two numbers define every DR plan. RPO, the Recovery Point Objective, is how far back in time you are willing to lose data. An RPO of 15 minutes means your last backup or replica can be at most 15 minutes behind, so a disaster loses at most 15 minutes of writes. RTO, the Recovery Time Objective, is how long the business can tolerate being down. An RTO of 1 hour means service must be fully restored within an hour of the incident.
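The two definitions above reduce to simple timestamp arithmetic. The following is a minimal sketch, not a prescribed implementation: the `Objectives` and `Incident` types, their field names, and the example timestamps are all assumptions made for illustration. It measures the achieved data loss backward from the disaster and the achieved downtime forward from it, then compares both with the targets used in the text (RPO of 15 minutes, RTO of 1 hour).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rpo: timedelta  # maximum tolerated data loss, expressed as time
    rto: timedelta  # maximum tolerated downtime


@dataclass(frozen=True)
class Incident:
    last_recoverable_write: datetime  # newest write present in the copy restored from
    disaster_at: datetime             # moment the primary was lost
    restored_at: datetime             # moment service was fully back


def data_loss(incident: Incident) -> timedelta:
    """Achieved recovery point: measured backward from the disaster."""
    return incident.disaster_at - incident.last_recoverable_write


def downtime(incident: Incident) -> timedelta:
    """Achieved recovery time: measured forward from the disaster."""
    return incident.restored_at - incident.disaster_at


def meets(objectives: Objectives, incident: Incident) -> bool:
    """True only when both the RPO and the RTO were respected."""
    return (
        data_loss(incident) <= objectives.rpo
        and downtime(incident) <= objectives.rto
    )


# Targets from the text; the incident timestamps are made up for illustration.
targets = Objectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
disaster = datetime(2024, 1, 1, 9, 0)
incident = Incident(
    last_recoverable_write=disaster - timedelta(minutes=12),
    disaster_at=disaster,
    restored_at=disaster + timedelta(minutes=45),
)
print(data_loss(incident), downtime(incident), meets(targets, incident))
```

The same incident would fail a stricter plan: with an RPO of 5 minutes, the 12 minutes of lost writes exceed the objective even though the downtime is still acceptable.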

RPO and RTO are business decisions before they are technical ones. A bank's payment ledger might demand an RPO of zero and an RTO of minutes. An internal analytics dashboard might happily accept an RPO of 24 hours and an RTO of a day. The cost of the DR setup scales steeply with how close to zero you push both numbers.

How it works under the hood

The foundation of DR is a copy of your data that lives somewhere the disaster cannot reach, usually a different geographic region or cloud account. How fresh that copy is determines your RPO. Daily snapshots give an RPO of up to 24 hours. Continuous replication of the database transaction log gives an RPO of seconds. Tools like PostgreSQL streaming replication, AWS RDS cross-region read replicas, and database log shipping all exist to shrink RPO.
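To make the link between copy freshness and RPO concrete, here is a small sketch that filters copy mechanisms by a target RPO. The 24-hour figure for daily snapshots comes from the text; the hourly snapshot entry and the 5-second replication lag are illustrative assumptions, and the function name `options_within` is hypothetical. The sketch also ignores the time a snapshot takes to complete and copy to the other region, which would add to the worst case.

```python
from datetime import timedelta
from typing import Dict, List

# Worst-case staleness of each copy. A disaster just before the next
# snapshot loses almost a full interval of writes.
WORST_CASE_STALENESS: Dict[str, timedelta] = {
    "daily snapshot": timedelta(hours=24),
    "hourly snapshot": timedelta(hours=1),     # assumed for illustration
    "log replication": timedelta(seconds=5),   # assumed replication lag
}


def options_within(target_rpo: timedelta) -> List[str]:
    """Return the copy mechanisms whose worst-case staleness fits the RPO."""
    return [
        name
        for name, staleness in WORST_CASE_STALENESS.items()
        if staleness <= target_rpo
    ]


print(options_within(timedelta(minutes=15)))
print(options_within(timedelta(hours=24)))
```

With a 15-minute target only continuous log replication qualifies under these assumed numbers, which is why tight RPOs tend to push teams away from snapshot-only designs.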

RTO is determined by how much you have pre-built. The industry groups DR strategies into four tiers. Backup and restore is the cheapest: you keep backups in another region and rebuild everything during the disaster, giving an RTO of hours to days. Pilot light keeps a minimal core running, like a replicated database with the application servers switched off, and you scale up on failover. Warm standby runs a smaller but always-on copy of the full stack that you scale to full size. Active-active runs full production in two or more regions at once, so failover is close to instant.

Failover itself is usually a DNS or load balancer change that redirects traffic to the healthy region, combined with promoting the standby database to primary. The hard parts are detecting the disaster correctly, avoiding split-brain where two regions both think they are primary, and making sure the failover path has actually been tested.
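The two hard parts named above, correct detection and avoiding split-brain, can be sketched as guards in front of the promotion step. This is one possible shape under stated assumptions, not a standard API: the quorum rule, the probe names, and the three callbacks (`fence_old_primary`, `promote_standby`, `redirect_traffic`) are all hypothetical hooks that a real system would implement against its own database and DNS or load balancer.

```python
from typing import Callable, Mapping


def region_is_down(probes: Mapping[str, bool], quorum: int) -> bool:
    """probes maps a vantage point to True when it could reach the region."""
    unreachable = sum(1 for reachable in probes.values() if not reachable)
    return unreachable >= quorum


def fail_over(
    probes: Mapping[str, bool],
    quorum: int,
    fence_old_primary: Callable[[], bool],
    promote_standby: Callable[[], None],
    redirect_traffic: Callable[[], None],
) -> str:
    # One failing probe may be a local network problem, so require agreement.
    if not region_is_down(probes, quorum):
        return "no-op"
    # Promote only once the old primary is guaranteed to reject writes;
    # otherwise both regions could accept writes (split-brain).
    if not fence_old_primary():
        return "aborted"
    promote_standby()
    redirect_traffic()
    return "failed over"


# Stub callbacks that only record what would have happened.
actions = []
outcome = fail_over(
    probes={"probe-eu": False, "probe-us": False, "probe-ap": True},
    quorum=2,
    fence_old_primary=lambda: True,
    promote_standby=lambda: actions.append("promote"),
    redirect_traffic=lambda: actions.append("redirect"),
)
print(outcome, actions)
```

Ordering matters in this sketch: the standby is promoted before traffic is redirected, so requests never arrive at a database that is still read-only.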

When to use which strategy and the trade-offs

Match the strategy to the RPO and RTO the business will pay for. Backup and restore is right for non-critical systems where a few hours of downtime is acceptable and you want to minimize standing cost. Pilot light and warm standby suit important systems that need recovery in minutes but cannot justify running two full production stacks. Active-active is for systems where downtime directly costs revenue or trust, like payments, ad serving, or healthcare records.
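Matching a strategy to the required RTO can be expressed as picking the cheapest tier that is fast enough. The sketch below is only illustrative: the text gives rough ranges (hours to days, minutes, near-instant), so the specific `typical_rto` values and the 1 to 4 cost ranking are assumptions, and a real decision would also weigh RPO and engineering effort.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    typical_rto: timedelta  # assumed figure, for illustration only
    relative_cost: int      # 1 = cheapest standing cost


TIERS: List[Tier] = [
    Tier("backup and restore", timedelta(hours=24), 1),
    Tier("pilot light", timedelta(minutes=30), 2),
    Tier("warm standby", timedelta(minutes=10), 3),
    Tier("active-active", timedelta(seconds=30), 4),
]


def cheapest_tier(required_rto: timedelta) -> Optional[Tier]:
    """Return the lowest-cost tier whose typical RTO meets the requirement."""
    fast_enough = [tier for tier in TIERS if tier.typical_rto <= required_rto]
    if not fast_enough:
        return None  # no tier in this table is fast enough
    return min(fast_enough, key=lambda tier: tier.relative_cost)


for required in (timedelta(days=2), timedelta(hours=1), timedelta(minutes=15)):
    choice = cheapest_tier(required)
    print(required, "->", choice.name if choice else "none")
```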

The central trade-off is cost versus recovery speed. Active-active can roughly double your infrastructure bill and is genuinely hard to build because the application must tolerate writes in multiple regions and resolve conflicts. Backup and restore is cheap but leaves you exposed to long outages and the risk that a backup you never tested fails to restore.

The most common failure is a DR plan that exists only on paper. A backup that has never been restored is not a backup, it is a hope. Mature teams run regular DR drills, sometimes called game days, where they deliberately fail over to the standby region and measure whether they actually hit their RTO and RPO. Netflix popularized this discipline with Chaos Monkey, which terminates individual production instances, and later with region-evacuation exercises.

A concrete real-world example

Imagine a payments company running its primary database in AWS us-east-1 with an RPO of 5 minutes and an RTO of 15 minutes. They keep a cross-region read replica in us-west-2 that lags the primary by 1 to 2 seconds, well inside their 5-minute RPO. Application servers run in both regions, but us-west-2 runs a smaller fleet that normally serves no traffic, which keeps standing cost down. This is a warm standby setup.

One morning us-east-1 has a network partition and the primary becomes unreachable. Automated monitoring confirms the region is down, not just one host. The on-call engineer triggers the runbook: promote the us-west-2 replica to primary, scale up the us-west-2 application fleet, and update Route 53 to point the domain at us-west-2. Traffic shifts and the company is back online in 11 minutes, having lost about 2 seconds of in-flight transactions.
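The runbook in this example is an ordered list of steps whose total duration has to fit inside the RTO. A minimal sketch of that idea follows. The `run_runbook` helper is hypothetical, and the three step bodies are stand-ins that only record a message; in a real runbook they would call the database, autoscaling, and DNS tooling the team actually uses.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_runbook(
    steps: List[Step],
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, float]]:
    """Run steps in order and return (step name, seconds taken) pairs.

    An exception raised by a step propagates and stops the run, so later
    steps never execute against a half-finished failover.
    """
    timings = []
    for name, action in steps:
        started = clock()
        action()
        timings.append((name, clock() - started))
    return timings


# Stand-in actions: each only records that the step ran.
log: List[str] = []
steps: List[Step] = [
    ("promote us-west-2 replica to primary", lambda: log.append("promoted")),
    ("scale up us-west-2 application fleet", lambda: log.append("scaled")),
    ("point DNS at us-west-2", lambda: log.append("dns updated")),
]

timings = run_runbook(steps)
total_seconds = sum(seconds for _, seconds in timings)
rto_seconds = 15 * 60  # the 15-minute RTO from the example
print(log, total_seconds <= rto_seconds)
```

Recording per-step timings during drills shows which step consumes most of the RTO budget, which is usually where automation pays off first.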

When us-east-1 recovers, they do not simply switch back. They rebuild us-east-1 as the new standby, verify it is replicating from the now-primary us-west-2, and only then plan a controlled failback during a quiet window. Rushing the failback is how teams turn one disaster into two.

Where it is used in production

Amazon Web Services

Provides the building blocks for most DR plans: cross-region S3 replication, RDS cross-region replicas, EBS snapshots, and Route 53 health-check failover.

Netflix

Runs active-active across multiple AWS regions and regularly evacuates an entire region in production drills to prove it can fail over without customer impact.

PostgreSQL

Streaming replication and write-ahead-log shipping let teams keep a hot standby seconds behind the primary, pushing RPO close to zero.

Cloudflare

Its global anycast network and load balancing route traffic away from failed datacenters automatically, acting as the traffic-steering layer of many DR designs.


