Fault Tolerance
A system's ability to keep operating correctly even when some of its components fail. Achieved through redundancy, replication, and graceful degradation.
What is Fault Tolerance?
A system's ability to keep operating correctly even when some of its components fail. Achieved through redundancy, replication, and graceful degradation.
Fault Tolerance is a foundational concept that sits in the Core Fundamentals area of system design. Engineers reach for it whenever they need to reason about real-world trade-offs in that space — not just for textbook correctness, but because real production systems at companies like Netflix, Amazon, and Google make these decisions every day.
If you want to go deeper than this definition — with diagrams, code, and a quiz to lock it in — work through the "Fault Tolerance" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how the giants actually use it in production.
Learn Fault Tolerance in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the Fault Tolerance lessonRelated lessons
Lessons that touch on Fault Tolerance as part of a larger topic.
Fault Tolerance
Design systems that continue operating when components fail, redundancy, isolation, and graceful degradation
advanced · reliability resilience
Retry Logic
When network calls fail, and they will, retries turn transient failures into invisible hiccups
intermediate · microservices architecture
Exponential Backoff
Instead of hammering a failing service, wait longer between each retry, giving the system time to recover
intermediate · microservices architecture
Graceful Degradation
When parts of your system fail, give users a reduced experience instead of a broken one
intermediate · microservices architecture
Circuit Breaker Pattern
Prevent cascading failures in distributed systems by failing fast when a downstream service is unhealthy
intermediate · microservices architecture
See also
Related glossary terms you might want to look up next.
Replication
Keeping copies of the same data on multiple servers. Improves read performance and provides fault tolerance if one server goes down.
Circuit Breaker
A pattern that stops calling a failing service after repeated failures, preventing cascade failures. Like an electrical circuit breaker that cuts power to prevent fires.
Chaos Engineering
Deliberately injecting failures into a system to test its resilience. Netflix's Chaos Monkey randomly kills servers to ensure the system survives.