SRE
Google's discipline for running reliable production systems. Applies software engineering to operations: automation over toil, SLOs over uptime promises, and error budgets for velocity.
What is SRE?
SRE (Site Reliability Engineering) is an advanced concept in the Reliability & Resilience area of system design. Engineers reach for it when they need to reason about real-world trade-offs between reliability and release velocity, the kind of decisions production teams at companies like Netflix, Amazon, and Google make every day.
See also
Related glossary terms you might want to look up next.
SLO
Service Level Objective: a target value for an SLI, like '99.9% of requests under 300ms.' The internal engineering goal that drives reliability investment.
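As a rough illustration of the example target above, the check below computes a latency SLI from a list of request durations and compares it to the SLO. The function names, the 300 ms threshold default, and the sample data are assumptions made for this sketch, not part of any particular monitoring tool.

```python
def latency_sli(latencies_ms, threshold_ms=300):
    """Return the fraction of requests that finished under the threshold (the SLI)."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, target=0.999, threshold_ms=300):
    """True when the measured SLI is at or above the SLO target."""
    return latency_sli(latencies_ms, threshold_ms) >= target


# Hypothetical sample: 999 fast requests and 1 slow one out of 1000.
sample = [100] * 999 + [500]
print(meets_slo(sample))
```

In practice the SLI would come from a metrics system over a rolling window rather than an in-memory list; the comparison against the target is the same.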
Error Budget
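The allowed amount of unreliability derived from an SLO. If your SLO is 99.9% uptime, your error budget is the remaining 0.1%, which is about 43 minutes of downtime in a 30-day month. Once the budget is exhausted, the usual policy is to freeze deployments until reliability recovers.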
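The arithmetic behind the 43-minute figure can be sketched in a few lines. This is a minimal example that assumes a 30-day window and a simple "freeze when the budget is used up" policy; the function names are hypothetical.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed downtime in the window for a given uptime SLO."""
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1 - slo_target) * total_minutes


def should_freeze_deploys(slo_target, downtime_minutes, window_days=30):
    """True once observed downtime has consumed the whole error budget."""
    remaining = error_budget_minutes(slo_target, window_days) - downtime_minutes
    return remaining <= 0


budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes")            # roughly 43.2 for a 99.9% SLO
print(should_freeze_deploys(0.999, 50))   # 50 minutes down exceeds the budget
```

Tightening the SLO shrinks the budget quickly: under the same assumptions, 99.99% leaves only about 4.3 minutes per 30 days.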
Postmortem
A blameless analysis conducted after an incident to document what happened, why, and how to prevent it from recurring. The most important output is the list of action items.