Toil
Manual, repetitive, automatable operational work that scales linearly with service size. SRE teams aim to keep toil below 50% of their time and spend the remainder on engineering work, such as automation, that reduces future toil.
What is Toil?
Toil is an advanced concept in the Reliability & Resilience area of system design. Engineers reach for it when they need to reason about real-world trade-offs in that space, not just textbook correctness. Teams running production systems at companies like Netflix, Amazon, and Google make these decisions every day.
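The 50% guideline and the linear-scaling property from the definition come down to simple arithmetic. The sketch below is one possible way to track them, not a standard tool: the `WorkLog` record, the function names, and the sample hours are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

TOIL_BUDGET = 0.50  # the 50% ceiling from the definition


@dataclass
class WorkLog:
    """Hours one engineer logged in a period (hypothetical record shape)."""
    toil_hours: float         # manual, repetitive operational work
    engineering_hours: float  # project work, including automation


def toil_fraction(logs: list[WorkLog]) -> float:
    """Return the share of all logged hours that was toil (0.0 if no hours)."""
    toil = sum(log.toil_hours for log in logs)
    total = toil + sum(log.engineering_hours for log in logs)
    return toil / total if total else 0.0


def over_budget(logs: list[WorkLog], budget: float = TOIL_BUDGET) -> bool:
    """True when the team's toil share exceeds the budget."""
    return toil_fraction(logs) > budget


def projected_toil_hours(hours_per_service: float, service_count: int) -> float:
    """Linear scaling: un-automated toil grows in step with service count."""
    return hours_per_service * service_count


if __name__ == "__main__":
    # Made-up sample data for a two-person team over one week.
    team = [WorkLog(toil_hours=24, engineering_hours=16),
            WorkLog(toil_hours=10, engineering_hours=30)]
    print(f"toil share: {toil_fraction(team):.1%}")
    print(f"over budget: {over_budget(team)}")
    # Doubling the fleet doubles the manual work if nothing is automated.
    print(projected_toil_hours(0.5, 40), projected_toil_hours(0.5, 80))
```

Measuring the share across the whole team rather than per person is a design choice made here for brevity. A per-engineer check would also catch the case where one person absorbs most of the toil while the team average stays under the ceiling.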
If you want to go deeper than this definition — with diagrams, code, and a quiz to lock it in — work through the "Toil" lesson linked below. It walks through the why, the mechanism, the trade-offs, and how the giants actually use it in production.
Learn Toil in depth
Full interactive lesson with diagrams, code examples, real-world references, and a quiz.
Open the Toil lesson
Related lessons
Lessons that touch on Toil as part of a larger topic.
See also
Related glossary terms you might want to look up next.
SRE
Google's discipline for running reliable production systems. Applies software engineering to operations: automation over toil, SLOs over uptime promises, and error budgets for velocity.
CI/CD
Continuous Integration and Continuous Deployment: automating the process of testing and deploying code. Push code, tests run, and it ships to production automatically.
Infrastructure as Code
Managing servers, networks, and cloud resources through declarative configuration files instead of manual setup. Terraform, Pulumi, and CloudFormation are IaC tools.