Advanced · Reliability & Resilience

Postmortem

A blameless analysis conducted after an incident to document what happened, why, and how to prevent it from recurring. The most important output is the list of action items.

What is a postmortem?

In short

A postmortem is a written analysis produced after an incident that documents what happened, what the impact was, why it happened, and what concrete changes will stop it from happening again. The most valuable part is the list of tracked action items, and a good postmortem is blameless: it focuses on the systems and processes that failed rather than punishing the people who were on call.

What a postmortem actually is

A postmortem is a document teams write after something goes wrong in production: a site outage, data loss, a payment system that double-charged customers, a deploy that took the API down for 40 minutes. It captures the full story so the organization can learn from it instead of repeating it.

The word is borrowed from medicine, where it means an examination after death. In engineering it became popular through Google's Site Reliability Engineering practice and Etsy's writing on the topic. The point is to understand the cause, not to assign blame. The phrase you will hear constantly is blameless postmortem, which means the document assumes everyone acted with good intentions given the information they had at the time.

A typical postmortem includes a short summary, a timeline of events with timestamps, the customer and business impact, the root cause and contributing factors, what went well and what went poorly during the response, and a list of action items with owners and due dates. That last part matters most. A postmortem with a great narrative but no tracked follow-ups is just a story.
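The sections listed above map naturally onto a small record type. The following is a minimal sketch, not a standard schema: the class names, field names, and the ticket field are all hypothetical and chosen only to mirror the sections named in the text.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class TimelineEntry:
    at: datetime   # when the event happened
    event: str     # what happened


@dataclass
class ActionItem:
    description: str
    owner: str     # a named person, not a team alias
    due: date
    ticket: str    # hypothetical issue-tracker ID
    done: bool = False


@dataclass
class Postmortem:
    summary: str
    impact: str
    root_cause: str
    contributing_factors: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    timeline: list[TimelineEntry] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def is_just_a_story(self) -> bool:
        """True when the document has no tracked follow-ups at all."""
        return not self.action_items
```

A review tool could refuse to mark a postmortem as complete while is_just_a_story() returns True, which turns the "action items matter most" rule into a check rather than a convention.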

How the process works under the hood

The process starts the moment an incident is declared. During the incident, responders post timestamped updates in a chat channel or incident tool. Those messages become the raw material for the timeline, which is why teams insist on writing things down as they happen rather than reconstructing them from memory later.
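To illustrate why timestamped updates are so useful, here is a minimal sketch that turns exported chat messages into a sorted timeline. It assumes each message is available as an (ISO 8601 timestamp, author, text) tuple; that export format and the function name build_timeline are assumptions for illustration, not the behaviour of any particular chat or incident tool.

```python
from datetime import datetime


def build_timeline(messages):
    """Sort (iso_timestamp, author, text) tuples into readable timeline lines."""
    parsed = [
        (datetime.fromisoformat(ts), author, text)
        for ts, author, text in messages
    ]
    # Chat exports are not always in order, so sort by timestamp.
    parsed.sort(key=lambda m: m[0])
    return [f"{ts:%H:%M} {author}: {text}" for ts, author, text in parsed]
```

Because the timestamps come from the messages themselves, the timeline does not depend on anyone's memory of the order of events.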

After the incident is resolved, an owner is assigned, usually the incident commander or a senior engineer who was close to the response. They gather logs, metrics, deploy records, and chat transcripts, then write the timeline and dig for the actual cause. A useful technique is the Five Whys: keep asking why until you reach a systemic issue rather than stopping at the surface symptom. For example, the disk did not fill up because someone forgot to clean it; it filled up because there was no alert at 80 percent and no automatic log rotation.

The team then holds a review meeting to discuss the draft, challenge the conclusions, and agree on action items. Each action item gets an owner and a deadline and is filed as a real ticket in the issue tracker. Many companies trigger a mandatory postmortem automatically whenever an incident crosses a severity threshold, for example any SEV1 or SEV2.

When to write one and the trade-offs

You do not write a postmortem for every blip. Most teams set a trigger: any incident above a certain severity, any event that breached an SLO, any customer-visible outage, or any near miss that could easily have been catastrophic. Near misses are worth writing up precisely because you got lucky and the next time you might not.
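The trigger described above can be written down as a small policy function. This is a sketch under stated assumptions: the Incident fields are hypothetical, severity is modelled as an integer where 1 is the most severe, and the default threshold of 2 simply mirrors the "any SEV1 or SEV2" example rather than a universal rule.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                  # 1 is the most severe (SEV1)
    slo_breached: bool = False
    customer_visible: bool = False
    near_miss: bool = False        # could easily have been catastrophic


def needs_postmortem(incident, max_severity=2):
    """Return True if any of the agreed triggers applies."""
    return (
        incident.severity <= max_severity
        or incident.slo_breached
        or incident.customer_visible
        or incident.near_miss
    )
```

Keeping the rule in one place makes the threshold an explicit, reviewable decision instead of something each responder judges differently.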

The main trade-off is cost. A thorough postmortem takes hours to write and a meeting to review, so writing one for trivial issues wastes time and trains people to treat the format as bureaucracy. The opposite failure is skipping them for real incidents, which means the same outage recurs in three months because nobody fixed the underlying gap.

The other risk is the blame trap. If a postmortem turns into finding who to fire, engineers stop reporting incidents honestly and start hiding mistakes, which makes systems less safe. This is why blamelessness is a hard rule and not a nicety. The hardest discipline of all is closing action items. It is easy to generate twelve follow-ups in the meeting and let them rot in the backlog, so mature teams track action item completion rate as a metric and review stale items regularly.
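Tracking action item completion can be as simple as the sketch below. It assumes each action item is a dictionary with a boolean "done" key and a "due" date; that shape and both function names are hypothetical, and a real team would read the same information from its issue tracker.

```python
from datetime import date


def completion_rate(items):
    """Fraction of action items marked done, or None if there are none."""
    if not items:
        return None
    return sum(1 for item in items if item["done"]) / len(items)


def stale_items(items, today):
    """Open action items whose due date has already passed."""
    return [item for item in items if not item["done"] and item["due"] < today]
```

Reviewing the output of stale_items in a regular meeting is one way to keep follow-ups from rotting in the backlog.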

A concrete example

Suppose a checkout service goes down for 25 minutes during a Friday evening sale. Customers see errors, and roughly 4,000 orders fail. The on-call engineer pages a teammate, they trace it to a database connection pool that hit its limit of 100 connections, and they recover by restarting the service and raising the pool size.

The postmortem timeline shows the deploy that introduced a slow query at 18:02, the first alert at 18:14, the page at 18:16, and recovery at 18:39. The root cause is not the engineer who shipped the query. It is that the new query was not load tested, there was no alert on connection pool saturation, and the pool limit was set years ago and never revisited.
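The timestamps in this timeline are enough to compute the intervals a postmortem usually reports. The sketch below uses the four times from the example; the helper name and the dictionary keys are illustrative, and it assumes all events fall on the same day.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


timeline = {
    "deploy": "18:02",
    "first_alert": "18:14",
    "page": "18:16",
    "recovery": "18:39",
}

time_to_detect = minutes_between(timeline["deploy"], timeline["first_alert"])     # 12
time_to_page = minutes_between(timeline["first_alert"], timeline["page"])         # 2
time_to_recover = minutes_between(timeline["first_alert"], timeline["recovery"])  # 25
```

The 12 minutes between the deploy and the first alert is the gap that a connection pool saturation alert is meant to shrink.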

The action items become: add a connection pool saturation alert at 80 percent, add load testing to the deploy pipeline for queries touching the orders table, and document a runbook for pool exhaustion. Each gets an owner and a due date within two weeks. Three months later, when a similar slow query ships, the alert fires early and the runbook makes recovery a five-minute task instead of a 25-minute outage. That difference is the entire return on writing the document.
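The first action item, an alert at 80 percent pool saturation, reduces to a single comparison. This is a minimal sketch of the check only: the function name and parameters are hypothetical, and a real deployment would express the same rule in whatever monitoring system already collects the pool metrics.

```python
def should_alert(in_use, pool_limit, threshold_percent=80):
    """True when the connections in use reach the saturation threshold."""
    if pool_limit <= 0:
        raise ValueError("pool_limit must be positive")
    # Integer arithmetic avoids floating-point edge cases at the boundary.
    return in_use * 100 >= pool_limit * threshold_percent
```

With the pool limit of 100 from the example, the alert fires at 80 connections in use, well before the pool is exhausted.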

Where it is used in production

Google SRE

Popularized the blameless postmortem; their SRE book devotes a full chapter to it and mandates one for incidents above a severity threshold.

Etsy

Their engineering team championed blameless postmortems publicly and built the open-source Morgue tool to standardize writing them.

Amazon Web Services

Publishes detailed public postmortems after major outages, such as the 2017 S3 outage write-up, which traced the cause to an incorrectly entered command input while an engineer was debugging the S3 billing system.

PagerDuty

Builds incident response tooling that auto-generates postmortem timelines from incident chat and alert data, and publishes an open postmortem playbook.

Frequently asked questions

What does blameless actually mean in a postmortem?

It means the analysis assumes everyone made reasonable decisions based on the information and tools they had at the time, so it focuses on fixing systems and processes rather than punishing individuals. The goal is honest reporting; if people fear blame, they hide mistakes and the same failures keep recurring.

What is the difference between root cause and contributing factors?

The root cause is the deepest systemic reason the incident was possible, found by asking why repeatedly. Contributing factors are conditions that made it worse or harder to catch, like a missing alert or a stale runbook. Real incidents usually have several contributing factors, not a single tidy cause.

Should you write a postmortem for every incident?

No. Most teams trigger one only for incidents above a set severity, SLO breaches, customer-visible outages, or significant near misses. Writing them for trivial blips wastes time and makes people treat the format as bureaucracy, while skipping them for serious incidents guarantees repeats.

What is the most important part of a postmortem?

The list of action items with named owners and due dates, filed as real tickets and tracked to completion. A well-written narrative with no follow-up changes nothing; the value comes from the fixes that prevent the next occurrence.

Who writes the postmortem?

Usually the incident commander or a senior engineer who was close to the response. They gather logs, metrics, and chat transcripts, draft the timeline and analysis, then bring it to a review meeting where the whole team challenges the conclusions and agrees on the action items.

Lessons that touch on Postmortem as part of a larger topic.

Downtime Tracking
Recording, measuring, and learning from every minute your service is unavailable
intermediate · observability monitoring
