Design BookMyShow: System Design Interview Guide
In the 2024 Coldplay India on-sale, roughly 1.3 crore (13 million) people fought for ~174,000 tickets that sold out in about 90 minutes — peaking near 5,000 booking attempts per second against a seat map where each seat can be sold exactly once.
A seat-level ticketing system where the hard part isn't scale, it's correctness under contention: 50,000 people clicking the same seat at the same millisecond, and exactly one of them must win. The design centers on a temporary seat-hold lock (Redis, ~10 min TTL, acquired with an atomic Lua compare-and-set), a virtual waiting room that gates the herd before it ever touches the booking service, and a payment saga that keeps the seat held until money clears — then either confirms the booking or releases the seat. Reads (seat maps, show listings) are massively cached and scale horizontally; writes (the actual booking) are funneled through a narrow, strongly-consistent path.
Asked at: A staple at Indian product companies and FAANG-India loops (Flipkart, Razorpay, Swiggy, Atlassian, Microsoft IDC, Amazon India, PhonePe) because it forces the candidate to reason about strict inventory concurrency, temporary locks with expiry, and payment-booking atomicity rather than just CRUD + caching. It's the canonical "no double-booking under a thundering herd" problem.
Why this question is asked
Most "design X" questions reward breadth — caching, sharding, CDNs. BookMyShow rewards depth on one nasty property: a single seat is a unit of inventory that must be sold exactly once, even when tens of thousands of users select it concurrently and a payment step (which can take 30-120s and fail) sits in the middle of the transaction. It probes whether you understand distributed locks, lock TTLs and their failure modes, optimistic vs pessimistic concurrency, idempotency, saga/compensation, and how to shed load with a waiting room before correctness even comes into play. The read path is easy; the interviewer is watching how you protect the write path.
Requirements
Always clarify these in the first 5 minutes of the interview. Do not start drawing boxes until both lists are agreed.
Functional requirements
- Browse movies/events, cities, cinemas, and showtimes; view a live seat map with per-seat status (available / held / booked) and price tier
- Select one or more seats and place a temporary hold so other users cannot grab them while the current user pays
- Complete payment within the hold window; on success the seats become permanently booked and a ticket/QR is issued
- Automatically release held seats if the user abandons or the hold expires, returning them to the available pool
- Guarantee a seat is never sold to two users (no double-booking) even under extreme concurrency
- Handle on-sale / blockbuster surges (IPL, Coldplay, big releases) without the system collapsing — graceful queueing instead of errors
- Support cancellations/refunds per the event's policy and return inventory if allowed
- Send booking confirmation (email/SMS/in-app) with ticket details
Non-functional requirements
- Strong consistency on the seat-inventory write path — correctness is non-negotiable; a double-booking is a real-money, real-reputation failure
- High availability and graceful degradation on the read path: if booking is overloaded, browsing and seat maps should still work
- Low latency for seat-map reads (target < 100-200ms) so the UI feels live; seat-status reads dominate traffic ~450:1 over writes
- Elastic capacity: handle 50-100x normal load during a hot on-sale (normal load is a few thousand RPS; peak is 100k-300k RPS of mixed traffic)
- Bounded, fair admission during surges via a virtual waiting room, rather than an uncontrolled first-come rush that crashes the servers
- Idempotency end-to-end: a retried request or a duplicated payment webhook must never create a second booking or a second charge
- Durability: a confirmed booking + successful charge must survive crashes; partial failures must be reconcilable
Back-of-envelope scale estimates
Show your math. Pulling numbers from thin air signals you have not thought about the load.
Peak booking attempts/sec (single hot on-sale)
~5,000 writes/sec attempted
Grounded in the Coldplay India on-sale: ~1.3 crore users, ~174,000 tickets over a ~90-minute window, reported around 4,800-4,900 booking attempts/sec at peak. This is the number your write path must survive — most attempts will lose the race for a seat.
Read:write ratio
~450:1
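At a peak of ~4,800 booking attempts/sec and an estimated ~3 reads per attempt (seat map refresh, availability checks, price lookups), the read plane sees ~14,400 reads/sec. Successful writes average only ~32/sec (174,000 tickets over ~5,400 seconds), which gives roughly 450:1. Seat-status reads are the dominant load and are what you cache aggressively; confirmed bookings are a thin trickle by comparison.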
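The ratio can be reproduced with a few lines of arithmetic. This is a sketch that uses only the estimates quoted above; the variable names are illustrative, and the 3-reads-per-attempt figure is an assumption rather than a measurement.

```python
# Back-of-envelope check of the ~450:1 read:write ratio.
# Every input is an estimate quoted in this guide, not a measured value.
TICKETS_SOLD = 174_000          # tickets in the on-sale
SELLOUT_SECONDS = 90 * 60       # ~90-minute sellout window
PEAK_ATTEMPTS_PER_SEC = 4_800   # reported peak booking attempts/sec
READS_PER_ATTEMPT = 3           # assumed: seat map, availability, price

# Average rate of writes that actually end in a sold ticket.
successful_writes_per_sec = TICKETS_SOLD / SELLOUT_SECONDS
# Read load generated by booking attempts at peak.
reads_per_sec = PEAK_ATTEMPTS_PER_SEC * READS_PER_ATTEMPT
read_write_ratio = reads_per_sec / successful_writes_per_sec

print(f"successful writes/sec: {successful_writes_per_sec:.1f}")  # 32.2
print(f"reads/sec at peak:     {reads_per_sec}")                  # 14400
print(f"read:write ratio:      {read_write_ratio:.0f}:1")         # 447:1
```

Note that the ratio compares peak reads against average successful writes, which is why it comes out so lopsided.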
Concurrent users in the waiting room
10+ million queued, dispatched ~2,000-2,500/sec
With 1.3 crore users and a fixed inventory, the waiting room holds the herd and admits a steady, controlled trickle to the booking service. Dispatch rate is tuned to what the booking + payment backend can actually absorb, not to demand.
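A quick sanity check on the dispatch rate: at 2,000-2,500 admissions/sec, draining a 13-million-user queue takes roughly as long as the ~90-minute on-sale itself. The function name below is hypothetical and the inputs are the estimates quoted above.

```python
def drain_minutes(queued_users: int, dispatch_per_sec: int) -> float:
    """Minutes needed to admit everyone in the queue at a fixed rate."""
    return queued_users / dispatch_per_sec / 60


QUEUED_USERS = 13_000_000  # ~1.3 crore, as quoted in this guide

for rate in (2_000, 2_500):
    print(f"{rate}/sec -> {drain_minutes(QUEUED_USERS, rate):.0f} min")
# 2000/sec -> 108 min
# 2500/sec -> 87 min
```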
Seat-hold TTL
~5-10 minutes
Long enough to complete a real payment (UPI/card/netbanking can take 30-120s plus user think-time), short enough that abandoned holds don't sterilize inventory during a fast on-sale. Industry write-ups commonly cite a 10-minute hold.
Inventory footprint
Tiny per show, huge in aggregate
One multiplex screen is ~150-300 seats; a single show's seat state is a few KB and fits comfortably in memory/Redis. The challenge is not data volume but the number of concurrent shows (tens of thousands live) and write contention on the hot ones.
Peak bandwidth
hundreds of MB/sec
Coldplay-scale events report ~700+ MB/sec and multiple TB transferred over the event. Most of this is seat-map and asset reads served from CDN/cache, not the booking path.
High-level architecture
The system splits hard into a read plane and a write plane, because they have opposite requirements. The read plane (movie/event catalog, city/cinema/show listings, and the seat map) carries the overwhelming majority of the traffic (roughly 450 reads per write) and tolerates being slightly stale. It's served from a CDN for static assets and a cache (Redis/ElastiCache) for seat maps and show metadata, fronted by stateless API servers behind a load balancer. This plane scales horizontally with no coordination.
The write plane is the entire interview. It's narrow on purpose. Before a user can even attempt a booking on a hot show, they pass through a virtual waiting room (BookMyShow has used Queue-it and Cloudflare Waiting Room): the edge issues a signed token granting a queue position, and the room admits users to the live booking service at a controlled rate of a few thousand per second, regardless of how many millions are waiting. This is load-shedding: it converts an uncontrollable thundering herd into a bounded, predictable stream the backend can actually handle. Without it, the 2015 IPL-final collapse (500k+ simultaneous users) repeats.
Once admitted, seat selection acquires a temporary hold in Redis. The hold is a key like seat:{showId}:{seatId} set with an atomic compare-and-set (a Lua script: "set this key to my userId only if it doesn't exist") and a TTL of ~10 minutes. Redis is the source of truth for "who currently holds this seat" precisely because that operation is atomic and sub-millisecond. If the SET succeeds, the user owns the seat; if it fails, someone else got it and the UI immediately reflects that. Because it's a single atomic op, two users clicking the same seat in the same millisecond can never both win.
Holding a seat is not booking it. Payment sits in the middle, and it's slow and failure-prone, so the booking is modeled as a saga (an orchestrated state machine): create order → reserve inventory (the Redis hold) → initiate payment → on payment success, persist the booking durably to Postgres/Aurora and mark the seat permanently sold; on payment failure, timeout, or abandonment, run the compensating action, which releases the Redis hold so the seat returns to the pool. The hold's TTL is the safety net: even if a server crashes mid-saga, the lock auto-expires and inventory is never permanently lost. Payment confirmation arrives via webhook, which is deduplicated with an idempotency key so a retried or duplicated webhook can't create a second booking or release something it shouldn't. Only a confirmed booking marks a seat as sold in the durable relational store, where a unique constraint on (show_id, seat_id) is the final, absolute backstop against double-booking.
In a real interview, sketch this on the whiteboard before diving into any single box.
Core components
Walk through each service. The interviewer wants to hear what each one owns, not just the names.
Virtual Waiting Room (edge admission control)
Sits in front of the booking service for hot on-sales. Issues signed JWT-style tokens with a queue position and expiry, and admits users to the live system at a fixed, tunable rate (e.g., ~2,000-2,500/sec) no matter how many millions are queued. This is pure load-shedding: it protects every downstream component by converting an unbounded herd into a bounded stream. BookMyShow has used Queue-it and Cloudflare Waiting Room for this.
Catalog & Search service (read plane)
Serves movies, events, cities, cinemas, and showtimes. Read-heavy, cache-friendly, eventually-consistent. Backed by a search index for discovery and a cache for hot listings. Scales horizontally with stateless replicas; has no role in correctness.
Seat-Map service
Returns the live seat layout and status for a given show. Reads the authoritative hold state from Redis and merges it with the booked state from the durable store, then caches the rendered map briefly. This is the highest-QPS component during an on-sale; it must be fast (<100-200ms) and is allowed to be a second or two stale for browsing — the real check happens at hold time.
Seat-Hold / Lock service (Redis)
The heart of concurrency control. Acquires per-seat holds via an atomic Lua compare-and-set with a ~10-minute TTL (key: seat:{showId}:{seatId} -> userId). Uses a quorum/Redlock setup across multiple Redis nodes for hot events to survive a node failure without split-brain. Auto-expiry means abandoned holds self-heal, with no orphaned locks freezing inventory.
Booking Orchestrator (saga state machine)
Drives create-order → reserve → pay → confirm/compensate. On success, writes the booking durably and marks seats sold; on any failure or timeout, emits ReleaseInventory to drop the Redis holds and cancel the order. Decouples the slow payment step from the fast lock step and makes partial failures recoverable instead of corrupting inventory.
Payment service + webhook handler
Integrates UPI/cards/netbanking via a PSP (Razorpay/PayU-style). Payment is asynchronous: the user is redirected, and the result arrives via webhook. The handler is strictly idempotent — each webhook is keyed by event/payment id (stored in Redis/DB for a few hours) so duplicate or retried callbacks are no-ops. This prevents double-charges and double-bookings from PSP retries.
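The idempotent handler described above can be sketched as follows. This is a minimal illustration, not any PSP's actual API: it uses an in-memory SQLite database as a stand-in for the Redis/DB dedupe store, and the table and function names (processed_webhooks, confirmations, handle_webhook) are hypothetical. The key idea is that recording the event id and applying its effect happen in one transaction, so a redelivered event hits the primary-key constraint and becomes a no-op.

```python
import sqlite3


def open_webhook_db() -> sqlite3.Connection:
    """In-memory stand-in for the dedupe store and the booking store."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_webhooks (event_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE confirmations (booking_id TEXT NOT NULL)")
    return db


def handle_webhook(db, event_id: str, booking_id: str, status: str) -> str:
    """Apply a PSP callback at most once per event_id."""
    try:
        with db:  # one transaction: dedupe record and side effect together
            db.execute(
                "INSERT INTO processed_webhooks VALUES (?)", (event_id,)
            )
            if status == "success":
                db.execute(
                    "INSERT INTO confirmations VALUES (?)", (booking_id,)
                )
    except sqlite3.IntegrityError:
        # event_id was already recorded: a retry or duplicate delivery.
        return "duplicate"
    return "processed"
```

Calling handle_webhook three times with the same event_id returns "processed" once and "duplicate" twice, and leaves exactly one row in confirmations.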
Durable booking store (Postgres/Aurora)
The system of record for confirmed bookings, payments, and seat ownership. A UNIQUE constraint on (show_id, seat_id) among active bookings is the absolute, last-line guarantee against double-booking — even if every layer above it has a bug, the database rejects the second insert. Sharded by show/event for the largest catalogs; reads can use replicas.
Notification service
Sends booking confirmations and tickets/QR codes over email/SMS/in-app after a booking is confirmed. Off the critical path — fired from a queue so a slow SMS provider can never delay or block the booking commit.
Message bus (RabbitMQ/Kafka)
Carries saga commands/events (ReserveInventory, InitiatePayment, ReleaseInventory, ConfirmOrder) and notification jobs with at-least-once delivery and retries. At-least-once is the reason every consumer must be idempotent.
Data model
Pick the right store per table. Justify each choice with the access pattern, not by reflex.
shows
- show_id (PK)
- event_id / movie_id
- cinema_id / venue_id
- screen_id
- start_time
- city
- status
One row per screening/event instance. The unit everything else hangs off. Hot rows during an on-sale are a tiny subset of all live shows; route those to dedicated capacity.
seats
- seat_id (PK)
- screen_id
- row
- number
- seat_type / price_tier
Physical seat definitions per screen, largely static. The live booked/held status is NOT primarily stored here: booked status lives in the bookings table (durable) and held status lives in Redis (ephemeral). Keeping volatile status out of this table avoids hammering it with writes.
seat_holds (Redis, not a SQL table)
- key: seat:{show_id}:{seat_id}
- value: user_id / session_id
- TTL: ~10 minutes
Ephemeral source of truth for 'currently held'. Set/checked atomically via Lua. Auto-expires so abandoned holds free themselves. Never the system of record for a confirmed sale, only for the temporary reservation window.
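bookings
- booking_id (PK, UUID)
- user_id
- show_id
- seat_ids[]
- status (reserved/confirmed/cancelled)
- amount
- version
- created_at
Durable system of record. The hard guarantee: a UNIQUE / exclusion constraint ensuring no two CONFIRMED bookings share the same (show_id, seat_id). Because seat_ids[] is an array, enforcing this per seat in practice needs one row per booked seat (for example a child table keyed by show_id and seat_id) or an equivalent exclusion constraint. A version column supports optimistic concurrency when transitioning reserved -> confirmed.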
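One way to express the "no two confirmed bookings share a seat" guarantee is a partial unique index over a per-seat table. The sketch below uses SQLite so it runs anywhere; the booked_seats table, the index name, and book_seat are hypothetical, and a Postgres deployment would need its own DDL (Postgres also supports partial unique indexes).

```python
import sqlite3

# Hypothetical per-seat table: one row per (booking, seat).
SCHEMA = """
CREATE TABLE booked_seats (
    booking_id TEXT NOT NULL,
    show_id    TEXT NOT NULL,
    seat_id    TEXT NOT NULL,
    status     TEXT NOT NULL  -- 'confirmed' or 'cancelled'
);
-- Uniqueness applies only to confirmed rows, so a cancelled seat can be resold.
CREATE UNIQUE INDEX one_confirmed_booking_per_seat
    ON booked_seats (show_id, seat_id)
    WHERE status = 'confirmed';
"""


def book_seat(db, booking_id: str, show_id: str, seat_id: str) -> bool:
    """Insert a confirmed seat row; False if the seat is already sold."""
    try:
        with db:  # commit on success, roll back on constraint violation
            db.execute(
                "INSERT INTO booked_seats VALUES (?, ?, ?, 'confirmed')",
                (booking_id, show_id, seat_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The second insert for the same (show_id, seat_id) is rejected by the database itself, regardless of what the layers above it did.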
payments
- payment_id (PK)
- booking_id (FK)
- psp_reference
- status (initiated/success/failed)
- idempotency_key
- amount
- webhook_event_ids[]
Tracks the money. The idempotency_key and recorded webhook event ids let the handler safely ignore duplicate/retried PSP callbacks. Payment status drives the saga's confirm-vs-compensate decision.
users
- user_id (PK)
- phone
- email
- city
Standard. Mostly read; not on the contention-critical path. Phone is the primary identity in the Indian market (OTP login).
Deep dives
These are the conversations the interviewer is steering you toward. Practice each one until you can talk through it without notes.
Preventing double-booking: the atomic compare-and-set, not a read-then-write
The naive design (read 'is seat A1 free?', then write 'book A1') has a race: two requests both read 'free', both write, both succeed. The fix is to make the claim a single atomic operation. In Redis: SET seat:{show}:{A1} userId NX PX 600000, or equivalently a Lua script that does GET-then-SET-if-absent in one indivisible step. Exactly one of N concurrent claimants gets the OK; everyone else instantly sees 'taken'. This is why Redis (single-threaded command execution, atomic Lua) is used as the live hold authority instead of a read-then-write against a DB row. The durable database then adds a second, independent guarantee: a UNIQUE constraint on (show_id, seat_id) among active bookings. Belt and suspenders: the DB makes a double-booking physically impossible to persist even if the Redis layer were bypassed or buggy. State this two-layer model explicitly; it's what separates a strong answer from a hand-wave.
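To make the "exactly one winner" property concrete, here is a small in-process simulation. It is not a Redis client: InMemoryHoldStore is a hypothetical stand-in that mimics the semantics of SET key value NX PX by doing the check and the write under one lock, and race_for_seat is an illustrative helper that fires concurrent claims at it.

```python
import threading
import time


class InMemoryHoldStore:
    """Stand-in for Redis `SET key value NX PX ttl` (assumption, not Redis)."""

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._holds = {}  # key -> (owner, expires_at)
        self._clock = clock

    def try_hold(self, show_id, seat_id, user_id, ttl_seconds=600.0):
        key = f"seat:{show_id}:{seat_id}"
        with self._lock:  # check and write happen as one indivisible step
            now = self._clock()
            current = self._holds.get(key)
            if current is not None and current[1] > now:
                return False  # someone holds an unexpired claim
            self._holds[key] = (user_id, now + ttl_seconds)
            return True

    def holder(self, show_id, seat_id):
        key = f"seat:{show_id}:{seat_id}"
        with self._lock:
            current = self._holds.get(key)
            if current is None or current[1] <= self._clock():
                return None
            return current[0]


def race_for_seat(store, show_id, seat_id, user_ids):
    """Let every user claim the same seat concurrently; return the winners."""
    winners = []
    winners_lock = threading.Lock()

    def claim(user_id):
        if store.try_hold(show_id, seat_id, user_id):
            with winners_lock:
                winners.append(user_id)

    threads = [threading.Thread(target=claim, args=(u,)) for u in user_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

However many users are passed to race_for_seat, the returned list has exactly one entry, and that user is the one reported by holder until the TTL elapses.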
Pessimistic DB locks (SELECT ... FOR UPDATE) vs. the Redis-hold approach
You could lock at the database: SELECT * FROM seats WHERE seat_id=? FOR UPDATE inside a transaction. It's correct, but it serializes contention on the database connection and holds a DB transaction open for the entire user think-time + payment (tens of seconds). During an on-sale that turns the database into the bottleneck and exhausts the connection pool. The Redis-hold approach moves the contention to an in-memory store built for it, keeps DB transactions short (only the final confirm write), and gives you free auto-expiry via TTL — a FOR UPDATE lock has no natural timeout tied to user behavior. The trade is that Redis is now a critical dependency and you must handle its failure modes (covered next). Optimistic concurrency (a version column checked on the reserved->confirmed transition) is the lightweight DB-side complement, since by confirm time the contention is already resolved by the hold.
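The optimistic-concurrency complement mentioned above can be sketched as a conditional UPDATE: the transition only applies if the row is still in the expected state and version. SQLite is used here purely so the example is runnable; the trimmed-down bookings table and the confirm_booking helper are illustrative assumptions.

```python
import sqlite3


def open_bookings_db() -> sqlite3.Connection:
    """In-memory table with only the columns this sketch needs."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE bookings ("
        " booking_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " version INTEGER NOT NULL)"
    )
    return db


def confirm_booking(db, booking_id: str, expected_version: int) -> bool:
    """Move reserved -> confirmed only if nobody changed the row meanwhile."""
    cur = db.execute(
        "UPDATE bookings SET status = 'confirmed', version = version + 1 "
        "WHERE booking_id = ? AND status = 'reserved' AND version = ?",
        (booking_id, expected_version),
    )
    db.commit()
    # rowcount is 0 when the version or status no longer matches.
    return cur.rowcount == 1
```

The DB transaction here lasts only for the single UPDATE, in contrast to a FOR UPDATE lock held across user think-time and payment.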
Lock TTL: the abandoned-cart problem and the failure modes of expiry
A hold needs a TTL because users abandon and servers crash; without expiry, one rage-quit could sterilize a seat forever. ~10 minutes is the usual choice — enough for a real UPI/card flow plus think-time, short enough to recycle inventory fast in a 90-minute sellout. But TTL introduces its own danger: what if the payment succeeds at minute 10:30, after the hold expired and someone else grabbed the seat? You must not confirm a booking whose hold is gone. Defenses: (1) make the hold window comfortably longer than the PSP timeout; (2) at confirm time, re-validate ownership atomically and treat 'hold lost' as a failure path that refunds the late payment rather than double-booking; (3) the DB UNIQUE constraint catches it regardless — the second confirm insert fails, and that booking is auto-refunded. This 'late payment after lock expiry' edge case is exactly what strong interviewers push on.
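Defense (2) boils down to a decision made when the payment result arrives: confirm only if the paying user still owns an unexpired hold, otherwise refund. The sketch below shows just that decision with hypothetical names (Hold, decide_on_payment_success) and a plain dict in place of Redis. In a real system the ownership check and the confirm write would have to be atomic, and the DB constraint remains the backstop.

```python
from dataclasses import dataclass


@dataclass
class Hold:
    owner: str
    expires_at: float  # seconds, on the same clock as payment_time


def decide_on_payment_success(holds, seat_key, user_id, payment_time):
    """Return 'confirm' if the payer still holds the seat, else 'refund'."""
    hold = holds.get(seat_key)
    still_owned = (
        hold is not None
        and hold.owner == user_id
        and hold.expires_at > payment_time
    )
    return "confirm" if still_owned else "refund"


# A 10-minute hold taken at t=0 expires at t=600 seconds.
holds = {"seat:s1:A1": Hold(owner="alice", expires_at=600.0)}
print(decide_on_payment_success(holds, "seat:s1:A1", "alice", 120.0))  # confirm
print(decide_on_payment_success(holds, "seat:s1:A1", "alice", 630.0))  # refund
```

The second call models the "payment succeeds at minute 10:30" case: the money arrived, but the hold is gone, so the only safe outcome is a refund.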
The virtual waiting room: solving the herd before correctness even matters
With 1.3 crore users hitting at t=0 and 174k tickets, no amount of clever locking saves you if the herd reaches your servers: connection pools, Redis, and load balancers all melt (the 2015 IPL-final lesson). The waiting room is admission control at the edge: every user gets a signed token with a queue position; the room releases users into the live booking path at a rate the backend can absorb (a few thousand/sec), independent of total demand. Critically, it runs at the edge before any business logic, so at any moment the vast majority of the herd is parked in the queue rather than touching your database. It also improves fairness (FIFO-ish) and UX (a clear 'you're 800,000th in line, ~6 min' beats a spinner then a 503). The key interview insight: scaling the booking service to absorb the full herd is the wrong goal; shedding the herd to match a fixed, sellable inventory is the right one.
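A signed queue token and a rate-based admission check can be sketched with the standard library alone. This is an illustration of the idea, not how Queue-it or Cloudflare Waiting Room work internally: the token format, the SECRET value, and all function names are assumptions.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret-not-for-production"  # assumption: shared edge secret


def issue_token(user_id: str, position: int, expires_at: float) -> str:
    """Encode queue claims and append an HMAC-SHA256 signature."""
    payload = json.dumps(
        {"uid": user_id, "pos": position, "exp": expires_at}, sort_keys=True
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(sig).decode()
    )


def verify_token(token: str, now: float):
    """Return the claims if the signature is valid and unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong shape or bad base64
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    return claims if claims["exp"] > now else None


def is_admitted(claims, dispatch_per_sec: int, elapsed_seconds: float) -> bool:
    """Admit positions in order at a fixed rate since the sale opened."""
    return claims["pos"] <= int(dispatch_per_sec * elapsed_seconds)


def estimated_wait_seconds(position: int, dispatch_per_sec: int) -> float:
    """Rough wait shown to the user for their queue position."""
    return position / dispatch_per_sec
```

Because the position is inside the signed payload, a client cannot move itself forward in the queue without invalidating the signature, and the admission check needs no per-user state beyond the token itself.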
Payment + booking atomicity via saga and compensation
Booking spans a fast local step (acquire hold) and a slow external step (payment) that can fail, time out, or return asynchronously via webhook — you can't wrap that in one ACID transaction. Model it as a saga: ReserveInventory -> InitiatePayment -> (on success) ConfirmBooking, with a compensating ReleaseInventory if payment fails, times out, or the user abandons. The orchestrator owns this state machine and is durable, so a crash mid-flow resumes. Two correctness pillars: the hold's TTL means even a lost orchestrator can't strand inventory, and idempotency means retried steps don't duplicate work. The webhook handler dedupes on payment/event id (cached a few hours) so a PSP that retries its callback three times still produces exactly one confirmed booking and zero extra charges. Mention reconciliation: a periodic job compares PSP records vs. bookings to catch any 'charged but not booked' (refund) or 'booked but not charged' (alert) drift.
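The saga flow above can be sketched as a small state machine with explicit compensation. The State enum and run_saga are hypothetical, the five callables are placeholders for the real services, and the sketch runs synchronously in memory, whereas the orchestrator described above persists its state and reacts to asynchronous webhooks.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    RESERVED = "reserved"
    PAYMENT_PENDING = "payment_pending"
    CONFIRMED = "confirmed"
    RELEASED = "released"


def run_saga(reserve, pay, confirm, release, refund):
    """Drive reserve -> pay -> confirm, compensating on any failure.

    reserve, pay and confirm return True on success; release and refund
    are the compensating actions. Returns the list of states visited.
    """
    trail = [State.CREATED]
    if not reserve():
        return trail  # nothing was held, so there is nothing to compensate
    trail.append(State.RESERVED)
    trail.append(State.PAYMENT_PENDING)
    paid = pay()
    if paid and confirm():
        trail.append(State.CONFIRMED)
        return trail
    release()  # compensation: return the seat to the pool
    if paid:
        refund()  # money was taken but the booking could not be confirmed
    trail.append(State.RELEASED)
    return trail
```

The paid-but-not-confirmed branch is the "charged but not booked" case: the seat is released and the charge is refunded rather than left for the reconciliation job to discover.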
Read-path scaling and live seat-map updates
Seat-status reads are ~450x the writes, so the read path must scale independently and cheaply. Serve catalog/static assets from CDN; cache show metadata and rendered seat maps in Redis with a short TTL. The seat map shown for browsing can be a second or two stale — the authoritative check happens atomically at hold time, so a user who clicks an already-taken seat just gets a clean 'taken, pick another'. For the live feel during a hot show, push deltas to clients (WebSocket/SSE) as seats flip held/booked, rather than having millions poll. This keeps the expensive, strongly-consistent path tiny (only actual claims) while the cheap, eventually-consistent path carries the bulk of traffic. Don't try to make the browse view perfectly consistent — that's where naive designs waste their consistency budget.
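The short-TTL seat-map cache is a plain cache-aside pattern. The sketch below keeps it in process with an injectable clock so the staleness window is easy to see; SeatMapCache and its loader argument are illustrative, and a real deployment would keep these entries in Redis as described above.

```python
import time


class SeatMapCache:
    """Cache-aside with a short TTL; the loader is the expensive read."""

    def __init__(self, loader, ttl_seconds=2.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # show_id -> (expires_at, seat_map)

    def get(self, show_id):
        now = self._clock()
        entry = self._entries.get(show_id)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh enough for browsing
        # Miss or expired: rebuild from the authoritative stores and cache it.
        seat_map = self._loader(show_id)
        self._entries[show_id] = (now + self._ttl, seat_map)
        return seat_map
```

With a 2-second TTL, any number of browse requests for a hot show within that window cost one backend read; the map may be up to 2 seconds stale, which is acceptable because the hold step is the real arbiter.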
Trade-offs to discuss
Every senior interviewer expects you to surface at least 3 of these. Pick the decisions, state the alternatives, and justify your choice.
Redis hold as live source of truth + DB UNIQUE constraint as backstop (vs. DB-only locking)
Redis gives sub-millisecond atomic claims and free TTL-based auto-expiry, keeping DB transactions short and the database off the contention hot path. The cost is a hard dependency on Redis and the need to handle its failure modes (quorum/Redlock for hot events). The DB UNIQUE constraint makes a double-booking impossible to persist regardless, so Redis being the fast path doesn't mean it's the only guarantee.
Virtual waiting room (shed the herd) instead of autoscaling the booking service to meet demand
Inventory is fixed and small; demand is unbounded. Scaling compute to absorb 13M concurrent users is wasteful and still risks melting stateful components (Redis, DB, pools). Admission control at the edge matches throughput to what's sellable and protects everything downstream. Trade-off: added edge dependency and a queue UX, plus tuning the dispatch rate.
~10-minute hold TTL (vs. shorter or no expiry)
Long enough for real Indian payment flows (UPI redirect, OTP, netbanking) plus think-time; short enough to recycle seats fast during a sellout. Too short frustrates legitimate payers; too long sterilizes inventory and lets abandoners lock seats. The cost is the 'payment-after-expiry' edge case, which you handle with re-validation + auto-refund.
Saga with compensation (vs. trying for a single distributed ACID transaction across booking + payment)
Payment is external, slow, and asynchronous (webhooks) — it cannot sit inside one ACID transaction. A saga gives recoverability and clear compensation (release the hold) at the price of more moving parts and the need for end-to-end idempotency and a reconciliation job.
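The reconciliation job that backs up the saga is, at its core, a set difference between what the PSP says was paid and what the booking store says was confirmed. A minimal sketch, with a hypothetical function name and plain booking-id lists standing in for the two data sources:

```python
def reconcile(psp_paid_booking_ids, confirmed_booking_ids):
    """Compare PSP successes with confirmed bookings and report drift."""
    paid = set(psp_paid_booking_ids)
    confirmed = set(confirmed_booking_ids)
    return {
        # Charged but not booked: the customer is owed a refund.
        "refund": sorted(paid - confirmed),
        # Booked but not charged: needs an alert and manual review.
        "alert": sorted(confirmed - paid),
    }


print(reconcile(["b1", "b2", "b3"], ["b2", "b3", "b4"]))
# {'refund': ['b1'], 'alert': ['b4']}
```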
Eventually-consistent browse seat map (vs. strongly-consistent everywhere)
Spending the consistency budget only where a seat is actually claimed lets the 450:1 read traffic be cached and cheap. The downside is a brief window where the displayed map is stale; this is acceptable because the atomic hold at claim time is the real arbiter, and a clean 'seat just taken' retry is fine UX.
Idempotency keys everywhere on the write path (extra complexity)
At-least-once message delivery and PSP webhook retries make duplicates inevitable. Dedicating an idempotency key per booking attempt and per payment event prevents double-charges and double-bookings. The cost is extra storage and discipline, but it's non-negotiable when real money is involved.
How BookMyShow actually does it
BookMyShow's current design is, in large part, a reaction to a public failure: during the 2015 IPL final on-sale, roughly 500,000 users hit the system simultaneously and it collapsed. The lessons from that (distributed Redis-based seat locking and a virtual queue for high-demand events) became the template. The 2024 Coldplay India on-sale is the modern stress test that gets quoted in interviews: about 1.3 crore (13 million) users competing for ~174,000 tickets across three days, sold out in roughly 90 minutes, peaking near 4,800 booking attempts/sec, with the herd gated by a virtual waiting room (Queue-it / Cloudflare Waiting Room). Public engineering write-ups consistently describe the same shape: Redis as the live hold store with an atomic Lua compare-and-set and a ~10-minute TTL, a saga/compensation flow around an asynchronous payment step with idempotent webhook handling, eventually-consistent cached seat maps for the read-heavy browse path, and a durable relational store (Postgres/Aurora) as the final system of record with a uniqueness guarantee. Treat the specific instance counts and per-second cost figures in third-party blogs as informed estimates, not official numbers. The architectural pattern (waiting room + Redis hold + saga + DB backstop), however, is well-corroborated and is what interviewers expect you to converge on.
Sources
- Inside BookMyShow's Architecture: How It Prevents Double Booking During Flash Sales
- BookMyShow Seat Selection Architecture: Distributed Locks, Payment Sagas & Zero Double-Booking at Scale
- Handling Coldplay's Ticket Frenzy on BookMyShow: A Back-of-the-Envelope Calculation
- How BookMyShow Leveraged Technology to Manage Coldplay Concert Ticket Sales
- Design BookMyShow - A System Design Interview Question (GeeksforGeeks)