Design YouTube: System Design Interview Guide
YouTube serves 1 billion hours of video per day across 2.5 billion users, with 500 hours of new content uploaded every minute.
Designing YouTube combines a giant video upload pipeline, multi-resolution encoding, CDN delivery, a massive recommendation system, and a comments and engagement layer. The hardest piece is the upload-to-playback pipeline: how a 4K video uploaded in Mumbai is playable in São Paulo within minutes.
Asked at: Google (YouTube), Meta, Amazon Prime Video, Netflix, Disney+, and TikTok. Often paired with Design Netflix to compare user-generated vs licensed content.
Why this question is asked
Design YouTube tests whether you understand asynchronous pipelines (upload, encode, package), CDN scaling for billions of streams, recommendation systems trained on watch behavior, and a comments system that has to handle moderation. The volume (500 hours uploaded per minute) forces you to make every step batchable or async.
Requirements
Always clarify these in the first 5 minutes of the interview. Do not start drawing boxes until both lists are agreed.
Functional requirements
- Users upload videos of any length and resolution
- Videos are transcoded to multiple resolutions for adaptive bitrate streaming
- Users browse, search, and watch videos
- Personalized recommendations on home and watch-next
- Users like, dislike, comment, and subscribe
- Live streaming as a separate but related feature
- Monetization (ads inserted dynamically)
Non-functional requirements
- Upload-to-playback under 10 minutes for HD, under 30 minutes for 4K
- Playback start latency under 2 seconds
- 99.99% availability for playback
- Global delivery with sub-second buffer fill from the nearest edge
- Cost-efficient storage tiering (cold videos move to cheaper tiers)
- DMCA takedown within hours of report
Back-of-envelope scale estimates
Show your math. Pulling numbers from thin air signals you have not thought about the load.
Total users
2.5B
Public Google reporting. Assume an average of 1.5 logged-in profiles per person, plus many anonymous viewers.
Hours watched per day
1B
Public reporting. That is roughly 40 minutes of watch time per day per active user, which implies about 1.5B daily active users.
Hours uploaded per minute
500
Public reporting. That is 720,000 hours of new content per day, every day.
Concurrent streams at peak
100M
1B hours per day divided by 24 gives roughly 40M streams running at any moment on average. A peak factor of 2.5x during global evening windows brings that to 100M.
Storage growth per year
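About 1.3 EB (before replication)
500 hours per minute is 720,000 hours per day. Times 1 GB per hour HD baseline, times 5x for multi-resolution encoding, times 365 days, that is about 1.3 EB per year. Reaching 10 to 30 EB would require heavier assumptions, such as larger 4K masters and multi-region replication.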
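The estimates in this section can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted above; the constant names are illustrative, and 1 EB is taken as 1e9 GB (decimal units).

```python
# Back-of-envelope check using only the figures quoted in this section.
HOURS_WATCHED_PER_DAY = 1_000_000_000
HOURS_UPLOADED_PER_MINUTE = 500
PEAK_FACTOR = 2.5
GB_PER_HOUR_HD = 1        # HD baseline from the text
VARIANT_MULTIPLIER = 5    # multi-resolution encoding overhead from the text

hours_uploaded_per_day = HOURS_UPLOADED_PER_MINUTE * 60 * 24
avg_concurrent_streams = HOURS_WATCHED_PER_DAY / 24
peak_concurrent_streams = avg_concurrent_streams * PEAK_FACTOR

storage_gb_per_year = (
    hours_uploaded_per_day * GB_PER_HOUR_HD * VARIANT_MULTIPLIER * 365
)
storage_eb_per_year = storage_gb_per_year / 1e9  # 1 EB = 1e9 GB (decimal)

print(f"uploaded per day: {hours_uploaded_per_day:,} hours")
print(f"average concurrency: {avg_concurrent_streams / 1e6:.1f}M streams")
print(f"peak concurrency: {peak_concurrent_streams / 1e6:.1f}M streams")
print(f"storage growth: {storage_eb_per_year:.2f} EB per year")
```

With these inputs the storage product comes to roughly 1.3 EB per year, so any larger figure depends on extra assumptions (bigger masters, replication) that are worth stating out loud in the interview.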
High-level architecture
Upload flow: The client uploads to an Upload Service over resumable HTTP (tus or YouTube's own protocol). The Upload Service writes the raw file to a regional object store and emits an UploadComplete event. An Encoding Pipeline picks up the event and runs a fleet of jobs that transcode the master into multiple resolutions and bitrates (240p to 4K), package the output into HLS and DASH manifests, and write each variant to the object store. Once at least the 360p variant is ready, the video is marked playable and indexed by Search. The CDN (Google's own edge network) pre-warms popular videos to edge POPs.
Playback: The client requests a manifest from the Playback API, which returns CDN URLs. The player then adapts its bitrate to the bandwidth it observes.
Recommendations: A separate offline pipeline (TensorFlow on TPUs) trains the ranking models, and a low-latency online layer serves the personalized rankings.
In a real interview, sketch this on the whiteboard before diving into any single box.
Core components
Walk through each service. The interviewer wants to hear what each one owns, not just the names.
Upload Service
Resumable upload endpoint. Handles flaky mobile connections with byte-range resume. Writes raw master to regional Google Cloud Storage. Emits an UploadComplete event on Pub/Sub.
Encoding Pipeline
Fleet of jobs that consume UploadComplete and produce 10 to 30 encoded variants per video. Uses MapReduce-style chunked encoding: split the master into 30-second segments, encode in parallel, concatenate. Modern variants use AV1 for bandwidth savings.
Video Metadata Service
Stores video metadata (title, description, channel, tags, upload time). Backed by Spanner or a sharded SQL system. Read-heavy; aggressively cached.
Playback API
Issues signed manifests on play. Includes DRM tokens for monetized content. Selects the nearest CDN POP based on client IP.
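One common way to issue a signed manifest is to attach an expiry and an HMAC to the URL so the edge can verify it without calling back to the API. The sketch below is an assumption about how such signing could work, not YouTube's actual scheme; the key, the URL layout, and the function names are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET_KEY = b"demo-secret"  # hypothetical; a real key would live in a secret manager


def make_signature(video_id, expires_at, key=SECRET_KEY):
    """HMAC-SHA256 over the video id and the expiry timestamp."""
    payload = f"{video_id}:{expires_at}".encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def sign_manifest_url(base_url, video_id, expires_at, key=SECRET_KEY):
    """Build a manifest URL carrying its expiry and signature as query params."""
    query = urlencode(
        {"expires": expires_at, "sig": make_signature(video_id, expires_at, key)}
    )
    return f"{base_url}/{video_id}/master.m3u8?{query}"


def is_valid(video_id, expires_at, signature, now, key=SECRET_KEY):
    """Edge-side check: reject expired links and mismatched signatures."""
    if now > expires_at:
        return False
    expected = make_signature(video_id, expires_at, key)
    return hmac.compare_digest(expected, signature)  # constant-time comparison
```

Because the signature covers the video id, a link issued for one video cannot be reused for another, and the expiry bounds how long a leaked link stays useful.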
Search Service
Indexes video metadata and transcripts (auto-generated by speech recognition). Backed by a custom search system. Returns ranked results based on relevance, freshness, and engagement.
Recommendation Service
Two-stage system. Candidate generation: a neural network produces a few hundred candidate videos per user. Ranking: a second model scores each candidate on predicted watch time, freshness, and diversity. An online layer adjusts the rankings using real-time signals such as the video just watched.
Comments and Engagement Service
Stores comments, likes, dislikes, subscriptions. Comments are eventually consistent; counters are maintained by a stream processor. Toxicity moderation runs on ingest.
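Counters maintained by a stream processor usually have to tolerate duplicate deliveries. The sketch below shows one way to keep like and dislike counts idempotent by remembering event ids; the event tuple shape and the function name are assumptions made for illustration.

```python
from collections import Counter


def apply_events(counters, seen_event_ids, events):
    """Fold engagement events into per-video counters, skipping duplicates.

    Each event is a hypothetical (event_id, video_id, kind) tuple, where kind
    is "like" or "dislike". Streams often deliver at least once, so the same
    event_id may arrive more than once.
    """
    for event_id, video_id, kind in events:
        if event_id in seen_event_ids:
            continue  # already counted
        seen_event_ids.add(event_id)
        counters[(video_id, kind)] += 1
    return counters


counters = apply_events(
    Counter(),
    set(),
    [
        ("e1", "v1", "like"),
        ("e2", "v1", "like"),
        ("e2", "v1", "like"),  # redelivered duplicate
        ("e3", "v1", "dislike"),
    ],
)
```

A real processor would bound the set of seen ids (for example by time window) rather than keep it forever.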
Data model
Pick the right store per table. Justify each choice with the access pattern, not by reflex.
videos
- video_id (PK)
- channel_id
- title
- description
- duration_seconds
- uploaded_at
- status (uploading, encoding, ready, removed)
Sharded by video_id hash. Status drives the playback gate: only ready videos are surfaced.
video_variants
- variant_id (PK)
- video_id (FK)
- codec (h264, hevc, av1)
- resolution
- bitrate_kbps
- manifest_url
- byte_size
One row per encoded variant. The Playback API picks the right rows based on client capabilities.
watch_events
- user_id (PK partition)
- video_id
- watched_at
- watch_seconds
- completed
Append-only event log. Backed by Bigtable. Used as input to recommendation training.
comments
- comment_id (PK)
- video_id (clustering)
- user_id
- text
- parent_comment_id
- created_at
Sharded by video_id so that comment threads for a video are co-located.
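The status gate on videos and the capability-based pick from video_variants can be sketched together. This is a minimal illustration: the field names follow the tables above, while the function name and the representation of resolution as vertical pixels are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    video_id: str
    codec: str         # "h264", "hevc", or "av1"
    resolution: int    # vertical pixels, e.g. 720 (assumed representation)
    bitrate_kbps: int
    manifest_url: str


def playable_variants(status, variants, supported_codecs, max_resolution):
    """Return variants the client can play, lowest quality first.

    Only videos whose status is "ready" pass the playback gate.
    """
    if status != "ready":
        return []
    matches = [
        v for v in variants
        if v.codec in supported_codecs and v.resolution <= max_resolution
    ]
    return sorted(matches, key=lambda v: (v.resolution, v.bitrate_kbps))


variants = [
    Variant("v1", "h264", 360, 800, "https://cdn.example.com/v1/h264_360.m3u8"),
    Variant("v1", "h264", 1080, 5000, "https://cdn.example.com/v1/h264_1080.m3u8"),
    Variant("v1", "av1", 1080, 3500, "https://cdn.example.com/v1/av1_1080.m3u8"),
    Variant("v1", "av1", 2160, 12000, "https://cdn.example.com/v1/av1_2160.m3u8"),
]
```

For a client that only decodes H.264 on a 1080p screen, `playable_variants("ready", variants, {"h264"}, 1080)` yields the 360p and 1080p H.264 rows, and any status other than ready yields nothing.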
Deep dives
These are the conversations the interviewer is steering you toward. Practice each one until you can talk through it without notes.
Upload pipeline and resumable transfers
Mobile uploads on flaky networks fail mid-transfer. Resumable uploads use byte-range PUT requests: the client uploads chunks, the server acknowledges each one, and on disconnect the client resumes from the last acked byte. Once the full file is uploaded, an UploadComplete event triggers encoding. The raw master is stored in Google Cloud Storage with regional redundancy. After encoding completes and the video is indexed, the master is moved to cold storage (Nearline or Coldline) because it is rarely needed again.
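The ack-and-resume loop described above can be sketched with an in-memory session. This is a simplified model under stated assumptions: the class and function names are hypothetical, the "server" is a local object rather than an HTTP endpoint, and the network drop is simulated with a parameter.

```python
class UploadSession:
    """Server-side state for one resumable upload (in-memory stand-in)."""

    def __init__(self, total_size):
        self.total_size = total_size
        self.buffer = bytearray()

    @property
    def acked_offset(self):
        return len(self.buffer)

    @property
    def complete(self):
        return self.acked_offset == self.total_size

    def put_chunk(self, offset, data):
        """Accept a chunk only if it starts at the last acked byte."""
        if offset != self.acked_offset:
            return self.acked_offset  # tell the client where to resume
        if offset + len(data) > self.total_size:
            raise ValueError("chunk exceeds declared upload size")
        self.buffer.extend(data)
        return self.acked_offset


def upload(session, payload, chunk_size, fail_after_chunks=None):
    """Client loop: always start from the server's last acked offset."""
    offset = session.acked_offset
    sent = 0
    while offset < len(payload):
        if fail_after_chunks is not None and sent == fail_after_chunks:
            raise ConnectionError("simulated network drop")
        offset = session.put_chunk(offset, payload[offset:offset + chunk_size])
        sent += 1
    return session.complete
```

A first call that drops after three 100-byte chunks leaves `acked_offset` at 300; calling `upload` again on the same session sends only the remaining bytes. In a real service, reaching `complete` is the point where the UploadComplete event would be emitted.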
Distributed encoding with chunked parallelism
A 2-hour video at 4K is ~30 GB. Encoding it serially takes hours. The fix is chunked parallel encoding: split the master into 30-second segments at I-frame boundaries (so each chunk is decodable independently). Distribute the chunks across an encoding fleet (CPU and GPU workers). Each worker encodes its chunk into the target codec and bitrate. Concatenate the encoded chunks. Per-title encoding optimization (popularized by Netflix but also used at YouTube) further tunes the bitrate ladder based on content complexity: a static talking-head video can use lower bitrates than an action scene.
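The split, parallel encode, and concatenate steps can be sketched as follows. The encoder call is a stub (a real worker would invoke an actual encoder), and the function names and the thread pool are assumptions chosen to keep the example self-contained; a production fleet would distribute chunks across machines.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_chunks(keyframe_times, duration, target_seconds=30.0):
    """Cut at the first keyframe at or after each target_seconds step.

    keyframe_times must be sorted ascending. Cutting only at keyframes keeps
    every chunk independently decodable.
    """
    boundaries = [0.0]
    for t in keyframe_times:
        if t - boundaries[-1] >= target_seconds and t < duration:
            boundaries.append(t)
    boundaries.append(duration)
    return list(zip(boundaries[:-1], boundaries[1:]))


def encode_chunk(chunk):
    """Stand-in for a real encoder invocation on one chunk."""
    start, end = chunk
    return f"encoded[{start:.0f}-{end:.0f}]"


def encode_video(keyframe_times, duration, workers=4):
    chunks = plan_chunks(keyframe_times, duration)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so concatenation stays correct
        # even when chunks finish out of order.
        return list(pool.map(encode_chunk, chunks))


# Keyframes every 2 seconds in a 100-second video.
segments = encode_video([float(t) for t in range(0, 100, 2)], 100.0)
```

The last chunk is shorter than the target because the video ends before the next 30-second boundary.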
Two-stage recommendation: candidate generation and ranking
Naively scoring every video against every user is impossible at YouTube scale (billions of videos times billions of users). The standard two-stage approach: first a candidate generator (a fast model, often a two-tower neural network) produces a few hundred candidates per user from the catalog. Then a ranker (a slower, deeper model) scores each candidate based on watch-time predictions and other features. The ranker output drives the order on the home page. Real-time signals (just-watched, search query) feed an online layer that re-ranks before serving.
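A toy version of the two stages makes the division of labor concrete. This is a sketch under heavy simplification: embeddings are tiny hand-written vectors, retrieval is brute force (a real system would use approximate nearest-neighbor search), and the scoring weights are invented for illustration.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec, video_vecs, k):
    """Stage 1: cheap dot-product retrieval over the whole catalog."""
    return heapq.nlargest(
        k, video_vecs, key=lambda vid: dot(user_vec, video_vecs[vid])
    )


def rank(candidates, features, just_watched):
    """Stage 2: costlier scoring, applied only to the small candidate set."""
    def score(vid):
        f = features[vid]
        s = f["predicted_watch_minutes"] + 0.5 * f["freshness"]  # made-up weights
        # Online signal: push videos the user just watched to the bottom.
        return s - 100 if vid in just_watched else s
    return sorted(candidates, key=score, reverse=True)


video_vecs = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.8, 0.3], "d": [0.5, 0.5]}
features = {
    "a": {"predicted_watch_minutes": 5, "freshness": 0},
    "c": {"predicted_watch_minutes": 4, "freshness": 4},
    "d": {"predicted_watch_minutes": 10, "freshness": 0},
}
candidates = generate_candidates([1.0, 0.0], video_vecs, k=3)
ordered = rank(candidates, features, just_watched={"d"})
```

Stage 1 never looks at the expensive features, and stage 2 never touches video "b", which is the whole point: the deep model only pays for the shortlist.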
Serving billions of streams with global CDN
YouTube uses Google's global edge network. Popular videos are pushed to edge POPs proactively based on regional popularity predictions. The Playback API selects the nearest POP using IP geolocation. The client uses adaptive bitrate (HLS or DASH) to switch resolutions based on observed bandwidth. For long-tail videos that are not at the edge, the request fans back to a regional cache or origin. Edge caches are sized to hold the top 1 to 5% of videos, which serves over 90% of requests by view count.
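On the client side, the adaptive bitrate decision reduces to picking the highest rung of the ladder that fits within measured throughput. The sketch below is a simple throughput-based rule, not the logic of any particular player; the safety factor and the example ladder are assumptions.

```python
def choose_bitrate(ladder_kbps, throughput_kbps, safety=0.8):
    """Pick the highest bitrate that fits within a fraction of throughput.

    Falls back to the lowest rung when even that exceeds the budget, so
    playback continues at reduced quality instead of stalling.
    """
    budget = throughput_kbps * safety
    affordable = [b for b in ladder_kbps if b <= budget]
    return max(affordable) if affordable else min(ladder_kbps)


LADDER_KBPS = [400, 1000, 2500, 5000, 16000]  # hypothetical 240p-to-4K ladder
```

With a measured 4,000 kbps, the 0.8 safety factor leaves a 3,200 kbps budget, so the player selects the 2,500 kbps rung. Real players typically also weigh buffer level, not just throughput.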
Trade-offs to discuss
Every senior interviewer expects you to surface at least 3 of these. Pick the decisions, state the alternatives, and justify your choice.
Encode every video to AV1 vs only popular videos
AV1 cuts bandwidth ~30% vs H.264 but encoding is 5 to 10x slower. Encoding everything to AV1 is too expensive. The compromise: encode all uploads to H.264 immediately for fast availability, then promote videos that cross a popularity threshold to also have AV1 variants.
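The popularity threshold can be reasoned about as a break-even point: AV1 pays off once the bandwidth saved across views exceeds the extra encoding cost. The sketch below shows that arithmetic; every number in the example is a placeholder chosen for illustration, not a real price.

```python
def av1_break_even_views(
    h264_gb_per_view, bandwidth_saving, egress_cost_per_gb, extra_encode_cost
):
    """Views needed before AV1's bandwidth savings cover its encoding cost."""
    saved_per_view = h264_gb_per_view * bandwidth_saving * egress_cost_per_gb
    return extra_encode_cost / saved_per_view


def should_promote(forecast_views, break_even_views):
    """Promote a video to AV1 only if it is expected to pass break-even."""
    return forecast_views > break_even_views


# Placeholder inputs: 0.5 GB per H.264 view, 30% saving (from the text),
# 0.02 currency units per GB of egress, 3 units of extra encode cost.
threshold = av1_break_even_views(0.5, 0.30, 0.02, 3.0)
```

With these placeholder inputs the break-even is about 1,000 views, which is why a long-tail video with a few dozen views never justifies the AV1 encode.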
Spanner vs sharded MySQL for video metadata
Spanner gives you global consistency without manual sharding, at higher per-row cost. Sharded MySQL is cheaper per row but requires shard management. Google chose Spanner. A startup would not.
Streaming chunked upload vs single-shot
Single-shot fails on flaky networks and wastes bandwidth when retried. Chunked uploads add complexity but recover gracefully and let the server start encoding before the upload finishes (pipelined). Chunked wins for any file over a few MB.
Two-stage recommendations vs one-shot ranking
One-shot ranking has to score every candidate, which is computationally infeasible at YouTube scale. Two-stage (candidate gen plus ranking) lets you spend most compute on a few hundred candidates per user. Almost every large recommendation system uses this pattern.
Eager vs lazy CDN warming
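Eager warming preloads popular videos to all POPs, which is wasteful for niche regional content. Lazy warming caches a video only after the first request misses, which wastes nothing but makes the first viewers in each region wait on a fetch from a regional cache or origin. YouTube uses predictive warming: based on past viewing patterns, predict which videos a region will want and warm those. Long-tail videos are cached on first miss.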
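Predictive warming combined with cache-on-first-miss can be modeled as an LRU cache per POP with a warm step. This is an illustrative sketch: the class name, the predicted-views input, and the LRU eviction policy are assumptions, not a description of Google's edge caches.

```python
from collections import OrderedDict


class EdgeCache:
    """One POP's cache: warmed from predictions, filled on miss otherwise."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used first

    def _put(self, video_id):
        self.items[video_id] = True
        self.items.move_to_end(video_id)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

    def warm(self, predicted_views, top_n):
        """Preload the top_n videos by predicted regional views."""
        ranked = sorted(predicted_views, key=predicted_views.get, reverse=True)
        for video_id in ranked[:top_n]:
            self._put(video_id)

    def get(self, video_id):
        if video_id in self.items:
            self.items.move_to_end(video_id)
            return "hit"
        self._put(video_id)  # long-tail path: cache on first miss
        return "miss"


cache = EdgeCache(capacity=3)
cache.warm({"a": 100, "b": 50, "c": 5}, top_n=2)
```

After warming, a request for "a" hits immediately, while the first request for the long-tail video "c" misses and the second one hits.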
How YouTube actually does it
YouTube runs on Google's infrastructure: Spanner for metadata, Bigtable for watch logs, Borg (the internal cluster manager that inspired Kubernetes) for compute, and Google Cloud Storage for raw video. The encoding pipeline uses a chunked MapReduce-style architecture. Recommendation models train on TPUs using TensorFlow. The Playback API integrates Google's DRM (Widevine). Search uses a custom inverted-index system tightly integrated with Google's broader search infrastructure.