Primary Key
A column (or combination of columns) that uniquely identifies each row in a database table. Must be unique and not null.
303+ system design terms explained in plain English.
Attribute-Based Access Control: grants permissions based on attributes (user department, resource type, time of day, IP range). More flexible but more complex than RBAC.
Four guarantees for database transactions: Atomicity (all or nothing), Consistency (valid states only), Isolation (no interference), Durability (changes persist).
Automatically notifying engineers when metrics cross predefined thresholds. Good alerts are actionable, not noisy. PagerDuty and Opsgenie route alerts to the right on-call person.
An agentless automation tool for configuration management, application deployment, and orchestration. Uses YAML playbooks and connects over SSH.
A background process that compares data between replicas and fixes differences. Uses Merkle trees to efficiently identify which data ranges are out of sync.
A distributed stream processing framework that handles both real-time streams and batch data with exactly-once guarantees. Used by Alibaba, Netflix, and Uber at massive scale.
A unified analytics engine for large-scale data processing with built-in modules for SQL, streaming, ML, and graph computation. Processes data in-memory for speed.
Application Programming Interface: a contract defining how two pieces of software talk to each other. The waiter between your frontend and your backend.
A pattern where a composer service queries multiple microservices and joins the results in memory. The simplest way to implement cross-service queries.
A single entry point for all client requests that routes them to the appropriate microservice. Handles auth, rate limiting, and request transformation.
A simple token passed with API requests to identify the calling project or application. Not a substitute for user authentication but useful for rate limiting and usage tracking.
Strategies for evolving an API without breaking existing clients. Common approaches: URL path (/v2/users), header (Accept-Version), or query param (?version=2).
Application Performance Monitoring: tools that track request latency, error rates, and dependencies in real time. Datadog, New Relic, and Grafana are popular APM platforms.
A repository for storing build outputs like Docker images, JAR files, and npm packages. Docker Hub, GitHub Container Registry, and Artifactory are common registries.
A communication model where the caller fires off a request and continues without waiting for a response. Essential for non-blocking I/O and event-driven systems.
A messaging guarantee where every message is delivered one or more times. Simpler than exactly-once but requires consumers to handle duplicates via idempotency.
Automatically adding or removing compute instances based on current demand. Scales out during traffic spikes and scales in during quiet periods to save cost.
The percentage of time a system is operational and accessible. Measured in 'nines' — 99.99% availability means about 52 minutes of downtime per year.
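The "nines" arithmetic is easy to check by hand. A minimal sketch, assuming a 365-day year; the function name `downtime_minutes_per_year` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored here


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime that an availability target still allows."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Print the downtime allowance for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why every additional nine is much harder to reach than the last.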
An isolated data center within a cloud region, with independent power, cooling, and networking. Deploying across multiple AZs protects against single-facility failures.
A self-balancing tree data structure used by most relational databases for indexes. Keeps data sorted and allows searches, insertions, and deletions in O(log n).
A flow control mechanism where a slow consumer signals upstream producers to slow down. Prevents systems from being overwhelmed by data they can't process.
A dedicated backend service tailored for a specific frontend (mobile, web, TV). Each frontend gets an API shaped to its exact needs instead of sharing one generic API.
Retroactively populating a new data store, index, or column with historical data. Typically done as a batch job when adding a new feature that needs past data.
The maximum amount of data that can be transferred over a network in a given time. It's the width of the pipe, not how fast the water flows.
An alternative to ACID for distributed systems: Basically Available, Soft state, Eventually consistent. Trades strong consistency for availability.
Processing large volumes of data in scheduled chunks rather than in real time. Think nightly reports, ETL jobs, and data warehouse loads.
Storage that splits data into fixed-size blocks, each with its own address. Used for databases and boot volumes where low-latency random I/O matters. AWS EBS is block storage.
A space-efficient probabilistic data structure that tells you if an element is 'possibly in the set' or 'definitely not in the set.' Used by databases to avoid expensive lookups.
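To make "possibly in the set" versus "definitely not in the set" concrete, here is a small Bloom filter sketch. The class name, the default sizes, and the use of salted SHA-256 as the hash family are all choices made for this example, not something a particular database prescribes.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; real filters pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set"; True means "possibly".
        return all(self.bits[pos] for pos in self._positions(item))
```

An added item always reports True. An item that was never added usually reports False, but can report True when its bits happen to collide with others, which is the false-positive case.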
A deployment strategy using two identical environments. Traffic switches from blue (current) to green (new) instantly, with easy rollback.
A pattern that isolates different parts of a system so a failure in one part doesn't sink the whole ship. Named after the compartments in a ship's hull.
The process of removing entries from a cache when it's full. Common policies: LRU (least recently used), LFU (least frequently used), FIFO (first in, first out).
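Of the policies listed, LRU is the one most often shown in code. A minimal sketch built on Python's `OrderedDict`; the class name `LRUCache` and its `get`/`put` methods are made up for this example.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # a read counts as recent use
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
```

Swapping the eviction line is enough to get FIFO (skip the `move_to_end` calls); LFU needs a usage counter per key instead.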
Removing or updating stale cache entries when the underlying data changes. One of the two hard problems in computer science (along with naming things).
When many requests hit the database simultaneously because a popular cache entry expired. Solved with locking, probabilistic early expiration, or request coalescing.
A caching pattern where the application checks the cache first; on a miss, it fetches from the database and populates the cache. Also called lazy loading.
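The check-then-populate flow can be sketched in a few lines. Here plain dictionaries stand in for both the cache and the database, and the class name `CacheAsideReader` and the `db_reads` counter are hypothetical helpers for illustration.

```python
class CacheAsideReader:
    def __init__(self, database: dict):
        self.database = database  # stand-in for the slow source of truth
        self.cache = {}           # stand-in for Redis, Memcached, etc.
        self.db_reads = 0         # counts how often we fall through

    def get(self, key):
        # 1. Check the cache first.
        if key in self.cache:
            return self.cache[key]
        # 2. On a miss, read from the database.
        self.db_reads += 1
        value = self.database.get(key)
        # 3. Populate the cache so the next read is a hit.
        if value is not None:
            self.cache[key] = value
        return value
```

Note that the application, not the cache, owns this logic. Writes still need an invalidation or update step, which this sketch leaves out.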
Storing frequently accessed data in a faster storage layer so you don't have to fetch it from the original (slower) source every time.
Automated statistical comparison of metrics between the canary (new version) and the baseline (current version) to decide whether to promote or roll back a deployment.
Rolling out a new version to a small percentage of users first, then gradually increasing. Like sending a canary into a coal mine to test for danger.
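In a distributed system, you cannot guarantee Consistency, Availability, and Partition tolerance all at once. Because network partitions cannot be ruled out, the practical trade-off is between consistency and availability while a partition lasts.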
Estimating future resource needs based on current usage trends and expected growth. Answers questions like 'how many servers do we need for 10x traffic in 6 months?'
A consistency model that preserves cause-and-effect ordering: if operation A causally precedes B, all nodes see A before B. Weaker than linearizability but stronger than eventual consistency.
A network of servers distributed globally that caches content close to users. Netflix uses CDNs to stream video from servers near you, not from one central location.
Capturing row-level changes in a database and streaming them to other systems in real time. Debezium reads the write-ahead log and publishes changes to Kafka.
Deliberately injecting failures into a system to test its resilience. Netflix's Chaos Monkey randomly kills servers to ensure the system survives.
Periodically saving the state of a stream processing job so it can recover from failures without reprocessing everything from the beginning. Flink and Spark use distributed checkpoints.
Continuous Integration and Continuous Deployment: automating the process of testing and deploying code. Push code, tests run, and it ships to production automatically.
A pattern that stops calling a failing service after repeated failures, preventing cascade failures. Like an electrical circuit breaker that cuts power to prevent fires.
A circuit breaker cycles through three states: Closed (requests flow normally), Open (requests are blocked after failures), Half-Open (a few test requests check if the service recovered).
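The three states map naturally onto a small state machine. This is a sketch under simple assumptions: a consecutive-failure counter, a fixed reset timeout, and a single trial request in the half-open state. The class name, parameters, and the injectable `clock` are invented for the example; production libraries differ in the details.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request blocked")
            self.state = "half-open"  # timeout elapsed, allow a trial request
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

While the circuit is open, callers fail fast with `RuntimeError` instead of waiting on a service that is known to be struggling.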
The foundational architecture of the web: clients (browsers, apps) send requests and servers process them and return responses. Every web interaction follows this pattern.
The difference in time readings between clocks on different machines. Physical clocks drift; NTP helps but can't guarantee perfect sync. This is why distributed systems use logical clocks.
A geographic area containing one or more data centers (availability zones). Choosing the right region reduces latency and satisfies data residency requirements.
A NoSQL database that groups columns into families, optimized for reading and writing large amounts of data across many machines. Cassandra and HBase use this model.
Gracefully finishing in-flight requests before removing a server from a load balancer's pool. Prevents dropping active connections during deployments or scale-ins.
A cache of reusable database connections that avoids the overhead of opening and closing a new connection for every query. Critical for high-throughput applications.
The process of getting multiple nodes in a distributed system to agree on a single value. The foundation of distributed databases and coordination services.
A hashing technique where adding or removing servers only moves a small fraction of keys. Used by Amazon DynamoDB and Cassandra for data distribution.
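A hash ring shows why only a fraction of keys move. The sketch below is one common way to build it, with virtual nodes to even out the distribution; the class name `HashRing`, the `vnodes` default, and the SHA-256 based hash are choices made for this example and not how any specific database implements it.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node owns many points so load spreads evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring is empty")
        # Walk clockwise to the first point past the key's hash, wrapping around.
        index = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[index][1]
```

When a node is added, the only keys that change owner are the ones the new node takes over; every other key stays where it was. With plain `hash(key) % n`, changing `n` would reshuffle most keys.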
A lightweight, isolated environment that packages an application with its dependencies. Shares the host OS kernel, unlike VMs, so it starts much faster than a VM, often in under a second.
A lightweight, immutable package containing everything needed to run an application: code, runtime, libraries, and config. Built with a Dockerfile and stored in a registry.
The process of distributing and serving content to users from locations geographically close to them for faster load times.
The process where client and server agree on the response format using HTTP Accept headers. Enables serving JSON, XML, or HTML from the same endpoint.
An HTTP response header that restricts which resources (scripts, styles, images) a page can load. A powerful defense against XSS attacks by whitelisting trusted sources.
Cross-Origin Resource Sharing: a security mechanism that controls which domains can access your API. The browser enforces it; the server configures it.
Command Query Responsibility Segregation: using different models for reading and writing data. Reads and writes have different performance needs, so separate them.
Conflict-free Replicated Data Type: a data structure that can be updated independently on different nodes and merged automatically without conflicts. Used for real-time collaboration in libraries such as Yjs and Automerge.
A searchable inventory of all datasets in an organization, with metadata like schema, owner, freshness, and lineage. The 'Google for your data.'
Transforming data into an unreadable format using cryptographic algorithms. Encryption at rest protects stored data; encryption in transit protects data over the network.
Splitting databases by function (users in one DB, orders in another) so each database handles only its domain. Simpler than sharding but limits cross-domain joins.
A centralized repository that stores raw data at any scale in its native format. Unlike a data warehouse, data doesn't need to be structured or cleaned before loading.
Tracking data from its origin through every transformation and system it passes through. Answers 'where did this number come from?' for audits and debugging.
Replacing sensitive data with realistic but fake values so developers and testers can work with production-like data without exposing real PII.
A decentralized data architecture where each domain team owns and publishes its data as a product. Shifts responsibility from a central data team to domain teams.
A central repository of structured, cleaned data optimized for analytical queries. Snowflake, BigQuery, and Redshift are purpose-built data warehouses.
An organized collection of data that can be easily accessed, managed, and updated. The backbone of almost every application.
Dividing a large table into smaller, more manageable pieces while keeping them in the same database. Sharding is partitioning across servers.
A microservices pattern where each service owns its private database. No other service can access it directly. Enforces loose coupling but complicates cross-service queries.
A piece of SQL that automatically executes in response to INSERT, UPDATE, or DELETE events on a table. Useful for audit logs and enforcing complex constraints.
A virtual table defined by a SQL query. Simplifies complex joins, enforces access control, and presents data in a specific shape without duplicating storage.
A queue that stores messages that couldn't be processed after multiple attempts. Prevents poison messages from blocking the main queue and lets you debug failures.
When two or more transactions are each waiting for the other to release a lock, creating a cycle where none can proceed. Databases detect and break deadlocks by aborting one.
Intentionally adding redundant data to database tables to speed up read queries by avoiding expensive joins. Trades storage and write complexity for read performance.
A plan and set of procedures for restoring systems after a catastrophic failure. Defined by RPO (how much data you can lose) and RTO (how long you can be down).
A cache spread across multiple nodes that acts as a shared layer between application servers and the database. Redis Cluster and Memcached are common implementations.
A lock that coordinates access to a shared resource across multiple machines. Implemented via Redis (Redlock), ZooKeeper, or etcd. Much harder than local locks.
Tracking a request as it flows through multiple services in a distributed system. Each service adds its trace, creating a full picture of the request journey.
The phonebook of the internet. Translates human-readable domain names (google.com) into IP addresses that computers understand.
Different types of DNS records map domains to resources: A records point to IPs, CNAME aliases one domain to another, MX routes email, TXT stores verification data.
A platform for packaging applications into lightweight, portable containers. 'Works on my machine' becomes 'works everywhere.'
A NoSQL database that stores data as flexible JSON-like documents. MongoDB and CouchDB let each document have a different structure.
Running computation at the network edge, close to the user, instead of in a central data center. Reduces latency for real-time applications like IoT and streaming.
A distributed search and analytics engine built on Apache Lucene. Powers full-text search, log analysis, and real-time analytics at scale.
The allowed amount of unreliability derived from SLOs. If your SLO is 99.9% uptime, your error budget is 0.1% (about 43 minutes/month). Once exhausted, freeze deployments.
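The 43-minute figure falls out of simple arithmetic. A minimal sketch, assuming a 30-day window; both function names are made up for this example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total minutes of allowed unreliability in the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the budget is exhausted."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


print(f"{error_budget_minutes(99.9):.1f} minutes of budget per 30 days")
```

A team that has already had 50 minutes of downtime against a 99.9% SLO gets a negative result from `budget_remaining`, which is the signal to pause risky releases.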
Extract, Transform, Load: a pipeline that extracts data from sources, transforms it into the desired format, and loads it into a destination like a data warehouse.
A programming pattern that waits for and dispatches events in a single thread. Node.js uses it to handle thousands of concurrent connections without creating a thread per connection.
Storing every state change as an immutable event instead of just the current state. You can rebuild any past state by replaying events.
A design pattern where services communicate by producing and consuming events rather than making direct calls. Promotes loose coupling and asynchronous processing.
A consistency model where updates propagate asynchronously and all replicas will eventually converge to the same value. Trades immediacy for availability.
A processing guarantee where each message is processed exactly one time, even in the face of failures. Achieved through idempotent consumers and transactional producers.
A retry strategy that doubles the wait time between attempts (1s, 2s, 4s, 8s...) with random jitter. Prevents thundering herd problems when many clients retry simultaneously.
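Here is one way the doubling-plus-jitter schedule can look in code. This sketch uses the "full jitter" variant, where each wait is drawn uniformly between zero and the doubled ceiling; other variants add a small random amount on top of the base delay. The function names, the 60-second cap, and the injectable `sleep` are assumptions for the example.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Random wait in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0, ceiling)


def retry(operation, max_attempts: int = 5, base: float = 1.0,
          cap: float = 60.0, sleep=time.sleep):
    """Call operation until it succeeds or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(backoff_delay(attempt, base, cap))
```

The randomness matters as much as the doubling: without it, clients that failed together would all retry at the same instants.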
Deliberately introducing faults (latency, errors, crashes) into a system to verify that resilience mechanisms work. A specific technique within chaos engineering.
A system's ability to keep operating correctly even when some of its components fail. Achieved through redundancy, replication, and graceful degradation.
A toggle in code that enables or disables a feature without redeploying. Lets you ship code to production behind a flag and turn it on for a percentage of users.
A communication pattern where the sender dispatches a message and doesn't wait for or expect a response. Used for logging, analytics events, and non-critical notifications.
A visualization of profiled software showing which functions consume the most CPU or wall-clock time. The x-axis is the stack population and the y-axis is call stack depth.
A column in one table that references the primary key of another table, enforcing referential integrity between related tables.
A proxy that sits in front of clients and forwards their requests to the internet. Used for anonymity, content filtering, and bypassing geo-restrictions.
A planned exercise where teams simulate production failures to test incident response procedures and system resilience. Like a fire drill for your infrastructure.
General Data Protection Regulation: EU law governing how personal data is collected, stored, and processed. Requires consent, data portability, and the right to be forgotten.
DNS that returns different IP addresses based on the geographic location of the requester. Routes users to the nearest data center for lower latency.
Using Git as the single source of truth for infrastructure and application configuration. Changes are made via pull requests and automatically reconciled by tools like ArgoCD or Flux.
A load balancer that routes users to the nearest data center or region based on geography, latency, or health. DNS-based (Route 53, Cloudflare) or anycast-based.
A peer-to-peer communication protocol where nodes share information with random neighbors, spreading it like gossip. Used for cluster membership and failure detection.
A strategy where a system continues to function with reduced capability when a component fails, instead of crashing entirely. For example, show cached results when the database is down.
A visualization platform for creating dashboards from any data source. Connects to Prometheus, Elasticsearch, CloudWatch, and dozens more to display metrics, logs, and traces.
A database built for storing and querying relationships between entities. Nodes are entities, edges are relationships. Neo4j is the most popular.
A query language for APIs where the client specifies exactly what data it needs. No over-fetching, no under-fetching. One endpoint to rule them all.
A high-performance RPC framework by Google using Protocol Buffers and HTTP/2. Often faster than JSON-over-HTTP REST for service-to-service communication, thanks to compact binary encoding and multiplexed connections.
An index that uses a hash function to map keys directly to storage locations. O(1) lookups for exact matches but useless for range queries. Memcached and Redis use hash indexes.
An endpoint or mechanism that reports whether a service is running and healthy. Load balancers use health checks to route traffic away from unhealthy instances.
Periodic probes a load balancer sends to backend servers to verify they're alive. Unhealthy servers get pulled from the rotation until they recover.
A package manager for Kubernetes that bundles K8s manifests into reusable charts. Lets you install complex applications with a single command and manage versioned releases.
When a replica is temporarily down, another node stores writes intended for it as 'hints.' Once the replica recovers, the hints are replayed to catch it up.
A K8s controller that automatically scales the number of pods based on CPU utilization, memory usage, or custom metrics. The Kubernetes equivalent of auto-scaling groups.
Adding more machines to handle increased load (scaling out). Like opening more checkout lanes instead of making one cashier faster.
An uneven distribution of load where one node, shard, or partition receives disproportionately more traffic than others. Caused by poor shard keys or skewed access patterns.
The protocol powering the web. A request-response model where clients ask for resources and servers respond. Stateless by design.
Browser and proxy caching controlled by HTTP headers like Cache-Control, ETag, and Last-Modified. Eliminates redundant network requests for unchanged resources.
A major revision of HTTP that adds multiplexing (multiple requests over one connection), header compression, and server push (though Chrome has since removed support for push). Usually faster than HTTP/1.1 for modern web apps.
HTTP over TLS: the encrypted version of HTTP that protects data in transit. Every production site should use it; browsers flag plain HTTP as insecure.
Software that creates and manages virtual machines by abstracting physical hardware. Type 1 (bare-metal) runs directly on hardware; Type 2 runs on an OS.
An operation that produces the same result whether you run it once or multiple times. Critical for safe retries in distributed systems.
A message consumer that produces the same result whether it processes a message once or multiple times. Achieved by tracking processed message IDs or using natural idempotency.
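Tracking processed message IDs can be sketched as follows. The message shape (`id` and `amount` fields) and the in-memory set are assumptions for the example; a real consumer would keep the IDs in durable storage, ideally updated in the same transaction as the side effect.

```python
class IdempotentConsumer:
    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable dedup table
        self.balance = 0

    def handle(self, message: dict) -> bool:
        """Apply the message once; return False for a duplicate delivery."""
        if message["id"] in self.processed_ids:
            return False  # already handled, skip the side effect
        self.balance += message["amount"]
        self.processed_ids.add(message["id"])
        return True
```

With this in place, at-least-once delivery is safe: a redelivered message is recognized and ignored rather than applied twice.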
Interface Definition Language: a schema that defines the contract between client and server. Protobuf, Thrift, and OpenAPI are IDLs that generate client/server code.
Servers are never modified after deployment. To update, you build a new image and replace the old instance entirely. Eliminates configuration drift.
The structured process for detecting, containing, eradicating, and recovering from security incidents. Includes communication plans, runbooks, and post-incident reviews.
A data structure that speeds up database lookups. Like the index at the back of a book that lets you jump to the right page instead of reading every page.
Managing servers, networks, and cloud resources through declarative configuration files instead of manual setup. Terraform, Pulumi, and CloudFormation are IaC tools.
Internet Protocol: the addressing scheme that routes packets across the internet. Every device gets an IP address (IPv4 or IPv6) so packets know where to go.
Controls how much one transaction can see changes made by other concurrent transactions. Ranges from Read Uncommitted (fastest, least safe) to Serializable (slowest, safest).
Incremental Static Regeneration: a Next.js feature that re-generates static pages in the background after a specified time interval, combining SSG speed with fresh data.
JavaScript Object Notation: a lightweight text format for data interchange using key-value pairs and arrays. The lingua franca of web APIs.
A distributed event streaming platform that handles millions of events per second. Used by LinkedIn, Netflix, and Uber for real-time data pipelines.
A set of consumers that cooperatively read from topic partitions. Each partition is assigned to exactly one consumer in the group, enabling parallel processing.
A subset of a Kafka topic that provides ordering guarantees and parallel processing. Each partition has its leader on one broker (replicas live on others) and is consumed by at most one consumer per group.
A named feed or category to which producers publish records. Topics are split into partitions for parallelism, and each partition is an ordered, immutable log.
The simplest NoSQL model: store data as key-value pairs. Blazing fast lookups by key. Redis, DynamoDB, and etcd are key-value stores.
An orchestration platform that automates deploying, scaling, and managing containerized applications. K8s is the operating system for your cloud.
A K8s resource that stores non-sensitive configuration data as key-value pairs. Pods read ConfigMaps as environment variables or mounted files, decoupling config from code.
A K8s resource that manages rolling updates and rollbacks for a set of pods. You declare the desired state and K8s converges to it.
A K8s resource that manages external HTTP/HTTPS access to services inside the cluster. Routes traffic by hostname or URL path to different backend services.
A virtual cluster within a physical K8s cluster that isolates resources. Use namespaces to separate environments (dev, staging, prod) or teams.
The smallest deployable unit in Kubernetes: one or more containers sharing network and storage. Pods are ephemeral; if one dies, K8s creates a replacement.
A K8s resource for storing sensitive data like passwords, tokens, and certificates. Base64-encoded by default; use encryption at rest and RBAC to properly secure them.
A stable network endpoint that load-balances traffic across a set of pods. Pods come and go, but the Service's IP and DNS name stay constant.
A simple logical clock where each event increments a counter. If event A causes event B, A's timestamp is always less than B's. The foundation of logical time in distributed systems.
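The rules fit in a few lines: increment on every local event, and on receiving a message jump past the sender's timestamp. A minimal sketch; the class and method names are made up for this example.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """A local event happened."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Move past both our own time and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

The `max(...) + 1` step is what guarantees that a receive is always stamped later than the send that caused it, even if the receiver's counter was far behind.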
The time delay between sending a request and getting a response. Amazon found every 100ms of extra latency costs 1% in sales.
Load balancing at the transport layer (TCP/UDP) based on IP addresses and ports. Fast because it doesn't inspect packet contents.
Load balancing at the application layer (HTTP) that can route based on URL paths, headers, cookies, or request content. More flexible but more CPU-intensive.
The process of choosing one node in a cluster to coordinate actions. If the leader fails, a new one is elected. Used by Kafka, ZooKeeper, and etcd.
A replication approach where any node can accept reads and writes. Uses quorum reads/writes for consistency. Dynamo-style databases such as Cassandra and Riak use this model.
The strongest consistency guarantee: every operation appears to take effect atomically at some point between its invocation and completion. As if there's a single copy of the data.
Distributes incoming traffic across multiple servers so no single server gets overwhelmed. Like a traffic cop directing cars to different lanes.
The strategy a load balancer uses to distribute requests. Round-robin, least-connections, weighted, IP-hash, and random are common algorithms, each with different trade-offs.
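Two of the listed algorithms, round-robin and least-connections, are short enough to sketch side by side. The class names and the `acquire`/`release` methods are invented for this example; real load balancers also account for health checks and weights.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    """Send each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self) -> str:
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        self.active[server] -= 1
```

Round-robin is stateless and fair when requests cost about the same; least-connections adapts better when some requests are much slower than others.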
Deliberately dropping low-priority requests during overload to protect the system's ability to serve high-priority traffic. Better to serve some requests than crash serving none.
Collecting logs from all services into a central searchable store. The ELK stack (Elasticsearch, Logstash, Kibana) and Grafana Loki are common solutions.
Recording discrete events with timestamps, severity levels, and context. Structured logs (JSON) are easy to search and filter; unstructured logs (plaintext) need fragile text parsing. Ship them to a central system.
The client sends a request and the server holds it open until new data is available or a timeout is reached. A workaround for real-time updates before WebSockets existed.
Log-Structured Merge Tree: a write-optimized data structure that buffers writes in memory and periodically flushes sorted runs to disk. Used by Cassandra, RocksDB, and LevelDB.
A programming model for processing massive datasets in parallel across a cluster. Map splits data into key-value pairs; Reduce aggregates them. Pioneered by Google, implemented by Hadoop.
A precomputed query result stored as a physical table and refreshed periodically. Trades storage for read performance on expensive aggregations.
A simple, high-performance distributed memory caching system. Stores key-value pairs in RAM. Simpler than Redis but less feature-rich.
Middleware that translates messages between different protocols, routes them to the right consumers, and provides guarantees like ordering and delivery. RabbitMQ and ActiveMQ are brokers.
A buffer that stores messages between producers and consumers. Messages are processed one by one, in order. Think of it as a to-do list for your services.
Numerical measurements collected over time that describe system behavior: request rate, error rate, latency percentiles, CPU utilization. Prometheus is the standard collector.
An architecture where an application is split into small, independent services that communicate over the network. Each service owns its own data and can be deployed separately.
A single, unified application where all features share the same codebase and deployment. Simpler to start with but harder to scale individual parts.
A guarantee that once you read a value, subsequent reads will never return an older value. Prevents the disorienting experience of data appearing to go backward in time.
Mutual TLS: both client and server present certificates to authenticate each other. Standard in service mesh architectures where every service verifies its peers.
Mean Time To Detect: the average time between a failure occurring and being noticed. Shorter MTTD means better monitoring and alerting. You can't fix what you don't know is broken.
Mean Time To Recovery: the average time from when a failure is detected to when the service is restored. A key reliability metric that drives investment in automation and runbooks.
A replication topology where multiple nodes accept writes independently and sync with each other. Useful for multi-datacenter setups but creates conflict resolution challenges.
A single software instance serving multiple customers (tenants) while keeping their data isolated. Can be achieved at the database, schema, or row level.
Network Address Translation gateway: allows instances in a private subnet to access the internet while preventing inbound connections. Keeps your backend servers invisible to the public.
A break in communication between nodes in a distributed system. Some nodes can't reach others. The 'P' in CAP theorem that forces the trade-off between consistency and availability.
Organizing database tables to reduce redundancy by splitting data into related tables connected by foreign keys. Follows normal forms (1NF, 2NF, 3NF).
An authorization framework that lets users grant third-party apps limited access to their accounts without sharing passwords. Powers 'Sign in with Google.'
A storage architecture that manages data as objects (file + metadata + ID) rather than blocks or files. S3 is the gold standard. Virtually unlimited capacity, cheap, and durable.
The ability to understand a system's internal state from its external outputs. Built on three pillars: metrics, logs, and traces.
A vendor-neutral open standard for collecting metrics, logs, and traces from applications. Provides SDKs and a collector that ships telemetry to any observability backend.
Object-Relational Mapping: a library that lets you interact with a database using your programming language's objects instead of raw SQL. Drizzle, Prisma, and SQLAlchemy are ORMs.
A pattern where a service writes events to an outbox table in the same database transaction as its state change. A separate process reads the outbox and publishes events, ensuring atomicity.
Breaking large result sets into smaller pages. Offset-based (page=2&limit=20) is simple; cursor-based (after=abc123) handles real-time data without skipping or duplicating items.
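The difference between the two styles shows up when data changes between requests. A sketch over an in-memory list sorted by ascending `id`; the function names and the item shape are assumptions for the example.

```python
def offset_page(items, page: int, limit: int):
    """Offset-based: skip (page - 1) * limit rows, then take limit rows."""
    start = (page - 1) * limit
    return items[start:start + limit]


def cursor_page(items, after_id, limit: int):
    """Cursor-based: take rows whose id comes after the cursor.

    Assumes items are sorted by ascending id.
    """
    remaining = [item for item in items if after_id is None or item["id"] > after_id]
    page = remaining[:limit]
    next_cursor = page[-1]["id"] if page else None
    return page, next_cursor


items = [{"id": i} for i in range(1, 6)]
first_page, cursor = cursor_page(items, None, 2)

# A new row arrives at the front before the client asks for page two.
items.insert(0, {"id": 0})

print(offset_page(items, 2, 2))         # repeats id 2 from page one
print(cursor_page(items, cursor, 2)[0]) # continues cleanly from id 3
```

The offset version shifts when rows are inserted ahead of the reader, while the cursor version picks up exactly where the last page ended.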
A family of protocols for solving consensus in unreliable networks. Famously difficult to understand but mathematically proven correct.
A decentralized architecture where each node acts as both client and server. No central authority. Used by BitTorrent, blockchain, and WebRTC.
Authorized simulated attacks against a system to find security vulnerabilities before real attackers do. White-hat hackers probe for weaknesses in a controlled environment.
Personally Identifiable Information: any data that can identify a specific individual, like name, email, SSN, or IP address. Must be encrypted and access-controlled.
A blameless analysis conducted after an incident to document what happened, why, and how to prevent it from recurring. The most important output is the list of action items.
A column (or combination of columns) that uniquely identifies each row in a database table. Must be unique and not null.
A replication topology where one node (primary) handles writes and replicates changes to one or more read-only replicas. The foundation of most database scaling strategies.
An open-source monitoring system that scrapes metrics endpoints, stores time-series data, and supports powerful PromQL queries. The de facto standard for Kubernetes monitoring.
Google's language-neutral, binary serialization format. Typically smaller and faster to parse than JSON. Defines schemas in .proto files that generate typed code for many languages.
An intermediary server that sits between the client and the destination server. Forward proxies act on behalf of clients; reverse proxies act on behalf of servers.
A messaging pattern where publishers send messages to topics, and subscribers receive messages from topics they care about. Publishers don't know who's listening.
The minimum number of nodes that must agree for a read or write to succeed. With N replicas, W+R > N ensures overlap between write and read sets for consistency.
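The overlap condition can be checked by brute force for small clusters. This is an illustrative sketch, not production code; `quorums_overlap` and `always_intersect` are hypothetical helper names.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The arithmetic rule: write and read quorums overlap when W + R > N."""
    return w + r > n


def always_intersect(n: int, w: int, r: int) -> bool:
    """Brute-force check: does every possible write set share a node with every read set?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


# N=3, W=2, R=2: any read set contains at least one node that saw the latest write.
print(quorums_overlap(3, 2, 2), always_intersect(3, 2, 2))
# N=3, W=1, R=1: a read can land on a node that missed the write.
print(quorums_overlap(3, 1, 1), always_intersect(3, 1, 1))
```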
A consensus algorithm designed to be understandable. Uses leader election and log replication. Powers etcd (used by Kubernetes) and CockroachDB.
Controlling how many requests a client can make in a given time window. Protects your API from abuse and ensures fair usage.
Role-Based Access Control: assigns permissions to roles (admin, editor, viewer), then assigns roles to users. Simpler to manage than per-user permissions.
A technique where a read operation detects stale data on a replica and triggers an update to bring it in sync. Used by Cassandra and other Dynamo-style databases to heal inconsistencies lazily.
A copy of your database that handles read queries, reducing load on the primary database. Writes still go to the primary and replicate out.
A consistency guarantee that after a user performs a write, their subsequent reads will always reflect that write. Without it, a user might save data and see the old version.
An in-memory data store used as a cache, message broker, and database. Blazing fast because everything lives in RAM.
Redis supports strings, lists, sets, sorted sets, hashes, streams, bitmaps, and HyperLogLogs. Each structure solves different problems: sorted sets for leaderboards, streams for event logs.
Duplicating critical components or functions so that if one fails, a backup takes over. The reason planes have two engines and databases have replicas.
Keeping copies of the same data on multiple servers. Improves read performance and provides fault tolerance if one server goes down.
The most basic communication pattern: one party sends a request and waits for the other to send a response. HTTP, REST, and unary gRPC calls all follow this pattern.
An architectural style for building APIs using standard HTTP methods (GET, POST, PUT, DELETE). Resources are identified by URLs.
Automatically re-attempting a failed operation, usually with exponential backoff. Essential for handling transient failures in distributed systems.
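A minimal sketch of retrying with exponential backoff and full jitter. The function name `retry` and its parameters are assumptions for illustration. The sketch catches every `Exception` for brevity; real code should retry only errors known to be transient.

```python
import random
import time


def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The wait before attempt k+1 is a random value in [0, min(max_delay, base_delay * 2**k)].
    `sleep` is injectable so the function can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Example: an operation that fails twice before succeeding.
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(retry(flaky_call, base_delay=0.01))
```

The random jitter spreads retries out so that many clients recovering from the same failure do not all retry at the same instant.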
A server that sits in front of your backend servers and forwards client requests to them. Handles SSL termination, caching, and load balancing.
Gradually replacing old instances with new ones, a few at a time. No downtime, but both versions run simultaneously during the rollout.
A load balancing algorithm that distributes requests to servers in sequential order, cycling through the list. Simple but ignores server capacity differences.
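A minimal sketch of round-robin selection. The class name `RoundRobin` and the server names are hypothetical.

```python
from itertools import cycle


class RoundRobin:
    """Hand out servers in a fixed, repeating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = cycle(servers)

    def next_server(self):
        return next(self._servers)


balancer = RoundRobin(["app-1", "app-2", "app-3"])
print([balancer.next_server() for _ in range(5)])
```

Every server receives the same share of requests regardless of its capacity or current load, which is the limitation noted above.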
Recovery Point Objective: the maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can afford to lose up to 1 hour of data.
Recovery Time Objective: the maximum acceptable downtime after a disaster. An RTO of 15 minutes means the system must be back online within 15 minutes of failure.
A documented set of step-by-step procedures for handling specific operational tasks or incidents. Good runbooks reduce MTTR by giving on-call engineers a clear action plan.
A saga implementation where each service listens for events and decides what to do next independently. No central coordinator. Looser coupling but harder to trace.
A saga implementation where a central orchestrator tells each service what step to execute next. Easier to debug than choreography but creates a single point of coordination.
A way to manage distributed transactions across microservices using a sequence of local transactions with compensating actions for rollback.
A system's ability to handle growing amounts of work by adding resources. A scalable system maintains performance as load increases.
A centralized service that stores and validates schemas for event-driven systems. Confluent Schema Registry ensures Kafka producers and consumers agree on data formats.
Securely storing and distributing credentials, API keys, and certificates. Tools like Vault, AWS Secrets Manager, and SOPS prevent secrets from leaking into code or logs.
A systematic evaluation of a system's security posture against standards and best practices. Covers access controls, encryption, logging, vulnerability management, and compliance.
A guarantee that concurrent transactions produce the same result as if they were executed one at a time in some serial order. The gold standard for database transaction isolation.
Converting an in-memory data structure into a byte stream for storage or network transmission. Deserialization is the reverse. JSON, Protobuf, and Avro are serialization formats.
A cloud execution model where the provider manages all infrastructure and you pay only for actual compute time. AWS Lambda, Vercel Functions, and Cloudflare Workers are serverless.
The mechanism by which microservices find and communicate with each other. Services register themselves and others can look them up by name.
A dedicated infrastructure layer for handling service-to-service communication in microservices. Manages load balancing, encryption, and observability automatically.
A script the browser runs in the background, separate from the web page. Enables offline caching, push notifications, and background sync for progressive web apps.
A way to maintain state across multiple HTTP requests. The server stores data about a user and gives them a session ID (usually in a cookie).
A helper container deployed alongside the main application container in the same pod. Handles cross-cutting concerns like logging, monitoring, or TLS without modifying app code.
Service Level Agreement: a contractual commitment between provider and customer specifying uptime, response time, and penalties for breaches. The business version of an SLO.
Service Level Indicator: a quantitative measure of service behavior, like the proportion of requests faster than 300ms. The raw metric that feeds SLOs.
A rate limiting algorithm that tracks requests in a rolling time window. More accurate than fixed windows because it smooths out spikes at window boundaries.
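One common way to implement a rolling window is to keep a log of recent request timestamps. The sketch below assumes a single client and takes the current time as an argument so that it is easy to test; `SlidingWindowLimiter` is a hypothetical name.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any rolling `window_seconds` interval."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._timestamps = deque()  # times of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the rolling window.
        while self._timestamps and self._timestamps[0] <= now - self.window:
            self._timestamps.popleft()
        if len(self._timestamps) < self.limit:
            self._timestamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter(limit=2, window_seconds=10)
print([limiter.allow(t) for t in (0, 1, 5, 10.5)])
```

The log costs memory proportional to the limit per client; counter-based approximations trade some accuracy for less state.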
Service Level Objective: a target value for an SLI, like '99.9% of requests under 300ms.' The internal engineering goal that drives reliability investment.
A quick, basic test run after deployment to verify the most critical paths still work. If the smoke test fails, the deployment is rolled back immediately.
An isolation level where each transaction reads from a consistent snapshot of the database taken at the transaction's start time. Prevents dirty reads and non-repeatable reads without the overhead of full serializability, but still allows anomalies such as write skew.
A single unit of work in a distributed trace, representing one operation (e.g., an HTTP call or database query). Spans are nested to form a trace tree.
A failure scenario where a network partition causes two parts of a cluster to operate independently, each believing it holds the leader. Can cause data corruption if not handled.
Structured Query Language for managing relational databases. Tables, rows, columns, and powerful joins to query related data.
An attack where malicious SQL is inserted into a query through user input. Prevented by parameterized queries and prepared statements. Never concatenate user input into SQL.
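The difference between concatenation and a parameterized query can be shown with Python's built-in `sqlite3` module. The table, the data, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(name: str):
    # Vulnerable: user input becomes part of the SQL text itself.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user(name: str):
    # Safe: the driver passes `name` as data, never as SQL.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()


payload = "x' OR '1'='1"
print(find_user_unsafe(payload))  # the injected OR clause matches every row
print(find_user(payload))         # no user has this literal name
```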
Google's discipline for running reliable production systems. Applies software engineering to operations: automation over toil, SLOs over uptime promises, and error budgets for velocity.
Server-Sent Events: a one-way channel where the server pushes updates to the client over HTTP. Simpler than WebSockets when you only need server-to-client streaming.
Decrypting TLS-encrypted traffic at a load balancer or reverse proxy so backend servers receive plain HTTP. Offloads CPU-intensive crypto work and simplifies certificate management.
Cryptographic protocols that encrypt data in transit between client and server. TLS is the modern successor to SSL. The 'S' in HTTPS.
Server-Side Rendering: generating HTML on the server for each request. Slower than SSG but always returns fresh data. Good for personalized or frequently changing pages.
Running the same deterministic state machine on multiple nodes, applying the same commands in the same order. If they start in the same state and see the same inputs, they stay in sync.
A system that remembers previous interactions. The server keeps track of client state between requests, making it harder to scale but sometimes necessary.
A system where each request contains all the information needed to process it. The server doesn't remember previous requests. Easier to scale horizontally.
Pre-rendering pages to static HTML at build time. Page loads are very fast because there's no server-side computation on each request.
A load balancer feature that routes all requests from the same client to the same backend server. Needed when servers store session state locally.
Pre-compiled SQL code stored in the database that can be called by name. Reduces network round trips and keeps business logic close to the data.
A migration strategy where you gradually replace a monolith by routing features one by one to new microservices, until the old system can be decommissioned.
Processing data continuously as it arrives, rather than in batches. Powers real-time analytics, fraud detection, and live dashboards.
A guarantee that after a write completes, all subsequent reads will return the updated value. Safer but slower than eventual consistency.
A communication model where the caller waits for the operation to complete before moving on. Simpler to reason about but blocks the thread.
The high-percentile response times (p99, p99.9) that describe the slowest requests. A system with a 10ms median but a 2s p99 is slow for 1 in 100 requests.
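A small sketch of how a median can hide a slow tail, using the nearest-rank percentile method. The latency samples are made up for illustration, and `percentile` is a hypothetical helper, not a library function.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [10] * 98 + [2000] * 2

print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```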
A reliable transport protocol that guarantees data arrives in order and without errors. It uses a three-way handshake to establish connections.
An infrastructure-as-code (IaC) tool by HashiCorp that provisions infrastructure across many cloud providers using declarative HCL configuration. Core workflow: plan, apply, destroy.
A structured process for identifying security threats, attack surfaces, and mitigations during system design. STRIDE and DREAD are common frameworks.
An extension of two-phase commit that adds a pre-commit phase to reduce the blocking window. Still not widely used in practice due to complexity; sagas are preferred.
Slowing down the rate of processing requests instead of rejecting them outright. The gentler cousin of rate limiting.
The number of operations a system can handle per unit of time. Think of it as how many cars a highway can move per hour.
When many clients simultaneously retry or reconnect after a failure, overwhelming the recovering system. Solved by jittered backoff, request coalescing, and admission control.
A database optimized for time-stamped data points like metrics, sensor readings, and financial ticks. InfluxDB and TimescaleDB are purpose-built for this.
A maximum duration to wait for an operation to complete before giving up. Without timeouts, a stalled dependency can hang your entire system indefinitely.
Manual, repetitive, automatable operational work that scales linearly with service size. SRE teams aim to keep toil below 50% of their time and automate the rest.
A rate limiting algorithm where tokens are added to a bucket at a fixed rate. Each request consumes a token; requests are rejected when the bucket is empty. Allows short bursts.
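A minimal token bucket sketch. It refills lazily, computing the tokens earned since the last call instead of running a timer. The class name `TokenBucket` is hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Refill at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: int, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last = now

    def allow(self, now: float) -> bool:
        # Add the tokens earned since the previous call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1, capacity=2)
print([bucket.allow(t) for t in (0, 0, 0, 0.5, 1.0)])
```

The capacity sets the largest burst the bucket will accept; the rate sets the sustained throughput.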
A marker that indicates a deleted record in a distributed database. Since replicas may not see the delete immediately, the tombstone prevents deleted data from being resurrected.
The full end-to-end path of a request through a distributed system, composed of multiple spans. A trace shows which services were called, in what order, and how long each took.
A sequence of database operations treated as a single atomic unit. Either all operations succeed (commit) or none of them do (rollback).
A branching model where developers commit to a single main branch frequently with small changes. Feature flags replace long-lived feature branches.
Time To Live: how long a cached entry, DNS record, or packet is valid before it expires and must be refreshed or discarded.
A methodology for building cloud-native applications with 12 principles: codebase, dependencies, config, backing services, build/release/run, processes, port binding, concurrency, disposability, dev/prod parity, logs, admin processes.
A protocol ensuring all nodes in a distributed transaction either commit or abort together. Phase 1: prepare (vote). Phase 2: commit or rollback.
A logical clock that tracks causality across distributed nodes using a vector of counters. Each node increments its own counter and merges vectors on message receipt.
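A minimal sketch of these rules, representing each clock as a dictionary from node name to counter. The function names `tick`, `merge`, and `happened_before` are assumptions for illustration.

```python
def tick(clock: dict, node: str) -> dict:
    """Local event on `node`: increment its own counter."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local: dict, received: dict, node: str) -> dict:
    """Message receipt on `node`: take the element-wise max, then count the receive event."""
    merged = {
        n: max(local.get(n, 0), received.get(n, 0))
        for n in local.keys() | received.keys()
    }
    return tick(merged, node)


def happened_before(a: dict, b: dict) -> bool:
    """True if a causally precedes b: a <= b in every entry and a < b in at least one."""
    nodes = a.keys() | b.keys()
    return all(a.get(n, 0) <= b.get(n, 0) for n in nodes) and any(
        a.get(n, 0) < b.get(n, 0) for n in nodes
    )


a1 = tick({}, "A")        # event on node A
b1 = tick({}, "B")        # independent event on node B
b2 = merge(b1, a1, "B")   # B receives a message carrying a1

print(happened_before(a1, b2))                           # a1 precedes b2
print(happened_before(a1, b1), happened_before(b1, a1))  # neither precedes the other: concurrent
```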
A database optimized for storing and querying high-dimensional vectors. Powers AI applications like semantic search and recommendation systems.
Making a single machine more powerful (more CPU, RAM, storage). Simpler but has physical limits. Also called 'scaling up.'
A software emulation of a physical computer running its own OS on shared hardware. Heavier than containers but provides stronger isolation.
Virtual Private Cloud: a logically isolated section of the cloud where you launch resources in a virtual network you define. Controls IP ranges, subnets, route tables, and gateways.
Web Application Firewall: filters and monitors HTTP traffic between a web application and the internet. Blocks SQL injection, XSS, and other OWASP top-10 attacks.
A timestamp that tracks how far a stream processing system has progressed through event time. Tells the system when it's safe to close a window and emit results, even with late-arriving data.
An HTTP callback triggered by an event. Instead of polling for updates, the source system pushes a notification to your URL when something happens.
A protocol for full-duplex communication over a single TCP connection. Unlike HTTP, the server can push data to the client without being asked.
Grouping stream events into finite time-based or count-based windows for aggregation. Tumbling windows don't overlap; sliding windows do; session windows group by activity gaps.
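A minimal sketch of the tumbling (non-overlapping) case, counting events per fixed-size window. It assumes integer event timestamps in seconds; `tumbling_counts` is a hypothetical helper, not part of any stream processing framework.

```python
from collections import Counter


def tumbling_counts(timestamps, size):
    """Count events per window; each window is keyed by its start time."""
    counts = Counter((ts // size) * size for ts in timestamps)
    return dict(sorted(counts.items()))


# Events at these times, grouped into 10-second windows: [0, 10), [10, 20), [20, 30).
print(tumbling_counts([1, 4, 9, 10, 12, 25], size=10))
```

Each event belongs to exactly one tumbling window. With sliding windows, an event can fall into several overlapping windows.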
A technique where changes are written to a log before being applied to the database. Ensures durability and crash recovery.
A caching pattern where writes go to the cache first and are asynchronously flushed to the database later. Fast writes but risks data loss on cache failure.
A caching pattern where every write goes to both the cache and the database synchronously, before the write is acknowledged. Keeps the cache consistent with the database but adds write latency.
Extensible Markup Language: a verbose, tag-based format for structured data. Still used in enterprise systems, SOAP, and configuration files.
A security model that never trusts any request by default, even from inside the network. Every request must be authenticated, authorized, and encrypted regardless of origin.