Cloud Infrastructure
Every product you ship runs on somebody's infrastructure. The question is how much of it you own and how much you rent. Cloud infrastructure is the layer between your code and the physical machines in a datacenter: the network your traffic flows through, the boundaries that keep strangers out, the regions that decide how far a request has to travel, and the service models that determine whether you manage a server or never see one. Get this layer wrong and you pay for it in outages, security incidents, and surprise bills. A single misconfigured route table or an overly broad security group can take down a service whose application code is flawless.
This category walks the full stack from the ground up. You start with how addressing and routing actually work (static and dynamic IP, IPv4 vs IPv6, DHCP, CIDR, subnets, NAT), build into private cloud networks (VPC, internet gateways, NAT gateways, regions, availability zones), then move through the service models that define modern cloud (IaaS, PaaS, SaaS, serverless, FaaS), the access and perimeter patterns that keep production safe (bastion hosts, DMZ, network segmentation), and the connectivity and performance topics that tie multiple environments together (VPC peering, Transit Gateway, Direct Connect, Private Link, QUIC, service mesh). By the end you should be able to read an architecture diagram and trace where a packet goes and why.
What Cloud Infrastructure Actually Is
Cloud infrastructure is the set of networking, compute, and connectivity primitives a provider gives you so you do not have to buy and rack physical hardware. At the bottom sits the network. Before anything else makes sense you need to understand how machines find each other, which is why this category begins with IP addressing. A Static IP stays fixed until you deliberately change it, and is what you point a domain or a firewall rule at. A Dynamic IP is handed out and reclaimed automatically, usually by DHCP, which is fine for laptops and bad for servers that other systems must reach reliably. CIDR notation (the /24 you keep seeing) is how you describe a block of addresses: the number after the slash is how many leading bits identify the network, and the remaining bits number the hosts inside it. Subnets are how you slice that block into smaller zones with different rules.
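As a small illustration of CIDR blocks and subnets, the sketch below uses Python's standard `ipaddress` module. The `10.0.0.0/24` block and the choice of four /26 subnets are assumptions made for the example, not values from any particular provider.

```python
import ipaddress

# Hypothetical private block chosen for illustration.
block = ipaddress.ip_network("10.0.0.0/24")

# A /24 fixes 24 network bits and leaves 8 host bits: 2**8 addresses.
total_addresses = block.num_addresses

# Slice the /24 into four /26 subnets of 64 addresses each.
subnets = list(block.subnets(new_prefix=26))

# Find which subnet a given address belongs to.
addr = ipaddress.ip_address("10.0.0.130")
owner = next(s for s in subnets if addr in s)

print(total_addresses)
for subnet in subnets:
    print(subnet, subnet.num_addresses)
print(f"{addr} is in {owner}")
```

Providers typically reserve a few addresses in every subnet, so the usable count is somewhat lower than `num_addresses` suggests.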
On top of addressing sits the transport story. The OSI Model and the TCP/IP Stack are the mental maps for where each protocol lives, and they explain why a load balancer at layer 4, which sees only addresses and ports, behaves differently from one at layer 7, which can route on the contents of a request such as an HTTP path or header. TCP gives you ordered, reliable delivery; UDP skips those guarantees in exchange for lower latency, which is why it underpins real-time video and gaming. NAT (Network Address Translation) is the trick that lets many private machines share one public address, and understanding it removes most of the confusion around why an instance can reach the internet but the internet cannot reach it.
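The asymmetry NAT creates (outbound works, unsolicited inbound does not) can be sketched with a toy translation table. This is a deliberately simplified model: `ToyNat`, the starting port 40000, and the addresses are hypothetical, and a real NAT also tracks the remote endpoint, the protocol, and timeouts.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


@dataclass
class ToyNat:
    """Simplified port-translating NAT; not how any real device is implemented."""
    public_ip: str
    next_port: int = 40000  # assumed start of the public port pool
    by_public_port: Dict[int, Endpoint] = field(default_factory=dict)
    by_private: Dict[Endpoint, int] = field(default_factory=dict)

    def outbound(self, src: Endpoint) -> Endpoint:
        """Rewrite a private source endpoint to the shared public address."""
        if src not in self.by_private:
            self.by_private[src] = self.next_port
            self.by_public_port[self.next_port] = src
            self.next_port += 1
        return (self.public_ip, self.by_private[src])

    def inbound(self, public_port: int) -> Optional[Endpoint]:
        """Return the private endpoint for a reply, or None if nothing maps."""
        return self.by_public_port.get(public_port)


nat = ToyNat(public_ip="203.0.113.10")
a = nat.outbound(("10.0.1.5", 51000))
b = nat.outbound(("10.0.1.6", 51000))

print(a, b)                # two private hosts share one public IP
print(nat.inbound(a[1]))   # reply to an existing flow is delivered
print(nat.inbound(12345))  # unsolicited traffic has no mapping: None
```

The mapping only exists because a private machine spoke first, which is why traffic initiated from the internet has nowhere to go.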
Once the fundamentals click, the cloud-specific pieces are just managed versions of them. A VPC is your own private network inside the provider. An Internet Gateway connects that network to the public internet, a NAT Gateway lets private machines make outbound calls without being exposed, and an Elastic IP is a static public address you control across instance restarts. IPAM is how large organizations track all of these addresses before they collide.
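One way to picture the difference between a public and a private subnet is through their route tables. The sketch below assumes a hypothetical VPC of `10.0.0.0/16` and uses made-up target labels (`local`, `internet-gateway`, `nat-gateway`); it applies longest-prefix matching, the standard rule by which the most specific matching route wins.

```python
import ipaddress
from typing import List, Tuple

Route = Tuple[str, str]  # (destination CIDR, target label)


def pick_target(routes: List[Route], destination: str) -> str:
    """Return the target of the most specific route that matches."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        raise LookupError(f"no route to {destination}")
    # Longest prefix (largest prefixlen) is the most specific match.
    return max(matches)[1]


# Hypothetical route tables for illustration.
public_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "internet-gateway")]
private_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")]

print(pick_target(public_routes, "10.0.2.15"))      # stays inside the VPC
print(pick_target(public_routes, "198.51.100.7"))   # leaves via the internet gateway
print(pick_target(private_routes, "198.51.100.7"))  # leaves via the NAT gateway
```

In this model the only thing that makes a subnet "private" is where its default route points.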
The Service Models: How Much Do You Want to Manage
The deepest decision in cloud is how much of the stack you operate yourself, and that is what the as-a-service models describe. IaaS (Infrastructure as a Service) gives you raw virtual machines and networks; you control the operating system and everything above it, and you carry the operational weight that comes with that. PaaS (Platform as a Service) hands you a runtime where you deploy code and the provider handles patching, scaling, and the machines underneath. SaaS (Software as a Service) is finished software you simply log into. CaaS (Containers as a Service) and DBaaS (Database as a Service) are the same idea applied to containers and databases specifically.
Serverless pushes this further. With FaaS (Functions as a Service) you ship a single function and the provider runs it only when an event fires, scaling from zero to thousands of concurrent executions and back to nothing. BaaS (Backend as a Service) gives you ready-made auth, storage, and APIs so a frontend team can ship without a backend. The appeal is real: you pay for execution, not idle capacity, and you never patch a server.
The trade-off has a name, and it is Cold Start. When a function has been idle, the first invocation pays a startup penalty while the runtime spins up, which can add hundreds of milliseconds or more to a request. Warm Instances are the mitigation: keeping a pool ready so latency-sensitive paths rarely hit a cold start. The rule of thumb across these models is straightforward. Choose the highest level of abstraction that still meets your latency, control, and cost requirements, and drop down a level only when a concrete constraint forces you to.
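The cold-start penalty and the effect of keeping an instance warm can be sketched with a toy latency model. Every number here (400 ms startup, 20 ms of work, a 600 s idle timeout, a ping every 300 s) is an assumption for illustration, and `ToyFunctionRuntime` is a hypothetical class, not any provider's behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFunctionRuntime:
    cold_start_ms: float = 400.0    # assumed startup penalty
    handler_ms: float = 20.0        # assumed work per request
    idle_timeout_s: float = 600.0   # assumed time before the instance is reclaimed
    last_invoked_s: Optional[float] = None

    def invoke(self, now_s: float) -> float:
        """Return the latency in ms for a request arriving at now_s."""
        is_cold = (
            self.last_invoked_s is None
            or now_s - self.last_invoked_s > self.idle_timeout_s
        )
        self.last_invoked_s = now_s
        return self.handler_ms + (self.cold_start_ms if is_cold else 0.0)


# Without warming: the first call and the call after a long gap are cold.
runtime = ToyFunctionRuntime()
latencies = [runtime.invoke(t) for t in (0, 5, 10, 2000)]
print(latencies)

# With a periodic keep-warm ping every 300 s, the late request stays warm.
warm = ToyFunctionRuntime()
warm.invoke(0)  # the initial cold start still happens once
for t in range(300, 2000, 300):
    warm.invoke(t)
final_latency = warm.invoke(2000)
print(final_latency)
```

The model ignores concurrency: a burst of simultaneous requests can still trigger cold starts on new instances even when one instance is warm.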
Connectivity, Perimeter, and the Trade-offs That Matter
Real systems are never one network. You have production and staging, an on-premise datacenter, partner accounts, and edge locations, and they all need controlled paths between them. VPC Peering is the simplest: a direct one-to-one link between two networks. It stops scaling well once you have many networks, which is when Transit Gateway earns its place as a central hub that connects dozens of VPCs and on-prem links without a tangle of point-to-point connections. For the link to your own datacenter you choose between a VPN over the public internet (cheaper, encrypted, variable latency) and Direct Connect, a dedicated private line that costs more but delivers consistent performance. Private Link exposes a single service privately without opening up a whole network, which is how you let a partner reach one API and nothing else.
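The scaling difference between peering and a central hub comes down to simple counting. The sketch below assumes every network must reach every other one and that peering links are not transitive, so each pair needs its own link; the function names are hypothetical.

```python
def full_mesh_peerings(networks: int) -> int:
    """Point-to-point links needed so every pair is directly connected."""
    return networks * (networks - 1) // 2


def hub_attachments(networks: int) -> int:
    """Attachments needed when every network connects to one central hub."""
    return networks


for n in (3, 10, 50):
    print(f"{n} networks: {full_mesh_peerings(n)} peerings "
          f"vs {hub_attachments(n)} hub attachments")
```

The mesh grows quadratically while the hub grows linearly, which is the point at which a hub starts to pay for itself.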
The perimeter topics decide who gets in. A DMZ is the buffer zone where internet-facing services live, separated from your internal systems. A Bastion Host or Jump Server is the single hardened door through which administrators reach private machines, so you audit one entry point instead of exposing SSH on everything. Network Segmentation is the broader discipline of splitting your network so a breach in one zone cannot spread to the rest. These are not optional polish; segmentation is what turns a single compromised box into a contained incident rather than a company-ending one.
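A minimal sketch of segmentation as a default-deny allow-list between zones follows. The zone names, ports, and the `ALLOWED` set are hypothetical; real firewalls and security groups carry more detail (protocols, address ranges, statefulness).

```python
from typing import NamedTuple, Set


class Flow(NamedTuple):
    src_zone: str
    dst_zone: str
    port: int


# Hypothetical allow-list: anything not listed here is denied.
ALLOWED: Set[Flow] = {
    Flow("internet", "dmz", 443),     # public traffic terminates in the DMZ
    Flow("dmz", "app", 8080),         # DMZ forwards to internal services
    Flow("app", "data", 5432),        # only the app tier talks to the database
    Flow("admin-vpn", "bastion", 22), # administrators enter through one door
    Flow("bastion", "app", 22),       # and reach private machines from there
}


def is_allowed(flow: Flow) -> bool:
    """Default deny: a flow passes only if it is explicitly listed."""
    return flow in ALLOWED


print(is_allowed(Flow("internet", "dmz", 443)))    # True
print(is_allowed(Flow("internet", "data", 5432)))  # False: no direct path to data
print(is_allowed(Flow("internet", "app", 22)))     # False: SSH only via the bastion
```

Because the list is short and explicit, it doubles as the thing you audit: every path into a sensitive zone is one line.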
The last group is about distance and speed. Regions are geographic clusters of datacenters, and Availability Zones are the isolated locations inside a region that let you survive a single datacenter failure. Spreading across zones buys you resilience; spreading across regions buys you both resilience and lower latency for distant users, at the cost of complexity and data-transfer fees. Edge Computing and Fog Computing push compute closer to users and devices to cut that latency further. On the wire, QUIC and TCP Optimization reduce round trips, WebRTC enables real-time peer connections, and a Service Mesh adds retries, encryption, and observability between your services without changing application code.
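The resilience gained from spreading across zones can be approximated with basic probability. The sketch assumes each zone fails independently and uses a made-up per-zone availability of 99%; real failures are often correlated, so treat the result as an upper bound rather than a guarantee.

```python
def combined_availability(single: float, copies: int) -> float:
    """Probability that at least one of `copies` independent zones is up."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies must be down at once for the service to be down.
    return 1.0 - (1.0 - single) ** copies


# Hypothetical per-zone availability of 99%.
for zones in (1, 2, 3):
    print(zones, round(combined_availability(0.99, zones), 6))
```

Each added zone multiplies the chance of total failure by the per-zone failure rate, which is why the second zone buys far more than the third.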
How Real Companies Run This
Netflix runs across multiple regions and availability zones precisely so that losing one datacenter, or even one region, does not take the service down, and they famously break their own infrastructure on purpose to prove the failover works. Their architecture leans on VPCs, careful segmentation, and edge placement so a stream starts fast no matter where you are.
Most large enterprises connect their datacenter to the cloud with Direct Connect for the predictable, low-latency private path it gives, then fall back to a Site-to-Site VPN as an encrypted backup. Banks and healthcare companies lean hard on network segmentation, DMZs, and bastion hosts because regulators require that a breach in one zone cannot reach patient or financial data in another. Fast-moving startups go the other direction and live on serverless and PaaS so a tiny team can ship without an operations group, accepting cold starts and provider lock-in as the price of speed.
The pattern underneath all of it is the same. Companies match the level of control to the actual constraint. They use the most managed option that still meets their latency, compliance, and cost needs, and they spend their scarce engineering attention on the boundaries between systems, because that is where both the outages and the breaches happen.