Stream and Batch Processing
Every fraudulent charge a bank blocks, every "trending now" list, every dashboard a finance team trusts at quarter close comes down to one question: do you process data after it lands, or as it arrives? Get that choice wrong and you either pay for fraud you could have caught in 200 milliseconds, or you build a real-time pipeline so complex it falls over the first time events arrive out of order. Stream and batch processing is the set of techniques for moving, shaping, and computing over large volumes of data, and knowing when each one fits is what separates a pipeline that survives production from one that pages you at 3 AM.
This category covers both ends of that spectrum and the architectures that blend them. You will learn the building blocks of batch (batch processing, batch analytics, MapReduce, ETL, ELT) and the harder discipline of streaming, where data never stops and you have to reason about time itself: event time versus processing time, windows, watermarks, late and out-of-order data, and how to keep state correct across failures with checkpointing and exactly-once semantics. It closes with the system-level patterns, Lambda and Kappa architecture, plus the governance and organizational layers (data catalog, lineage, data mesh, data fabric) that keep all of this usable at scale.
What Stream and Batch Processing Actually Means
Batch processing collects data into a bounded set and computes over all of it at once, usually on a schedule. You run a job at 2 AM that reads yesterday's orders, joins them against the product table, and writes a sales report. The defining trait is that the input is finite and known before the job starts. MapReduce was the model that made this idea scale across thousands of machines, and the modern descendants (Spark, BigQuery, Snowflake) still follow the same shape: read a lot, transform, write a lot. Batch analytics is what most companies run for reporting, finance, and machine learning training, because correctness and cost matter more than freshness.
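As a rough illustration of the "read a lot, transform, write a lot" shape, here is a minimal sketch of the nightly sales report in plain Python. The record fields (`product_id`, `quantity`, `price`) and the function name `daily_sales_report` are assumptions made for the example, not part of any particular system.

```python
from collections import defaultdict

def daily_sales_report(orders, products):
    """Batch job sketch: the whole input is bounded and known before we start."""
    # Build the lookup side of the join once (hypothetical field names).
    price_by_id = {p["product_id"]: p["price"] for p in products}
    revenue = defaultdict(float)
    for order in orders:
        # Join each order against the product table and aggregate.
        revenue[order["product_id"]] += price_by_id[order["product_id"]] * order["quantity"]
    return dict(revenue)

# "Yesterday's orders": a finite set, fully available when the job runs.
orders = [
    {"order_id": 1, "product_id": "A", "quantity": 2},
    {"order_id": 2, "product_id": "B", "quantity": 1},
    {"order_id": 3, "product_id": "A", "quantity": 1},
]
products = [
    {"product_id": "A", "price": 10.0},
    {"product_id": "B", "price": 25.0},
]

report = daily_sales_report(orders, products)
print(report)
```

A real batch engine distributes the same read, join, and aggregate steps across many machines, but the job still starts with a finite input and ends with a finished output.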
Stream processing computes over an unbounded sequence of events as they arrive. There is no end to the input. A payment event shows up, you check it against the last hour of activity, and you decide in milliseconds. Because the data never stops, you cannot wait for all of it, so you carve the stream into windows and compute partial answers continuously. Stateless stream processing handles each event independently (filter, map, enrich), while stateful stream processing remembers things across events (running counts, sessions, joins), which is where the real engineering lives.
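The difference between stateless and stateful operators can be sketched with two Python generators. The event fields (`card`, `amount`) and both function names are hypothetical, and a real engine would keep the state in a fault-tolerant store rather than a local dictionary.

```python
from collections import defaultdict

def only_large(events, threshold):
    """Stateless: each event is judged on its own, nothing is remembered."""
    for event in events:
        if event["amount"] >= threshold:
            yield event

def running_totals(events):
    """Stateful: keeps a per-card total that survives from one event to the next."""
    totals = defaultdict(float)  # this dictionary is the operator's state
    for event in events:
        totals[event["card"]] += event["amount"]
        yield event["card"], totals[event["card"]]

# Hypothetical payment events, consumed one at a time as they "arrive".
payments = [
    {"card": "c1", "amount": 40.0},
    {"card": "c2", "amount": 500.0},
    {"card": "c1", "amount": 70.0},
]

large = list(only_large(payments, threshold=100.0))
totals = list(running_totals(payments))
print(large)
print(totals)
```

The stateless filter can be restarted anywhere without loss, while the running total is only correct if its dictionary survives a crash, which is why state is where the engineering effort goes.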
Micro-batching sits in the middle. Instead of acting on every single event, you collect events for a short interval, from a few hundred milliseconds to a few seconds, and process that tiny batch. Spark Streaming popularized this, and Spark Structured Streaming still uses it as its default execution mode. You trade a little latency for much simpler fault recovery and higher throughput, which is often the right deal.
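A minimal sketch of the idea, assuming events arrive as `(arrival_ms, payload)` pairs already sorted by arrival time. The function name `micro_batches` is hypothetical, and arrival times are simulated integers so the example is deterministic.

```python
def micro_batches(events, interval_ms):
    """Group (arrival_ms, payload) pairs into batches, one per fixed interval.

    Assumes events are sorted by arrival time. Intervals with no events
    produce no batch.
    """
    current_bucket, batch = None, []
    for arrival_ms, payload in events:
        bucket = arrival_ms // interval_ms
        if batch and bucket != current_bucket:
            yield batch          # the interval ended: hand off the tiny batch
            batch = []
        current_bucket = bucket
        batch.append(payload)
    if batch:
        yield batch              # flush whatever is left at the end

events = [(0, "a"), (120, "b"), (480, "c"), (510, "d"), (1700, "e")]
batches = list(micro_batches(events, interval_ms=500))
print(batches)
```

Each yielded list can be processed like a small batch job, so a failed batch can simply be retried as a unit.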
Moving Data: ETL, ELT, and the Pipeline
Before you can compute, you have to move and shape data, and the two dominant patterns are ETL and ELT. ETL (extract, transform, load) cleans and reshapes data in a dedicated processing layer, then loads the finished result into the warehouse. It was the standard when storage and compute were expensive and tightly coupled, because you only stored what you needed. An ETL pipeline is a chain of these steps, often orchestrated by a scheduler with retries, dependencies, and alerting.
ELT (extract, load, transform) flips the last two steps. You dump raw data into a cheap, scalable warehouse first, then transform it in place using the warehouse's own compute. Cloud warehouses made this the default for most new builds, because separating storage from compute means you can keep the raw data forever, reprocess it when business logic changes, and let SQL-fluent analysts own transformations instead of a dedicated engineering team. The trade-off is that you store and pay for raw data you may never use, and governance becomes harder because anyone can transform anything.
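To make the ELT ordering concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The table names (`raw_orders`, `clean_orders`) and columns are assumptions for the example. In an ETL version, the filtering and casting would happen in application code before the insert, and the bad row would never be stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse

# Extract + Load: land the raw data as-is, including a malformed amount.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents TEXT, status TEXT)")
raw_rows = [(1, "1250", "paid"), (2, "bad", "paid"), (3, "400", "refunded"), (4, "800", "paid")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: done afterwards, in place, with the warehouse's own SQL engine.
conn.execute("""
    CREATE TABLE clean_orders AS
    SELECT order_id, CAST(amount_cents AS INTEGER) AS amount_cents
    FROM raw_orders
    WHERE status = 'paid'
      AND amount_cents != ''
      AND amount_cents NOT GLOB '*[^0-9]*'   -- keep digit-only amounts
""")

total_paid = conn.execute("SELECT SUM(amount_cents) FROM clean_orders").fetchone()[0]
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(total_paid, raw_count)
```

Because `raw_orders` is kept untouched, a change in business logic (for example, counting refunds) only requires rebuilding `clean_orders`, not re-extracting from the source.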
This is also where the governance lessons earn their place. A data catalog tells people what data exists and what it means. Metadata management keeps the descriptions, schemas, and ownership current. Data lineage traces a number on a dashboard back through every transformation to its source, which is the difference between fixing a bad metric in an hour versus a week. None of these compute anything, but without them a large pipeline becomes a system nobody trusts.
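Lineage is essentially a graph from each dataset to the datasets it was built from. The sketch below assumes a hand-written, acyclic mapping with made-up dataset names. Real lineage tools collect this graph automatically, but the traversal from a dashboard number back to its sources follows the same idea.

```python
# Hypothetical lineage graph: each dataset maps to the datasets it reads from.
lineage = {
    "dashboard.revenue": ["mart.daily_sales"],
    "mart.daily_sales": ["clean.orders", "clean.products"],
    "clean.orders": ["raw.orders"],
    "clean.products": ["raw.products"],
}

def upstream_sources(dataset, lineage):
    """Return the root datasets (no recorded parents) that feed `dataset`.

    Assumes the graph is acyclic.
    """
    parents = lineage.get(dataset, [])
    if not parents:
        return {dataset}
    roots = set()
    for parent in parents:
        roots |= upstream_sources(parent, lineage)
    return roots

sources = upstream_sources("dashboard.revenue", lineage)
print(sorted(sources))
```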
The Hard Part of Streaming: Time, Windows, and State
In batch, time is simple because the data is already complete. In streaming, time is the central problem. Event time is when something actually happened (the timestamp on the click). Processing time is when your system saw it. These drift apart constantly: a phone goes through a tunnel, buffers events, and delivers them ten minutes late. If you compute on processing time you get fast answers that can be wrong, because late events are counted in the period when they arrived rather than the period when they happened; if you compute on event time you get correct answers but have to handle data that arrives out of order.
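The gap between the two clocks is easy to see with a handful of made-up events. In this sketch each event is an `(event_time_s, processing_time_s)` pair, and the third one is the buffered event from the tunnel: it happened at second 55 but was only seen at second 130.

```python
from collections import Counter

# Hypothetical events as (event_time_s, processing_time_s) pairs.
events = [(10, 11), (50, 52), (55, 130), (70, 71)]

def counts_per_minute(timestamps):
    """Count how many timestamps fall into each one-minute bucket."""
    return dict(Counter(t // 60 for t in timestamps))

by_event_time = counts_per_minute(e for e, _ in events)
by_processing_time = counts_per_minute(p for _, p in events)

print(by_event_time)       # minute 0 holds the three events that happened in it
print(by_processing_time)  # the delayed event is attributed to minute 2
```

Both results are produced from the same four events. Only the event-time version reports what actually happened in the first minute.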
Windows are how you bound an infinite stream into computable chunks. Tumbling windows are fixed and non-overlapping (every 5 minutes, no gaps, no overlap). Hopping windows overlap by advancing on a smaller step than their size. Sliding windows recompute on every event for a rolling period. Session windows group bursts of activity separated by gaps of inactivity, which is how you measure a user's session without a fixed length. These split further into time-based windows and count-based windows depending on whether you bound by clock or by number of events.
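The window types differ mainly in how a timestamp is assigned to one or more windows. The following sketch shows that assignment for tumbling, hopping, and session windows. The function names are hypothetical, timestamps are non-negative integers, and windows are half-open intervals `[start, end)` aligned to zero.

```python
def tumbling_window(ts, size):
    """The single fixed window that contains ts."""
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, step):
    """Every window of length `size`, starting on a multiple of `step`, that contains ts."""
    windows = []
    start = ts - ts % step                 # latest aligned start at or before ts
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= step
    return sorted(windows)

def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)        # still within the current burst
        else:
            sessions.append([ts])          # inactivity gap exceeded: new session
    return sessions

print(tumbling_window(12, size=5))
print(hopping_windows(12, size=10, step=5))
print(session_windows([1, 3, 4, 20, 22, 50], gap=5))
```

With tumbling windows each event belongs to exactly one window, with hopping windows it can belong to several, and with session windows the boundaries depend on the data itself.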
Watermarking is the mechanism that makes event-time windows finishable. A watermark is the system's assertion that it has probably seen all events up to a given time, so a window can close and emit its result. Late data handling and out-of-order processing decide what happens to stragglers that arrive after the watermark: drop them, update the result, or route them to a side output. Keeping all of this correct across machine failures requires state management in streams, checkpointing strategies to snapshot that state durably, and exactly-once semantics so that a crash and replay do not double-count a payment. Stream joins, both stream-to-stream and stream-to-table, are how you combine multiple live feeds, often with change streams supplying the table side, and they are the most state-heavy operations of all.
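Here is a minimal sketch of a watermark closing tumbling windows and routing stragglers to a side output. It assumes one common heuristic, a watermark equal to the largest event time seen so far minus a fixed out-of-orderness bound. The class name `WatermarkedCounter` and its fields are hypothetical and do not correspond to any specific engine's API.

```python
from collections import defaultdict

class WatermarkedCounter:
    """Counts events per tumbling event-time window, closing windows by watermark."""

    def __init__(self, window_size, max_out_of_orderness):
        self.size = window_size
        self.delay = max_out_of_orderness
        self.max_event_time = None
        self.open = defaultdict(int)  # window start -> running count
        self.closed = {}              # window start -> emitted result
        self.late = []                # side output for events past the watermark

    def watermark(self):
        # Assumed heuristic: everything older than this has probably arrived.
        return self.max_event_time - self.delay

    def process(self, event_time):
        start = event_time - event_time % self.size
        # The event's window already closed: route it to the side output.
        if self.max_event_time is not None and start + self.size <= self.watermark():
            self.late.append(event_time)
            return
        self.open[start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Close and emit every window that ends at or before the watermark.
        for window_start in sorted(self.open):
            if window_start + self.size <= self.watermark():
                self.closed[window_start] = self.open.pop(window_start)

counter = WatermarkedCounter(window_size=10, max_out_of_orderness=5)
for t in [1, 4, 12, 8, 17, 3, 26]:   # 8 is out of order but in time; 3 is too late
    counter.process(t)

print(counter.closed, counter.late, dict(counter.open))
```

The event at time 8 arrives after 12 but before the watermark passes 10, so it is still counted. The event at time 3 arrives after window `[0, 10)` has been emitted, so it lands in the side output instead of silently changing a published result.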
Choosing an Architecture and the Trade-offs
Once you have both batch and streaming, you have to decide how they coexist, and there are two well-known answers. Lambda architecture runs a batch layer for accurate, complete results and a speed layer for fast, approximate ones, then merges them at query time. It gives you correctness and freshness, but you pay by writing and maintaining the same business logic twice, once in batch code and once in streaming code, and keeping them in agreement forever.
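The merge at query time can be sketched as a lookup over two views. This assumes the simplest possible case: an additive metric (page views), a batch view that is complete up to the last batch run, and a speed view that covers only events after that cutoff. All names and numbers are made up for illustration.

```python
def lambda_query(key, batch_view, speed_view):
    """Serving-layer sketch: batch result plus the speed layer's delta since the cutoff."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Hypothetical views of page-view counts.
batch_view = {"page_a": 1000, "page_b": 250}   # from the last batch run
speed_view = {"page_a": 12, "page_c": 3}       # events since the batch cutoff

for page in ("page_a", "page_b", "page_c"):
    print(page, lambda_query(page, batch_view, speed_view))
```

The merge itself is trivial. The cost sits in the two separate jobs that must produce `batch_view` and `speed_view` with identical counting rules.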
Kappa architecture removes the batch layer entirely. Everything is a stream, and when you need to recompute history you replay the event log through the same streaming code. This eliminates the duplicate-logic problem and is cleaner to operate, but it demands a durable, replayable log (Kafka and similar) and a streaming engine strong enough to handle both live and historical loads. The practical rule: reach for streaming when the decision must happen now (fraud, alerting, live dashboards, personalization), reach for batch when correctness and cost dominate (billing, reporting, model training), and choose Kappa over Lambda unless you have a genuine reason your batch and stream logic must differ.
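Replay is the core move in Kappa, and it can be sketched with an in-memory list standing in for the durable log. The event fields, the `run_stream_job` helper, and both handlers are hypothetical. The point is that new business logic is applied to history by running the same streaming code path from offset 0.

```python
# Stand-in for a durable, replayable log; the list index plays the role of the offset.
event_log = [
    {"user": "u1", "type": "view"},
    {"user": "u1", "type": "purchase"},
    {"user": "u2", "type": "view"},
    {"user": "u2", "type": "purchase"},
    {"user": "u1", "type": "purchase"},
]

def run_stream_job(log, handler, from_offset=0):
    """Feed events to a handler one at a time, starting at the given offset."""
    state = {}
    for event in log[from_offset:]:
        handler(state, event)
    return state

def count_all_events(state, event):
    """Version 1 of the logic: count every event per user."""
    state[event["user"]] = state.get(event["user"], 0) + 1

def count_purchases(state, event):
    """Version 2 of the logic: count only purchases."""
    if event["type"] == "purchase":
        state[event["user"]] = state.get(event["user"], 0) + 1

v1 = run_stream_job(event_log, count_all_events)
# Logic changed: recompute history by replaying the same log through the new handler.
v2 = run_stream_job(event_log, count_purchases)
print(v1, v2)
```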
At the largest scale, the question stops being about jobs and becomes about ownership. Data mesh treats data as a product owned by the teams that produce it, decentralizing pipelines instead of funneling everything through one central data team that becomes a bottleneck. Data fabric is the complementary idea of a unified access and integration layer over data that lives in many places. These are organizational and platform patterns, not processing engines, but at enterprise scale they decide whether your stream and batch investments actually get used.
How Real Companies Use This
Netflix runs a Kappa-style backbone on Kafka and Flink, processing trillions of events a day to power recommendations, A/B test results, and operational alerting, while keeping heavy batch jobs for model training and long-range analytics. The event log is the source of truth, and reprocessing means replaying it.
Uber's surge pricing and ETA models depend on stream processing with tight event-time windows and watermarks, because a price computed on stale or out-of-order GPS pings is worse than useless. They pair this with massive batch pipelines for billing reconciliation and analytics, a textbook split of streaming for the live decision and batch for the audited, correct number.
Financial systems lean on exactly-once semantics and checkpointing the hardest, because a duplicated or dropped transaction is not a metric error; it is money. They typically run a streaming path for fraud detection and instant alerts, and a batch path for end-of-day settlement, with strong data lineage so every figure is traceable for auditors. The common thread across all three is that they do not pick stream or batch. They pick both, and the skill is knowing which problem belongs to which.