What’s the difference between batch and streaming data processing?
Cindy Ng
Sr. Manager, Content
Starburst
Batch and streaming data processing are techniques companies use to analyze data from very different sources. Although it dates back to the era of mainframe computing, batching remains an effective means for processing very large datasets. Streaming is relatively new and lets organizations take advantage of the large volumes of sparse data generated by today’s enterprise systems.
This article will discuss these approaches to data processing, their use cases, and data management implications. Note that batches and streams are not mutually exclusive and are important data sources for modern Icehouse data architectures.
What is batch processing?
Batch processing is the periodic collection and transformation of large datasets in groups or batches. This highly automated technique efficiently uses compute resources and often runs outside of business hours so as not to compete with daily operations.
Batch processing: Use cases and key features
Batch processing works best for jobs where compute-intensive, repetitive workflows must move or transform large amounts of data. Some use cases for batch processing include:
Data management: Generating backups and archives involves copying and moving large amounts of data.
Creating data repositories: ETL data pipelines ingest data into data warehouses and data lakes.
Big data analytics: Business dashboards and computationally complex analysis projects rely on processing large datasets.
Machine learning: Batches consolidate large volumes of data from multiple data sources to train machine learning algorithms.
Financial data processing: Scheduled batches are ideal for periodic financial data aggregation jobs like generating credit card statements.
Note that these use cases share a common property: their outputs are not time-sensitive within a certain window. As a result, their input data can remain untouched for a period of time, allowing data teams to schedule jobs to run as overnight or weekend batches, when they won’t compete for compute resources.
Once scheduled, however, batch runs must complete these jobs on time. They involve such large amounts of data and consume so many resources that rerunning a batch would disrupt other business operations.
Batch processing technologies and tools
Hadoop data processing systems use frameworks like Apache Hive or Apache Spark to reliably batch process data at scale. Hive simplifies Hadoop MapReduce so engineers can build their jobs with its SQL-like query language. Spark bypasses MapReduce entirely, running SQL queries through its own in-memory engine directly against data in the Hadoop Distributed File System (HDFS).
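To make this concrete, here is a minimal sketch of a Spark batch job that rolls up one day’s raw events into a summary table. It assumes PySpark with Hive metastore support enabled; the table, column, and partition names are hypothetical.

```python
# Minimal PySpark batch job sketch: aggregate a day's raw events into a summary table.
# Table and column names (raw.order_events, analytics.daily_order_totals) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("nightly-order-rollup")
    .enableHiveSupport()  # read tables registered in the Hive metastore
    .getOrCreate()
)

# Read the full day's partition in one pass -- a bounded, discrete dataset.
orders = spark.table("raw.order_events").where(F.col("order_date") == "2024-06-01")

daily_totals = (
    orders.groupBy("region")
          .agg(F.count("*").alias("order_count"),
               F.sum("order_amount").alias("total_revenue"))
)

# Overwrite the day's summary so rerunning the batch is idempotent.
daily_totals.write.mode("overwrite").saveAsTable("analytics.daily_order_totals")
```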
Batch processing with Starburst and Trino
Starburst’s open data lakehouse analytics platform enhances Trino’s open source SQL query engine to streamline batch processing.
For example, Starburst simplifies data migration from legacy Hadoop systems to cloud-based object storage services like Amazon S3. The Starburst Hive connector abstracts the legacy system’s table and file formats to easily transfer data from HDFS to Iceberg tables on the object storage platform.
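As an illustration, a migration step like this can often be expressed as a single CREATE TABLE AS SELECT that reads through the Hive connector and writes to an Iceberg catalog. The sketch below issues the statement through the open source trino Python client; the hostname and the hive_hdfs and iceberg_s3 catalog, schema, and table names are hypothetical.

```python
# Sketch: copy a Hive/HDFS table into an Iceberg table on object storage with a
# single CREATE TABLE AS SELECT, issued through the trino Python client.
# Catalog, schema, and table names (hive_hdfs, iceberg_s3, ...) are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",  # hypothetical coordinator hostname
    port=443,
    user="data_engineer",
    http_scheme="https",
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE iceberg_s3.analytics.web_events
    WITH (format = 'PARQUET')
    AS SELECT * FROM hive_hdfs.legacy.web_events
""")
cur.fetchall()  # CTAS returns the number of rows written
```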
Fault tolerance in Starburst makes ETL pipelines more reliable. Enabling batch mode allows Starburst clusters to recover from interruptions like memory overflows or network congestion and resume a query without losing data.
Starburst also reduces the need for batch operations by replacing ETL pipelines with interactive queries. Since Trino uses ANSI-standard SQL, data consumers can query data themselves without needing engineers to build ETL pipelines.
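For example, an analyst might run an ad-hoc aggregate directly against the lakehouse rather than waiting for a pipeline to be built. The sketch below assumes the trino Python client; the connection details, catalog, and table names are illustrative.

```python
# Sketch: a data consumer running an ad-hoc ANSI SQL query directly against the
# lakehouse instead of waiting for an engineered ETL pipeline.
# Connection details and catalog/schema/table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, user="analyst", http_scheme="https",
    catalog="iceberg_s3", schema="analytics",
)
cur = conn.cursor()
cur.execute("""
    SELECT region, count(*) AS orders, sum(order_amount) AS revenue
    FROM web_events
    WHERE event_type = 'purchase' AND event_date >= DATE '2024-06-01'
    GROUP BY region
    ORDER BY revenue DESC
""")
for region, orders, revenue in cur.fetchall():
    print(region, orders, revenue)
```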
What is stream processing?
Stream processing is a mechanism for collecting and transforming the output generated by continuous data sources like social media, e-commerce clickstreams, and Industrial Internet of Things (IIoT) sensors. Real-time processing introduces minimal latency and enables the near real-time analysis of data streams.
Stream processing: Use cases and key features
Stream processing works with continuous data sources that generate large amounts of sparse data: a stream with no beginning or end in which only a small fraction of the data points are valuable.
Some use cases for stream processing include:
Anomaly detection: Transaction and activity logging systems generate a large amount of data, some of which could indicate anomalous behavior. Near real-time data analysis supports fraud detection in the financial industry and cybersecurity across all industries.
Predictive analysis: IIoT sensors report environmental information and the status of industrial processes. Stream processing collects this data for predictive models that anticipate equipment failure or process variances.
Customer experience: Analyzing e-commerce clickstreams in real-time lets automated systems tailor optimal customer experiences that maximize sales and customer satisfaction.
Market trading: Automated trading systems operate orders of magnitude faster than individual traders and rely on real-time market, weather, and other data streams.
These use cases are significantly more time-sensitive than those for batch processing. They rely on near real-time insights to drive effective actions, so they prioritize low latency. In market trading, a few milliseconds’ delay could cause significant trading losses. Automated marketing systems, which only need to respond on human timescales, can tolerate latencies measured in seconds.
Technologies and tools: Streaming ingestion and processing
Open-source frameworks like Apache Kafka and Apache Flink let organizations build stream processing systems. For companies already using Apache Spark for batch processing, the framework’s Structured Streaming feature provides a unified API for processing data streams.
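As a rough sketch, a Structured Streaming job that reads a Kafka topic and continuously appends parsed events to object storage might look like the following. The broker address, topic, schema, and paths are hypothetical, and the job assumes the Spark Kafka connector package is available on the cluster.

```python
# Sketch: Spark Structured Streaming reading a Kafka topic and continuously
# appending parsed events to object storage. Broker address, topic name, schema,
# and paths are hypothetical. Requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.com:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-bucket/clickstream/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/clickstream/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```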
Stream processing with Starburst and Trino
Stream processing systems’ output often goes to a permanent repository for near real-time analysis or longer-term analytics. The challenge is ingesting these streams at scale without relying on complex code or risking data duplication.
Starburst’s streaming ingestion solution automatically ingests data streams from Kafka-compliant topics and writes the data to Iceberg tables in an open data lakehouse with exactly-once guarantees. This process converts the incoming streams into micro-batches for copying into the Iceberg table, which results in tables containing many very small files. Trino’s Iceberg connector lets data teams programmatically compact the Iceberg tables to contain fewer, larger files.
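Compaction can be scheduled as a lightweight job of its own. The sketch below runs the Iceberg connector’s optimize command through the trino Python client; the catalog, table name, and size threshold are illustrative.

```python
# Sketch: periodically compact the many small files produced by micro-batch
# streaming ingestion into fewer, larger files with the Iceberg connector's
# optimize command. Catalog/table names and the size threshold are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, user="data_engineer", http_scheme="https",
)
cur = conn.cursor()

# Rewrite files smaller than the threshold into larger ones.
cur.execute("""
    ALTER TABLE iceberg_s3.analytics.clickstream
    EXECUTE optimize(file_size_threshold => '128MB')
""")
cur.fetchall()
```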
Key differences between batch processing and stream processing
These two data processing approaches serve different purposes and are not mutually exclusive. Most companies use both in their data architectures. Key differences between batch and stream processing include:
Bounded vs. unbounded data
Batches process complete, discrete datasets, which makes scheduling during periods of low resource utilization possible. Data streams have no beginning or end, so stream processing operates continuously.
High vs. low latency
For batch jobs, latency is less critical than reliability. Whether it takes an hour or a weekend to process payroll doesn’t matter, as long as the job completes successfully and on schedule. By contrast, stream processing use cases depend on receiving data as quickly as possible to close trades or catch security breaches.
Historical vs. real-time analytics
Batch processing is closely associated with historical analysis, as in financial reporting and machine learning. These applications don’t require the most recent data possible since they seek long-term business insights. Analyzing real-time data streams lets organizations react to events without delay, often in conjunction with predictive models.
Implications for data ingestion and processing
The two approaches also have different implications for how data teams ingest and process data for their repositories.
Batch processing implications
Batch ingestion periodically lands large chunks of data, which causes short-term spikes in network and storage utilization. Moving petabyte-scale batches from source to destination congests networks and requires carving out large staging areas in storage. The resource impact of ingesting such large amounts of data is why organizations typically schedule batches for when network and system utilization rates are low.
Once ingested, a batch is processed all at once, even if complex jobs take hours or even days to complete. As a result, batch processing creates spikes in demand for compute resources. Without the scalability of a cloud services provider, organizations would overinvest to meet peak demand.
Because of its high-latency, periodic nature, batch processing depends on careful scheduling to ensure jobs complete on time without disrupting operational systems.
Stream processing implications
Unlike flexible, discontinuous batch processes, streaming processes are constant and must meet low-latency service levels. As a result, IT departments must invest in baseline capacity for network, compute, and storage.
Icehouse for data ingestion and data processing
An Icehouse architecture builds on Iceberg’s open table format, Trino’s massively parallel processing engine, and cloud object storage. Unlike data lakes or data warehouses, the Icehouse is not a central repository for all enterprise data. Trino’s connectors let organizations federate their data architecture, consolidating critical data in object storage while leaving infrequently accessed data at the source. ANSI-standard SQL queries can use one statement to access multiple sources simultaneously.
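For example, a single statement can join an Iceberg table in object storage with a table left in an operational database. The sketch below assumes the trino Python client and hypothetical iceberg_s3 and postgres_crm catalogs.

```python
# Sketch: one ANSI SQL statement joining an Iceberg table in object storage with
# a table left at its source (here a hypothetical PostgreSQL catalog).
# All catalog, schema, and table names are illustrative.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, user="analyst", http_scheme="https",
)
cur = conn.cursor()
cur.execute("""
    SELECT c.customer_segment,
           sum(o.order_amount) AS revenue
    FROM iceberg_s3.analytics.orders AS o
    JOIN postgres_crm.public.customers AS c
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_segment
""")
print(cur.fetchall())
```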
Founded by Trino’s developers, Starburst enhances the query engine’s core features with performance, management, and governance optimizations to streamline Icehouse administration.
A Starburst Icehouse replaces a data warehouse’s complex, difficult-to-manage ETL pipelines and rigid schema-on-write structure with ELT workflows and a schema-on-read paradigm. Whether ingesting data from streaming sources or in batches, Starburst lands raw data in Iceberg tables. Queries apply schema and transform data at runtime to keep data as useful as possible for future use cases.
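As a simple illustration of schema-on-read, raw events can land in an Iceberg table with their original JSON payload intact, and queries apply structure only when the data is read. The table and field names below are hypothetical, and the example assumes the trino Python client.

```python
# Sketch of schema-on-read: raw events land in an Iceberg table with the original
# JSON payload preserved, and queries apply structure at read time.
# Connection details, table names, and JSON fields are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com", port=443, user="analyst", http_scheme="https",
)
cur = conn.cursor()
cur.execute("""
    SELECT json_extract_scalar(payload, '$.user_id')                      AS user_id,
           CAST(json_extract_scalar(payload, '$.amount') AS DOUBLE)       AS amount
    FROM iceberg_s3.raw.events
    WHERE json_extract_scalar(payload, '$.event_type') = 'purchase'
""")
rows = cur.fetchall()
```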
Starburst streamlines the management of your Icehouse. Automatic capacity management features like auto-scaling, smart indexing, and enhanced fault-tolerant execution optimize Trino clusters. Data compaction, profiling, and other data lake optimization resources let you automate object storage management. Finally, Starburst’s universal discovery and governance resource, Gravity, automates cataloging and the enforcement of fine-grained role and attribute-based access controls.