Streaming Data
What are the benefits of streaming data?
“Traditional” big data analytics rose from the intersection of business intelligence and data science. Its purpose is to generate valuable insights from the vast quantities of historical data companies store in central repositories like data warehouses and data lakes. Streaming analytics emerged from the operational side of the business, where managers use real-time insights to manage things like manufacturing processes and network defenses. As the two data architectures converge, organizations are finding significant benefits in adding streaming data to their analytics infrastructure.
Increasing responsiveness
Streaming analytics lets companies continuously monitor data streams for anomalies, customer actions, and other events. Shortening reaction times helps mitigate adverse events like security breaches or equipment malfunctions before their impacts grow more costly.
Becoming a proactive organization
Traditional analytics is inherently backward-looking because it draws on a data repository’s historical datasets. Real-time streaming data can inform forward-looking machine learning projects. These predictive algorithms help companies anticipate future events, proactively mitigate risks, and seize opportunities.
Improving decision-making efficiency
Companies can feed real-time data from various sources into dashboards and other visualization tools. Ready access to the current state of KPIs and other business metrics empowers decision-makers to take effective action quickly. Integrating streaming data with analytics also lets companies automate many classes of routine decisions, freeing employees to focus on higher-order activities.
Real-time data and streaming analytics use cases
The benefits of real-time streaming analytics support use cases across industries, including:
Stock market trading: Milliseconds count in modern electronic trading. Rather than waiting for analysts to pull historical data from a data warehouse, trading firms use automated algorithms to act immediately on high volumes of market data. These sophisticated trading systems combine historical analysis with real-time evaluations of changing market conditions to decide which stocks to buy and sell.
Fraud detection in financial services: Besides improving financial performance, stream processing algorithms strengthen risk management and regulatory compliance. For example, when algorithms monitoring high-throughput streams of credit card transactions detect unusual events that may be the result of fraud, they automatically notify customers to review the anomalous transactions (see the sketch after this list). Financial institutions also use stream processing to identify signs of money laundering and other criminal activities for quick investigation.
Healthcare and patient monitoring: Patient monitoring is no longer limited to the hospital as portable and mobile technologies allow constant monitoring at home or the point of care. Besides improving patient quality of life, these real-time monitors provide rich datasets for evaluating treatments and medical technologies.
E-commerce and inventory management: Inventory risk has an enormous impact on retail success, especially as omnichannel strategies unify e-commerce, brick-and-mortar, and other sales channels. Key to this strategy is the shift from historical sales forecasting to predictive demand forecasting. Streaming analytics helps retailers anticipate demand and allocate inventory for thousands of products to maximize in-stocks and sales while minimizing expensive markdowns.
Customer experience and personalized recommendations: Customer 360 is another critical success factor for omnichannel retail strategies. Retailers use hundreds of metrics to monitor and understand consumer behavior across all channels. Real-time sources like social media and website clickstreams provide data that can trigger automated responses with personalized experiences like pricing, promotions, and email campaigns.
Predictive maintenance: Modern factories use industrial IoT devices to measure equipment and process performance continuously. Patterns in this constant stream of data let algorithms recognize variances that could signal maintenance needs or imminent equipment failure. Maintenance systems can automatically order spare parts and schedule technician visits to avoid or minimize expensive downtime while optimizing staffing.
Supply chain: Today’s supply chains rely upon information from complex webs of third-party suppliers and distributors. Real-time analysis of shipment location data, transit times, pricing, and other information lets companies improve visibility into inventories around the world while maximizing the efficiency of their procurement and logistics systems.
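To make the fraud detection pattern concrete, here is a minimal Python sketch. It assumes a Kafka topic named card-transactions carrying JSON events and uses the open-source kafka-python client; the field names and threshold are purely illustrative, not a production rule set.

```python
# Minimal sketch of stream-based fraud screening (illustrative only).
# Assumes a Kafka topic named "card-transactions" carrying JSON events;
# field names and the spending threshold are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "card-transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

SPEND_LIMIT = 5_000  # flag unusually large transactions


def looks_suspicious(txn: dict) -> bool:
    # Real systems combine many signals (velocity, geography, device, ML scores);
    # a single amount/location check keeps the sketch short.
    return txn["amount"] > SPEND_LIMIT or txn["country"] != txn["home_country"]


for message in consumer:
    txn = message.value
    if looks_suspicious(txn):
        # In production this would publish to an alerts topic or a case-management system.
        print(f"Review transaction {txn['id']} for card {txn['card_id']}")
```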
Real-time applications vs Streaming applications
Traditional data analysis draws on datasets that may get updated daily, weekly, or monthly. Real-time and streaming analytics act on data much sooner after it is created.
Real-time applications run closest to their data sources so they can act within milliseconds. That speed matters for applications like financial trading or managing electricity grids.
Streaming applications, while still delivering lower latency than traditional analytics, are slightly less time-sensitive. Response times measured in seconds are fast enough to tailor a webpage for an online customer or to generate alerts for information security teams.
Batch ingestion vs Batch processing
Analytics platforms like data lakes traditionally rely on a batch processing approach to data integration, copying large volumes of data into the repository to keep it current for analysis.
Batch ingestion
Batch ingestion is the traditional method for updating data lakes. ETL pipelines, either running on a regular cadence or triggered by an event, copy files from the source, transform the data, and load the data into the lake.
These pipelines are relatively easy to audit for compliance or maintenance issues. However, batch ingestion introduces significant latency, which leaves consumers with steadily aging data.
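For illustration, a simple batch ingestion job can be reduced to a few lines of Python. The sketch below assumes a nightly CSV export from a source system and uses pandas to transform the data and land a Parquet file in the lake; the paths, columns, and schedule are hypothetical.

```python
# Minimal sketch of a scheduled batch ETL job (illustrative only).
# Assumes nightly CSV exports from a source system; paths and columns are hypothetical.
from datetime import date

import pandas as pd

SOURCE_PATH = "/exports/orders_{day}.csv"             # extract: file dropped by the source system
LAKE_PATH = "s3://my-data-lake/orders/{day}.parquet"  # load target (S3 paths need the s3fs package)


def run_nightly_batch(day: date) -> None:
    # Extract: read the full daily export.
    df = pd.read_csv(SOURCE_PATH.format(day=day.isoformat()))

    # Transform: apply cleanup and business rules.
    df["order_ts"] = pd.to_datetime(df["order_ts"])
    df = df[df["status"] != "cancelled"]

    # Load: write a columnar file the analytics engine can query.
    df.to_parquet(LAKE_PATH.format(day=day.isoformat()), index=False)


if __name__ == "__main__":
    # A scheduler (cron, Airflow, etc.) would normally trigger this on a regular cadence
    # or in response to an event.
    run_nightly_batch(date.today())
```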
Batch processing
Another downside to batch ingestion is the risk of failure due to data inconsistencies that the ETL pipeline can’t handle. This weakness is particularly troubling in large, complex analytics processes. Users must wait for the pipeline to run again — a delay that can stretch to days.
Starburst’s batch processing mode introduces fault-tolerant execution. Instead of shutting down, pipelines will pick up where they left off after a cluster outage or memory overflow.
Stream ingestion vs Stream processing
Traditional batch processing approaches are inappropriate for bringing in data from real-time sources. For example, data lakes store information in machine-readable, columnar file formats, while real-time sources use row-based text formats like JSON. Data teams must develop additional processes to bring streaming data into analytics platforms.
Stream ingestion
Stream ingestion brings data from real-time sources into a data lake using a variation of traditional ETL data pipelines to produce up-to-date datasets that users can query almost as soon as the data is generated. The pipeline’s extract phase chunks the incoming data stream into micro-batches for processing within the pipeline.
Typically, data teams will combine open-source technologies to convert the row-based real-time data into the columnar format analytics platforms expect. A streaming platform like Apache Kafka collects streaming events for Apache Flink to write to object storage. Processing engines like Apache Spark then convert the streaming JSON files into Apache Parquet while consolidating file metadata within a metastore.
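One common way to wire these pieces together is a Spark Structured Streaming job that reads the Kafka topic in micro-batches, parses the JSON payload, and writes Parquet to object storage. The Python sketch below assumes a topic named clickstream, a simple event schema, and S3 paths, all of which are illustrative.

```python
# Minimal sketch of stream ingestion with Spark Structured Streaming (illustrative only).
# Assumes a Kafka topic named "clickstream" with simple JSON events; the topic name,
# schema, and S3 paths are hypothetical. Requires the Spark Kafka connector package
# (spark-sql-kafka) on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingestion").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the Kafka topic; Structured Streaming processes the stream in micro-batches.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Parse the row-based JSON payload into typed columns.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Write columnar Parquet files to the lake; the checkpoint tracks progress for recovery.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://my-data-lake/clickstream/")
    .option("checkpointLocation", "s3a://my-data-lake/_checkpoints/clickstream/")
    .start()
)

query.awaitTermination()
```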
Stream processing
This patchwork approach to stream ingestion requires intensive engineering development and constant maintenance. Starburst stream processing provides a more straightforward path for bringing Kafka event streams into an Amazon S3-hosted data lake. Flink running on Amazon Kinesis converts the Kafka event stream into an Apache Iceberg table registered in the AWS Glue catalog. Once in the data lake, Starburst Galaxy’s analytics engine makes the near real-time data available for live interactive querying and latency-sensitive streaming applications.
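Once the events land as an Iceberg table, analysts can query them with standard SQL. The sketch below uses the open-source trino Python client to run an interactive query against a Starburst Galaxy cluster; the hostname, credentials, catalog, and table names are placeholders, not real endpoints.

```python
# Minimal sketch of querying near real-time data in Starburst Galaxy (illustrative only).
# Uses the open-source trino Python client; host, credentials, catalog, and table
# names are hypothetical placeholders.
import trino
from trino.auth import BasicAuthentication

conn = trino.dbapi.connect(
    host="example-cluster.galaxy.starburst.io",
    port=443,
    user="user@example.com",
    http_scheme="https",
    auth=BasicAuthentication("user@example.com", "password"),
    catalog="my_lake",   # catalog backed by the AWS Glue / Iceberg tables
    schema="streaming",
)

cur = conn.cursor()
# Standard ANSI SQL over the continuously updated Iceberg table.
cur.execute("""
    SELECT url, count(*) AS views
    FROM clickstream
    WHERE event_ts > current_timestamp - INTERVAL '5' MINUTE
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
""")
for url, views in cur.fetchall():
    print(url, views)
```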
Kafka to Iceberg: Store streaming data in S3 buckets with Starburst streaming ingestion
Assembling separate technologies to create a streaming ingestion workflow can quickly become expensive and difficult to manage. These processes can duplicate data, undermining the data lake’s quality. Furthermore, scalability is an issue as bespoke solutions require developing complex code to knit the various open source technologies together.
Streaming ingestion is a fully managed solution within Starburst Galaxy. Currently available in private preview in select AWS regions, this exciting new feature dramatically simplifies the ingestion of streaming data into your data lakehouse. A simple three-step process lets Starburst automatically ingest a Kafka-compliant topic’s data stream and write it to an Apache Iceberg table in your data lake.
Starburst’s streaming ingestion workflows deliver exactly-once guarantees, near real-time availability, and true scalability. Moreover, data engineering teams can take advantage of Apache Iceberg’s modern table format, which supports features like snapshot isolation, schema evolution, time travel, and ACID transactions. At the same time, data consumers benefit from the ability to write ANSI-standard SQL statements to query low-latency updates to the data lake.
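These Iceberg features are exposed directly through SQL. As an illustration, the snippet below runs a time travel query and a schema change against the ingested table using Trino’s Iceberg syntax; the table name, timestamp, and connection details are hypothetical.

```python
# Illustrative time travel and schema evolution against the ingested Iceberg table.
# Table name, timestamp, and connection details are hypothetical.
import trino
from trino.auth import BasicAuthentication

conn = trino.dbapi.connect(
    host="example-cluster.galaxy.starburst.io",
    port=443,
    user="user@example.com",
    http_scheme="https",
    auth=BasicAuthentication("user@example.com", "password"),
    catalog="my_lake",
    schema="streaming",
)
cur = conn.cursor()

# Query the table as it looked at an earlier point in time (Iceberg time travel).
cur.execute("""
    SELECT count(*)
    FROM clickstream FOR TIMESTAMP AS OF TIMESTAMP '2024-05-01 00:00:00 UTC'
""")
print(cur.fetchone())

# Schema evolution is a metadata-only operation on Iceberg tables.
cur.execute("ALTER TABLE clickstream ADD COLUMN referrer varchar")
```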
Starburst x Confluent Tableflow: Streaming ingestion, Icehouse, open data lakehouse
While open source has many advantages, enterprises may not want to devote scarce resources to low-level development and maintenance of technologies like Kafka. That’s why Kafka’s creators founded Confluent, a commercial service providing enterprise-grade streaming data solutions at scale.
Confluent’s recently announced Tableflow streaming integration feature lets companies link their streaming architecture with their Starburst data lakehouse analytics platform. Tableflow materializes Kafka topics into Iceberg tables and Parquet data files that are instantly accessible to Starburst users. Announced at Kafka Summit London 2024, Tableflow is in private early access.