Traditional batch processing moves large amounts of data into a data warehouse all at once to support historical analyses. Stream processing lets companies make better decisions from the real-time data streams emerging from manufacturing, e-commerce, and other sources.
This guide will introduce near real-time data analytics and explain how an Icehouse architecture optimizes streaming ingestion.
Real-time analytics use cases
Conventional data analytics seeks to discover insights from historical data sets stored in repositories like a data warehouse. Strategic and near-term decision-making still depends on this kind of analysis. However, the availability of real-time data sources lets companies make data-driven decisions on time frames measured in minutes or milliseconds. A few examples of real-time analytics use cases include:
Customer 360
Customer 360 systems let retailers understand customer behavior in the moment. They consolidate a steady stream of customer data, from website clickstreams to social media posts, to model and predict purchasing intent. Automated systems use this near real-time data to modify the customer experience while the customer is shopping to improve conversion rates and increase revenue.
Healthcare
Healthcare is an increasingly data-driven industry. In the past, doctors and nurses collected patient data periodically, with hours or months separating each check. Today, medical sensors generate steady data streams from the hospital bed, while wearable connected devices monitor patients’ chronic illnesses at home. Analyzing these streams in near real-time identifies worsening conditions quickly, resulting in faster interventions and better patient outcomes.
Financial services
Regulators require financial services providers like banks and investment firms to monitor customer transactions for signs of fraud, money laundering, and other criminal activity. Waiting for overnight batch processes adds costly delays and increases regulatory risks. Fraud detection systems rely on the real-time analysis of customer transactions to identify unusual user behavior as it happens so financial services providers can act immediately.
Data science
Artificial intelligence and machine learning models can take months to develop. Once operational, they often depend on a steady stream of real-time data. For example, supply chain management systems use machine learning algorithms in predictive analytics tools. These algorithms can route shipments more efficiently by incorporating real-time feeds from logistics, manufacturing, weather services, and other sources.
Icehouse architecture for near real-time analytics
An Icehouse is a new kind of data architecture, also called an open data lakehouse, that combines the flexible scalability of data lakes with the full-featured performance of data warehouses. Icehouses decouple storage from compute, delivering low-latency performance on petabyte-scale data sets while affordably scaling storage. The Icehouse’s foundation comprises three elements: cloud object storage, Iceberg’s open table format, and the Trino SQL query engine.
Cloud object storage: Commodity pricing and on-demand scalability make services like Amazon S3 or Azure Blob Storage better suited than on-premises data centers for the volume, velocity, and variety of modern enterprise data.
Iceberg: Netflix developed the Iceberg open table format to overcome Hive’s performance issues with large datasets. Iceberg tables use rich metadata to support warehouse-like features, including ACID compliance, schema evolution, and time travel.
Trino: This massively parallel open query engine delivers low-latency data processing while democratizing data access through ANSI-standard SQL. Connectors to over fifty enterprise data sources create a federated data architecture that lets a single Trino query access multiple sources.
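To make federation concrete, here is a minimal sketch of a Trino query joining an Iceberg table in object storage with a table reached through a PostgreSQL connector. The catalog, schema, and table names (lake, postgres, orders, customers) are hypothetical, not part of any particular deployment.

```sql
-- Hypothetical federated query: join an Iceberg table on object storage
-- with an operational database table exposed through a connector.
SELECT
    c.customer_id,
    c.region,
    sum(o.order_total) AS revenue_last_7_days
FROM lake.sales.orders AS o            -- Iceberg table in the data lake
JOIN postgres.crm.customers AS c       -- PostgreSQL table via connector
    ON o.customer_id = c.customer_id
WHERE o.order_ts >= current_timestamp - INTERVAL '7' DAY
GROUP BY c.customer_id, c.region
ORDER BY revenue_last_7_days DESC;
```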
Starburst Galaxy’s fully managed Trino implementation enhances the open-source query engine’s performance and governance, allowing data teams to manage this federated architecture through a unified interface. A Galaxy-based Icehouse becomes the single point of access to all data sources, letting users query data ingested from real-time sources as easily as historical data sets.
These technologies power the Icehouse’s four defining capabilities: data ingestion, data governance, data management, and autoscaling cluster management.
Data ingestion
Engineers use Trino’s SQL query engine in fault-tolerant mode to create reliable ETL data pipelines that ingest batched data at scale. With Trino’s Apache Kafka connector, they can ingest data from real-time sources. Starburst’s streaming ingestion automatically brings streams from Kafka-compliant topics into an Iceberg table where they are immediately available for analysis.
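As a rough sketch of what such a pipeline step can look like, the query below copies new messages exposed by Trino’s Kafka connector into an Iceberg table. It assumes a Kafka catalog named kafka whose topic fields have been mapped to columns, plus an Iceberg catalog named lake; all names are illustrative.

```sql
-- Hypothetical ingestion step: land Kafka messages in an Iceberg table.
-- _timestamp is the Kafka connector's built-in message timestamp column;
-- user_id and page_url assume the topic schema is mapped to columns.
INSERT INTO lake.web.clickstream_events
SELECT
    _timestamp AS event_ts,
    user_id,
    page_url
FROM kafka.default.clickstream
WHERE _timestamp > (SELECT max(event_ts) FROM lake.web.clickstream_events);
```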
Data governance
Starburst enhances Trino’s native security features with a central control panel that lets administrators create granular role- and attribute-based access controls down to the column and row levels. An API lets companies integrate Starburst with their existing security stacks.
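Where a connector supports SQL role management, the same intent can be expressed in standard Trino statements; the role and table names below are hypothetical, and in Galaxy these policies would typically be defined through the governance interface instead.

```sql
-- Hypothetical role-based grants in Trino SQL; all names are illustrative.
CREATE ROLE analyst;
GRANT SELECT ON lake.web.clickstream_events TO ROLE analyst;
-- Access can later be withdrawn just as declaratively:
REVOKE SELECT ON lake.web.clickstream_events FROM ROLE analyst;
```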
Data management
Iceberg tables support data management features like schema evolution, table compaction, and ACID compliance. Starburst Galaxy automates these activities to alleviate pressures on data team resources.
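The statements below sketch what these maintenance tasks look like when run by hand against Trino’s Iceberg connector; the table name is hypothetical, and Galaxy’s automation performs the equivalent work behind the scenes.

```sql
-- Schema evolution: add a column without rewriting existing data files.
ALTER TABLE lake.web.clickstream_events ADD COLUMN session_id varchar;

-- Compaction: rewrite the small files that streaming ingestion produces.
ALTER TABLE lake.web.clickstream_events
    EXECUTE optimize(file_size_threshold => '128MB');

-- Time travel: query the table as of an earlier snapshot.
SELECT count(*)
FROM lake.web.clickstream_events
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
```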
Automatic capacity management
Starburst Galaxy’s cluster autoscaling lets admins customize infrastructure to achieve the right balance of cost and performance. Galaxy’s auto-suspend feature, which spins down inactive clusters, can also lead to further savings.
Near real-time data with streaming ingest
Starburst’s streaming ingestion simplifies the process of bringing real-time data into an Icehouse. Administrators view connections to Kafka or Kafka-compliant systems like Confluent as streams. Starburst continuously reads messages from the stream, applies a relational structure, and writes the data to an Iceberg table. The data becomes immediately accessible with exactly-once guarantees, ensuring the data’s freshness and accuracy.
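Because the ingested rows land in an ordinary Iceberg table, checking freshness is just a query. The sketch below assumes the hypothetical clickstream table used in the earlier examples.

```sql
-- Hypothetical freshness check on a stream-fed Iceberg table.
SELECT
    count(*)      AS events_last_5_min,
    max(event_ts) AS latest_event
FROM lake.web.clickstream_events
WHERE event_ts >= current_timestamp - INTERVAL '5' MINUTE;
```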
Integrating real-time data sources with Starburst’s streaming ingestion lets data teams:
- Continuously ingest data at scale in a highly automated way.
- Make data available in near real-time to support the speed of modern decision-making.
- Reassure users that near real-time data sets are accurate and up-to-date.
In the past, landing streaming data in a data lake required expensive, time-consuming development to jury-rig various tools into a one-off fix. Streaming ingest replaces this bespoke code with a simple, automated feature in Starburst Galaxy’s management interface.
Benefits of near real-time analytics on the open data lakehouse
Using an open data lakehouse for near real-time analytics makes data management more efficient while improving the quality of business insights.
Data management efficiencies
Since data warehouses handle streams of unstructured data poorly, organizations must maintain multiple analytics platforms. A warehouse containing historical data supports business intelligence analysis and data science projects, while one or more real-time solutions support operations management.
An Icehouse unifies data processing within a single analytics solution. Eliminating redundant real-time applications yields near-term cost savings, and the Icehouse also improves data management productivity in several ways.
The previous section discussed how Starburst Galaxy’s streaming ingest reduces the development and management expense of integrating real-time data sources. An Icehouse further simplifies data management workflows by replacing many batch processing pipelines with SQL queries. Galaxy’s fault-tolerant execution makes the remaining pipelines more performant and reliable.
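For example, a nightly upsert job can often collapse into a single MERGE statement run on a schedule; the tables below are hypothetical, with a staging table reached through a connector feeding an Iceberg dimension table.

```sql
-- Hypothetical MERGE replacing a batch upsert pipeline; names are illustrative.
MERGE INTO lake.crm.customers AS t
USING postgres.crm.customers_staging AS s
    ON t.customer_id = s.customer_id
WHEN MATCHED THEN
    UPDATE SET region = s.region, updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, region, updated_at)
    VALUES (s.customer_id, s.region, s.updated_at);
```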
Data consumers place far fewer demands on engineering resources since Galaxy’s SQL support lets them access data sources directly through their preferred analytics and data visualization software. As soon as Galaxy writes data from real-time sources into the Icehouse’s Iceberg tables, the data is available to authorized users throughout the company.
Unveiling insights in near real-time
Unifying real-time sources within the Icehouse’s federated architecture helps organizations generate better insights faster. Traditional data processing systems like Hadoop helped companies address the challenge of volume at the dawn of the big data era. However, they were not well-suited for the high volumes of sparse unstructured data real-time systems generate.
The value in data streams resides in relatively few individual data points. Real-time analytics platforms ingest and process these streams to find the valuable data points that inform decisions for the kinds of use cases discussed earlier.
For example, network activity logs report vast quantities of perfectly normal user activity. Individual analysts could review the data all day and never spot anything unusual. Often paired with machine learning algorithms, real-time analytics quickly spots a potential breach and notifies security teams for further action.
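A simple screen over a stream-fed log table illustrates the idea; the table, columns, and threshold below are assumptions for illustration, and in practice a model would tune or replace the fixed cutoff.

```sql
-- Hypothetical anomaly screen: users with many failed logins recently.
SELECT user_id, count(*) AS failed_attempts
FROM lake.security.auth_events
WHERE event_ts >= current_timestamp - INTERVAL '5' MINUTE
  AND outcome = 'FAILURE'
GROUP BY user_id
HAVING count(*) > 20
ORDER BY failed_attempts DESC;
```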
Near real-time analytics also feeds data to operational dashboards. With almost instantaneous metrics at their fingertips, frontline personnel better understand the state of systems performance.
Icehouse: Optimized for faster decision-making
The benefits of real-time analytics ultimately support accelerated decision-making. Unlike the years-long horizon of corporate strategy or a sales manager’s review of yesterday’s numbers, many decisions must be made as close to real-time as possible. Fraud detection and cybersecurity systems must identify incidents quickly to minimize response times. E-commerce systems have as little as a minute to influence customers’ shopping behaviors.
Legacy systems limited to historical data sets struggle to keep pace with the velocity of modern enterprises. Resource-intensive data pipelines and time-consuming batch processing add too much latency.
An Icehouse based on Starburst Galaxy’s open data lakehouse analytics platform provides the near real-time insights the organization needs to make better informed, more responsive business decisions.