Open source and cloud computing have steadily decentralized data architectures. Rather than settling for a monolithic, proprietary solution, companies can design their data platforms using the open-source technologies most appropriate to their unique use cases. The widespread adoption of cloud storage, open table formats, and open-source compute engines like Trino decouples storage from compute, letting data teams optimize each independently for cost and performance.
This guide will review the rise of open data stacks, Trino’s relationship with Apache Iceberg, and how they combine with Starburst Galaxy’s Trino implementation to create the Icehouse data architecture.
Databases: compute vs. storage
As with other aspects of information technology, data analytics has seen a steady evolution from closed, centralized systems to open, modular data stacks. More importantly, data architectures have decoupled storage from compute, allowing each to be optimized independently. Companies went from accepting the cost and constraints of packaged solutions to building scalable open data stacks from free, open-source technologies.
From monolithic to modular analytics architectures
The industry historically positioned data analytics platforms as monolithic products when, in fact, databases consist of two core functions: storage and compute. Whether a company bought a data warehouse from IBM, Oracle, Snowflake, or Teradata, it had to accept the vendor’s bundled pricing structures. Getting enough compute meant over-provisioning storage and vice versa. Proprietary software sold on proprietary hardware made advanced analytics expensive, especially as the era of big data sent data volumes skyrocketing.
Hadoop provided an alternative by letting companies run an open-source platform on the general-purpose CPUs, local SSDs, and hard drives in their data centers. The Apache project’s core technologies and broader ecosystem emerged from companies like Yahoo!, Netflix, and Uber that needed a more affordable way to process vast amounts of data. Hadoop made big data analytics accessible to companies that could not otherwise justify the cost of proprietary data warehouses.
Even so, Hadoop tightly coupled storage and compute, and scaling capacity required expensive, long-term investments in physical infrastructure. Still, Hadoop proved that open-source software could handle analytics at enormous scale, paving the way for the decoupled architectures that followed.
Moving analytics to the cloud
Cloud-based Infrastructure-as-a-Service (IaaS) providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) made it easier for startups and established enterprises to develop their IT systems without expensive physical infrastructure and without the traditional coupling of compute and storage. Cloud computing services like Google Compute Engine (GCE) let engineers spin up virtual machines, or containers orchestrated with tools like Kubernetes and Docker, on demand without being locked into a fixed computing capacity. Cloud storage services like Amazon S3 let companies efficiently store data as objects while taking advantage of the cloud’s effectively unlimited storage capacity.
Proprietary analytics platforms grudgingly accepted the cloud model as more customers demanded similar scalability. Likewise, Hadoop has become more cloud-friendly with ecosystem options like Apache Ozone and Hadoop-in-the-cloud services like Amazon EMR. Yet, these measures did not give data teams the freedom to manage storage and compute independently.
Iceberg enables open data stacks
Recently, this final barrier fell as Snowflake and Databricks announced the adoption of Iceberg’s open table format. This Iceberg integration lets companies replace the vendors’ expensive storage options with more cost-effective object storage services. Moreover, data once bound within these data warehouses and inaccessible to anything but vendor-approved analytics tools is now accessible to any Iceberg-compatible compute engine.
This development lets data teams build truly open data stacks that treat proprietary systems as one-of-many data sources rather than as the centralized source of truth. Engineers can choose the technologies that make the most sense for the business — an increasingly important freedom with the mainstreaming of machine learning and generative AI.
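For illustration, here is a minimal sketch of how a team might define an Iceberg table on object storage through Trino’s Iceberg connector; the iceberg catalog and sales schema names are assumptions for this example, not fixed conventions:

CREATE TABLE iceberg.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    total       DECIMAL(12, 2)
)
WITH (
    -- store the data as Parquet files in object storage
    format = 'PARQUET',
    -- partition by month so queries can prune irrelevant files
    partitioning = ARRAY['month(order_date)']
);

Because the table format is open, the same files remain readable by any other Iceberg-compatible engine, not just the one that wrote them.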
Trino: open source compute engine
Developers at Netflix designed Iceberg to support Trino-based queries, and the two technologies remain closely aligned. Trino is quickly becoming the compute engine of choice for open data stacks because it delivers accessibility, performance, and optionality for big data workloads.
ANSI-standard SQL
Commercial data warehouses often used proprietary query languages or unique SQL implementations that helped foster vendor lock-in. Hadoop MapReduce’s complex Java programming model made data access similarly difficult. Trino lets users write ANSI-standard SQL queries, opening access far beyond the data team.
Data scientists can embed queries in Python or other code or execute statements through a command-line interface. Analysts can use their preferred business intelligence applications to query data sources without an engineer’s assistance.
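A typical report is plain ANSI SQL; the catalog, schema, and table names below are hypothetical and reuse the earlier example:

SELECT c.region,
       count(*)     AS order_count,
       sum(o.total) AS revenue
FROM iceberg.sales.orders o
JOIN iceberg.sales.customers c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01'
GROUP BY c.region
ORDER BY revenue DESC;

The same statement can run unchanged from the Trino command-line client (for example, via its --execute option), from a Python client, or through a BI tool’s SQL connection.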
Massively parallel processing
A Trino cluster consists of multiple virtual machine instances to enable massively parallel processing on huge data volumes. A user’s query arrives at the Coordinator node, which parses, plans, and schedules the query tasks. Worker nodes process these tasks in parallel across multiple threads.
Trino was designed to process large interactive queries without the expensive intermediate disk writes that Hadoop MapReduce imposes. Predicate pushdown, filtering, and other optimizations further reduce query latency. As a result, the open compute engine can deliver results in seconds or minutes rather than the minutes or hours of legacy platforms.
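To see how the coordinator splits a statement across workers, Trino’s EXPLAIN command prints the distributed plan; this sketch reuses the hypothetical table from above:

-- show the plan fragments the coordinator will schedule onto workers
EXPLAIN (TYPE DISTRIBUTED)
SELECT order_date, sum(total)
FROM iceberg.sales.orders
GROUP BY order_date;

The output lists the plan fragments and the exchanges between them, including where partial aggregations run on workers before results are merged.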
Connectors and federation
Unlike the monolithic platforms of the past, Trino does not try to create the mythical single source of truth. Enterprise data is too diverse. Instead, Trino connectors link all enterprise data sources within a virtual access layer. This single source of access lets analysts generate deeper insights to support better decision-making.
Data teams also get more flexibility. Rather than trying to consolidate all data within one repository, engineers can leave sparse, unstructured, or less frequently accessed data at the source. Avoiding costly and resource-intensive integration makes data management more efficient.
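As a sketch of federation in practice, one statement can join a transactional database with lakehouse history, with no copying step in between; the postgresql catalog name below assumes that connector is configured and is purely illustrative:

-- join live operational data with historical data at query time
SELECT o.order_id, o.total, h.lifetime_value
FROM postgresql.public.orders o
JOIN iceberg.sales.customer_history h
  ON o.customer_id = h.customer_id
WHERE o.created_at > current_date - INTERVAL '7' DAY;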
Use cases beyond interactive queries
Although developed as an interactive query engine, Trino also improves other data management use cases. For example, the compute engine’s fault-tolerant execution mode makes ETL data pipelines more reliable. Furthermore, Trino’s Kafka connector lets data teams integrate streaming sources to support near-real-time analytics.
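A minimal sketch of the streaming case: assuming a kafka catalog is configured and a clickstream topic has a schema mapping that exposes a page_url field, recent events can be queried like any other table:

-- count page views from the last five minutes of the stream
SELECT page_url, count(*) AS views
FROM kafka.default.clickstream
WHERE _timestamp > current_timestamp - INTERVAL '5' MINUTE
GROUP BY page_url;

Here _timestamp is one of the internal columns the Kafka connector exposes for each message.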
Starburst Galaxy: fully managed service
Starburst Galaxy is a cloud-based Trino implementation. This fully managed service enhances the compute engine’s performance and adds discovery, governance, and security features to manage a unified data architecture.
Trino enhancements
Warp Speed, Starburst’s suite of Trino accelerators, lets the compute engine run high-performance queries for the heaviest workloads. Features like smart indexing, caching, and materialized views reduce data in motion while delivering query results four times faster with a seventh of the compute resources.
Data discovery
Starburst Gravity is a discovery and governance layer that lets data teams manage all connected data assets from a single pane of glass. Schema discovery and data cataloging automate routine tasks when integrating new data sources.
Governance and security
Gravity also simplifies compliance and security management. Fine-grained, role- and attribute-based access controls govern user permissions down to the row and column levels. In addition, Starburst APIs integrate with identity providers and other elements of existing security stacks.
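In SQL terms, role-based grants might look like the sketch below; the role and table names are hypothetical, and in Galaxy these policies, including row filters and column masks, are typically managed through the console or APIs rather than hand-written statements:

-- create a role and grant it read access to a single table
CREATE ROLE analyst;
GRANT SELECT ON iceberg.sales.orders TO ROLE analyst;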
Business benefits of an Icehouse architecture
Scalable, cost-effective object storage combined with the Iceberg open table format and Starburst’s enhanced Trino implementation creates a new data architecture called the Icehouse. This open data stack delivers the performance and functionality of legacy systems along with the affordable scalability of the open-source cloud.
Icehouse capabilities
The Icehouse rivals traditional data warehouses by combining enterprise-class data management capabilities with the scalability and optionality of cloud technologies. This combination gives an Icehouse four core capabilities:
Data ingestion: Icehouses efficiently ingest data through streaming and batch workflows.
Data governance: Icehouses control access based on where data lives, what it contains, and who makes the request.
Data management: Icehouses leverage Iceberg’s table compaction, schema evolution, and ACID compliance (see the example after this list).
Automatic capacity management: Icehouse autoscaling resizes the Trino compute engine’s clusters dynamically and automatically optimizes storage utilization.
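As an example of the data management capability, Trino’s Iceberg connector exposes table maintenance and schema evolution directly in SQL; the table name and size threshold below are illustrative:

-- compact small files into larger ones for faster scans
ALTER TABLE iceberg.sales.orders EXECUTE optimize(file_size_threshold => '128MB');

-- evolve the schema without rewriting existing data files
ALTER TABLE iceberg.sales.orders ADD COLUMN discount DECIMAL(5, 2);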
Benefits of Starburst Galaxy Icehouses
Galaxy’s fully managed Icehouse solution streamlines the deployment and management of this modern data architecture while fostering a data-driven decision-making culture.
Enhanced Trino connectors for over fifty enterprise data sources let data teams quickly bring their systems within Galaxy’s unified interface. The automated discovery tools in Gravity alleviate mundane workloads and make these data sources available faster. From a single console, administrators quickly create the granular access controls needed to make this data accessible to authorized users.
Once deployed, a Starburst Galaxy Icehouse gives data teams the tools to manage streaming and batch pipelines, unstructured and structured data, as well as ad hoc and near-real-time analytics. The data enterprise becomes more efficient now that engineers can use Starburst to optimize the balance between query cost and performance.
A Starburst Galaxy Icehouse enables truly data-driven business cultures by putting data in users’ hands through Trino’s SQL queries. Anyone can access the data they need to make informed decisions, from analysts generating more insightful reports to data scientists developing innovative machine learning applications.