Open data lakehouse architectures address the growing need for warehouse-like analytics that works at scale with different data formats and disparate data sources. Turning to open source lets companies build an affordable yet performant analytics infrastructure that keeps pace with data’s rapid growth.
This article will introduce the open data lakehouse concept and its use cases, as well as explain how Apache Iceberg and Trino combine to form a powerful analytics resource.
What are the key components of an open data lakehouse platform?
An open data lakehouse platform builds upon commercial cloud object storage services but otherwise draws from the open-source ecosystem to construct a scalable, performant analytics solution. The three open-source components are file formats, table formats, and compute engines.
Commodity object storage
Amazon S3, Azure Blob Storage, and other cloud platforms provide commodity data storage at petabyte scales. By decoupling storage from compute, data teams can optimize the costs and performance of each independently. These cloud platforms are also more flexible, scalable, and affordable than on-premises storage infrastructure.
Open file format
Open file formats define how a lakehouse writes and reads data. Columnar file formats structure data in ways that enhance query performance, providing detailed metadata and indexes that queries use to skip irrelevant data. Examples of open file formats include Apache Parquet and ORC.
Open table format
Open table formats add an abstraction layer over a data lake’s sparse metadata, creating warehouse-like tables from structured and unstructured data. Table formats define each table’s schema and partitions and describe the files it contains. By keeping this information in a separate, central reference, tables let queries skip opening every file’s header and go straight to the most relevant files. Delta Lake and Apache Iceberg are commonly used open table formats.
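As a rough sketch of how the file and table formats come together, the Trino Iceberg connector lets a single CREATE TABLE statement declare the file format, the schema, and the partitioning. The catalog, schema, and column names below are hypothetical:

```sql
-- Create an Iceberg table backed by Parquet files, partitioned by month.
-- The lakehouse.sales catalog and schema names are illustrative.
CREATE TABLE lakehouse.sales.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_total  DECIMAL(12, 2),
    order_date   DATE
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(order_date)']
);
```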
Open compute engine
Open compute engines are the lakehouse components that elevate big data analytics far beyond a conventional warehouse’s capabilities. Designed for massive parallelization in cloud environments, these compute engines can process large datasets quickly while balancing compute costs. As a result, lakehouses can support streaming ingestion and near real-time analytics, at the same time giving data consumers access to database-like functionality. Frequently used open compute engines include Trino and Apache Spark.
Open Source Trino
Many organizations use Trino with their data lakehouses because its massively parallel, distributed SQL query engine provides a unique combination of performance, cost-effectiveness, and accessibility. The engine was originally developed at Facebook (as Presto) to improve query performance in the Hadoop ecosystem, but it now works across many data platforms.
Trino use cases
Common Trino use cases include:
Interactive data analytics
Unlike the proprietary implementation of a data warehouse, Trino uses ANSI-standard SQL to maximize analytics accessibility. SQL-compatible business intelligence applications like Tableau can easily return data from the lakehouse. Exploration and data extraction become easier when data scientists can write standard SQL statements in Python or other programming languages. Engineers can quickly develop dashboards and other data products for the least technical users. Up and down the organization, Trino lets data consumers conduct interactive analysis of large datasets.
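For example, an analyst might run an ordinary ANSI SQL aggregation like the one below directly against the lakehouse, reusing the hypothetical orders table from the earlier sketch:

```sql
-- Ad hoc aggregation an analyst could run from a SQL client or BI tool.
SELECT
    date_trunc('month', order_date) AS order_month,
    count(*) AS order_count,
    sum(order_total) AS revenue
FROM lakehouse.sales.orders
WHERE order_date >= DATE '2024-01-01'
GROUP BY 1
ORDER BY 1;
```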
Centralized data access and federated analytics
Trino connectors eliminate data silos by federating data sources across the company so a single Trino query can access data lakes, relational databases, and streaming sources. Besides streamlining query development, this federation lets data teams optimize lakehouse storage for frequently accessed data without isolating potentially valuable data held in other systems.
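A minimal sketch of a federated query: one statement joins the hypothetical Iceberg table in object storage with a customers table reached through a PostgreSQL connector (both catalog names are assumptions):

```sql
-- Join lakehouse data with an operational database in a single query.
SELECT
    c.region,
    sum(o.order_total) AS revenue
FROM lakehouse.sales.orders AS o
JOIN postgres.public.customers AS c
    ON o.customer_id = c.customer_id
GROUP BY c.region;
```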
High performance analytics of object storage
Parallelization at scale and query optimizations make Trino an ideal solution for performing big data analytics on object storage. Trino can push queries down to source systems to reduce compute costs and leverage the source’s indexes. Dynamic filtering lets Trino skip data that the query would ultimately filter out. And a cost-based optimizer distributes compute loads to balance cost with performance.
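These optimizations are visible in the query plan. Running the illustrative federated query from the previous section under EXPLAIN shows, for instance, where predicates are pushed into the connectors and where dynamic filters are assigned:

```sql
-- Inspect the optimized plan for pushed-down predicates and dynamic filters.
EXPLAIN
SELECT o.order_id, c.region
FROM lakehouse.sales.orders AS o
JOIN postgres.public.customers AS c
    ON o.customer_id = c.customer_id
WHERE c.region = 'EMEA';
```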
Batch ETL processing across disparate systems
Data ingestion and other workflows that require complex ETL pipelines typically run in overnight batches because they take so long and consume significant resources. Trino accelerates and simplifies these pipelines by letting engineers use standard SQL statements within a single system to query multiple data sources. Besides streamlining pipelines, faster batch processing speeds an ad hoc research project’s time to insight.
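A sketch of one such pipeline step, assuming the hypothetical catalogs from the earlier examples: a single Trino statement reads yesterday’s rows from an operational database and appends them to the lakehouse table.

```sql
-- Nightly batch step: copy yesterday's orders from PostgreSQL into Iceberg.
INSERT INTO lakehouse.sales.orders
SELECT order_id, customer_id, order_total, order_date
FROM postgres.public.orders
WHERE order_date = current_date - INTERVAL '1' DAY;
```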
Why Apache Iceberg? Why open table formats?
Apache Iceberg is a highly performant open table format that brings data warehousing functionality to a data lake’s object storage and integrates with modern query engines like Trino.
Developers at Netflix created Iceberg’s table format to address the challenges of working with Hive. Basic data management operations, like deletes and updates, had become increasingly difficult, as had meeting data governance demands: a single change could require overwriting huge datasets.
4 Benefits of open table formats and Apache Iceberg
Open table formats like Iceberg can better meet modern analytics needs than older technologies like Hive. Some benefits include:
1. Central table storage
An Iceberg table’s catalog provides a central starting place for queries to find metadata without accessing files individually. The catalog points to table metadata files where the query can find schema, file metadata, and other information.
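The Trino Iceberg connector exposes this metadata through queryable system tables, so the catalog and manifest information can be examined with plain SQL (again using the hypothetical orders table):

```sql
-- Review the table's snapshot history without reading any data files.
SELECT snapshot_id, committed_at, operation
FROM lakehouse.sales."orders$snapshots";

-- List the data files the table's metadata tracks.
SELECT file_path, record_count, file_size_in_bytes
FROM lakehouse.sales."orders$files";
```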
2. Access control
Open table formats like Iceberg leave access control to the open compute engines, which leverage table metadata to secure access, protect data, and apply privacy rules.
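As a sketch, assuming a catalog whose authorization model supports SQL-standard grants, those rules can be expressed directly in Trino SQL (the analyst role is hypothetical):

```sql
-- Let the analyst role read, but not modify, the orders table.
GRANT SELECT ON lakehouse.sales.orders TO ROLE analyst;
```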
3. Portable compute
Open table formats help eliminate vendor lock-in and make data more portable. They work with multiple compute engines, so companies are not tied to one vendor’s product.
4. Schema evolution
Open table formats are also less sensitive to how data changes over time. Schema evolution lets these tables evolve without requiring massive — and expensive — rewrites.
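For example, Iceberg handles schema changes through Trino as metadata-only operations; the statements below (with hypothetical column names) add and rename columns without rewriting existing data files:

```sql
-- Evolve the table's schema in place; existing Parquet files are untouched.
ALTER TABLE lakehouse.sales.orders ADD COLUMN discount_code VARCHAR;
ALTER TABLE lakehouse.sales.orders RENAME COLUMN order_total TO order_amount;
```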
What are the key benefits of implementing an open data lakehouse architecture?
An open data lakehouse architecture combines the benefits of data warehouses and data lakes into a single, performant analytics platform:
| Data warehouse benefits | Data lake benefits |
| --- | --- |
| Structured data ready for analysis | Stores structured and unstructured data |
| Rich metadata for every table | Petabyte-scale commodity object storage |
| Exploration and discovery resources | Storage decoupled from compute |
| Database-like functionality and SQL access | Flexible, scalable, and affordable |
How does an open data lakehouse integrate with existing data platforms?
Trino does more than power the analytics capabilities of data lakehouses. By federating data sources beyond the lakehouse’s object storage, Trino makes an enterprise’s entire data infrastructure part of the data lakehouse to create a virtual repository anyone can access.
Why does AI need an open data lakehouse architecture?
Artificial intelligence and machine learning algorithms have become critical sources of competitive advantage. Leveraging these data science tools generates innovative products and streamlines business processes. Increasingly, AI/ML projects run into the limitations of traditional data warehouses and data lakes.
Data scientists need more varied datasets than warehouses can provide. A reliance on structured data misses key patterns and insights within vast quantities of a company’s unstructured data.
Data lakes can hold this varied data, but the objects they store have limited metadata compared to the files in a warehouse. Furthermore, lakes can’t match a data warehouse’s rich exploration and discovery resources.
Whether pulling data from other sources to compensate for a warehouse’s limitations or extracting data from a data lake’s repository, a data science project will require complex ETL pipelines and extensive data engineering resources.
An open data lakehouse provides a high-performance solution for big data processing. Scientists can easily explore structured and unstructured datasets thanks to rich metadata and powerful query engines. These accessible tools reduce the reliance on over-tasked data teams, speeding the development of new artificial intelligence applications.
Lakehouses: a new generation of open platforms
Starburst Galaxy uses Trino to improve modern data lakehouse architectures, making the lakehouse a single point of access to all data formats stored in any data source. Galaxy’s enhancements, like smart indexing and caching, accelerate query performance. Granular access controls provide the governance enforcement needed to deliver a self-service model that makes data both more accessible and more secure.
Starburst Galaxy lets companies implement Trino on their data lakehouses without worrying about the open-source software’s operational aspects.