Cloud Data Lakehouse

A cloud data lakehouse gives companies the data analytics benefits of a traditional data warehouse while letting them leverage the affordable scalability of commodity cloud storage providers.

A typical cloud data lakehouse architecture uses open table formats like Apache Iceberg and commodity object storage services like Amazon S3 to create a cost-effective, scalable enterprise data analytics platform that powers data-driven decision-making.

Cloud data lakehouse architecture vs. on-premises data warehouses

Let’s take a closer look at the cloud data lakehouse concept and its advantages over on-premises data warehouses before offering ways to implement an effective cloud data lakehouse solution.

Lakehouse architectures are better suited to how modern enterprises create, store, and use data. As a result, lakehouses yield significant advantages over traditional on-premises data warehouses and more recent cloud data warehouse offerings.

Dynamic scalability

Cloud data lakehouse: Cloud platforms scale both storage and compute to match the instantaneous needs of a company’s analytics activities.

On-premises: On-premises data infrastructure works best with predictable baseline workloads, but it cannot scale responsively enough to meet the decision-making needs of a data-driven organization.

Enterprise-wide accessibility

Cloud data lakehouse: Cloud data lakehouses increase data accessibility by unifying global enterprise data sources within a single platform. Relational databases, data warehouses, and data lakes become part of the lakehouse’s virtualized data repository, allowing authorized users to access larger, more diverse datasets.

On-premises: An on-premises data center cannot unify distributed data sources to nearly the same degree.

Financial efficiency

Cloud data lakehouse: Building data repositories on the low-cost storage of service providers like Amazon Web Services or Azure eliminates the significant capital expenses associated with on-premises data centers. The scalability of cloud data storage also lets companies pay for the services they use rather than investing in under-utilized capacity.

On-premises: Operational expenses can be higher because companies must maintain physical infrastructure.

Data portability

Cloud data lakehouse: Cloud data lakehouses help companies avoid vendor lock-in because open-source technologies like Iceberg work with any preferred cloud service provider.

On-premises: Data can become trapped in the proprietary formats of traditional data warehouse providers.

Lakehouse file formats and table formats

Cloud data lakehouses use object storage services like Amazon S3 and Azure Blob Storage to manage structured, semi-structured, and unstructured data efficiently. However, these services do not provide the metadata necessary to run enterprise-class analytics. Open file and table formats add the metadata layers that query engines need to find and read data efficiently. Data teams can leverage this robust, battle-tested open-source ecosystem to apply the most appropriate technologies to their data architectures.

Open file formats: Columnar formats like Apache Parquet and ORC specify how the lakehouse writes data to storage. Writing column values sequentially allows for more efficient compression and lets queries find the data they need in one place. Open file formats provide data structure, storage locations, and other metadata that queries can use to avoid unnecessary reads and complete their jobs faster, as illustrated in the sketch below.

Open table formats: Table formats like Apache Iceberg and Delta Lake enable warehouse-like data management functionality like time travel, ACID transactions, and schema enforcement.
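To make the columnar idea concrete, here is a minimal sketch using the open-source pyarrow library; the column names and file path are purely hypothetical. Writing a small table to Parquet and then reading back only two of its columns shows how a query can skip the data it does not need.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sales data; the column names are illustrative only.
orders = pa.table({
    "order_id": [1001, 1002, 1003],
    "region": ["EMEA", "APAC", "AMER"],
    "revenue": [1250.00, 980.50, 2310.75],
})

# Parquet stores each column's values contiguously, which compresses
# well and makes column-level reads possible.
pq.write_table(orders, "orders.parquet", compression="snappy")

# A query that only needs revenue by region reads just those two
# columns instead of scanning entire rows.
subset = pq.read_table("orders.parquet", columns=["region", "revenue"])
print(subset.to_pydict())
```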

Related reading: Azure data lakehouse

Cloud data lakehouse use cases

Cloud data lakehouses use high-performance query engines to streamline analytics use cases by making it easier for data consumers at all levels to access enterprise data.

The least technical users get reliable access to dashboards and reports with high data quality and currency. Behind the scenes, data engineering teams spend less time developing and maintaining the data pipelines that feed these resources.

Analysts can use these query engines either directly or through integrations with the business intelligence apps they already know. By eliminating data silos, lakehouses give analysts access to more datasets from which they can generate richer insights to inform decision-makers.

Data scientists spend less time developing advanced analytics applications like machine learning and artificial intelligence. They can use the lakehouse’s query engine to explore data sources and perform the interactive analysis needed to build their models with less reliance on data teams.

Implementing a cloud data lakehouse expands data access while reducing the mundane demands on a company’s data engineering teams. Besides reducing users’ reliance on engineers, the lakehouse’s combination of open technologies and standards-based query engines streamlines data management workflows.

Related reading: Reduce cloud data warehouse costs

8 data lakehouse tips

The advantages of this analytics architecture impact everyone in the enterprise by making data easier to find, process, and use to drive the business forward. Here are a few tips for making the most out of a cloud data lakehouse.

1. Optimize query performance

Using open technologies to build a lakehouse lets you choose the most appropriate query engines for your analytics and processing workloads. Query engines like Trino or Apache Spark have different strengths. For example, Trino’s massively parallel processing capabilities deliver petabyte-scale performance while also using ANSI-standard SQL for universal accessibility.
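As a rough illustration, the open-source trino Python client can submit ANSI-standard SQL to a Trino cluster. The hostname, catalog, schema, and table names below are placeholders rather than part of any real deployment.

```python
import trino  # pip install trino

# Hypothetical connection details; adjust to your own cluster.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="sales",
)
cur = conn.cursor()

# Plain ANSI SQL, so the same query works from BI tools as well.
cur.execute("""
    SELECT region, sum(revenue) AS total_revenue
    FROM orders
    GROUP BY region
    ORDER BY total_revenue DESC
""")
for region, total in cur.fetchall():
    print(region, total)
```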

2. Decouple compute and storage

The lakehouse’s scaling and efficiency benefits come from how its decentralized architecture decouples your compute and storage infrastructures. Cheap object storage lets you consolidate large volumes of data independently from compute, which can scale up and down with demand.

3. Federate enterprise data sources

While centralizing data maximizes analytics performance, no organization will ever have a single source of truth for all its data. Some data simply can’t be consolidated. A cloud data lakehouse should have connectors that bring these distributed data sources within the platform’s unified interface, thus eliminating data silos and maximizing access.
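The sketch below assumes a Trino deployment with two hypothetical catalogs, an Iceberg catalog over object storage and a PostgreSQL connector to an operational CRM database, and joins them in a single federated query; every identifier is illustrative.

```python
import trino

# Hypothetical cluster and catalog names ("iceberg" and "postgresql").
conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One query spans both sources, so there is no pipeline copying the
# customer table into the lake before analysts can use it.
cur.execute("""
    SELECT c.customer_name, sum(o.revenue) AS lifetime_value
    FROM iceberg.sales.orders AS o
    JOIN postgresql.crm.customers AS c
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_name
""")
print(cur.fetchall())
```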

4. Look for reliability features

Data lakehouses’ critical role in the company’s data-driven decision-making demands reliability. Using table formats that support features like time travel and rollback will improve recoverability from accidental or malicious data events. Query engines with built-in fault tolerance improve the reliability of large batch-processing pipelines.
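For instance, Iceberg tables queried through Trino support time travel and snapshot rollback. The sketch below uses hypothetical connection details, table names, and snapshot id, and the exact syntax and procedure names can vary by engine and version.

```python
import trino

# Hypothetical connection; assumes an Iceberg catalog named "iceberg".
conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="analyst", catalog="iceberg", schema="sales")
cur = conn.cursor()

# Time travel: read the table as it existed before a bad load.
cur.execute("""
    SELECT count(*) FROM orders
    FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC'
""")
print(cur.fetchone())

# Rollback: restore the table to a known-good snapshot
# (the snapshot id is a placeholder).
cur.execute("CALL iceberg.system.rollback_to_snapshot('sales', 'orders', 8954597067493422955)")
cur.fetchall()  # fetch to ensure the statement completes
```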

5. Evaluate security features

Unifying enterprise data sources and expanding access is essential to the business but cannot come at the expense of security. Cloud data lakehouses should offer role- and attribute-based access controls, along with integrations with enterprise authentication systems, so you can enforce granular data access policies.

6. Integrate batch and streaming data pipelines

Data warehouses typically must wait for hourly or overnight batch jobs to update their datasets. Data lakehouses go beyond traditional ETL data pipelines to enable real-time data ingestion from operational and transactional systems. Starburst Galaxy, for example, can ingest data from a Kafka-compliant topic into the data lakehouse, letting companies process and analyze data in near-real-time within a single system.
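Managed platforms such as Starburst Galaxy handle that ingestion natively. Purely as a generic illustration of the pattern, the sketch below uses the kafka-python and pyarrow libraries to land micro-batches of topic events as Parquet files that a lakehouse table could be defined over; the topic, broker, and file naming are hypothetical.

```python
import json
import time
import pyarrow as pa
import pyarrow.parquet as pq
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; JSON-encoded order events are assumed.
consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="kafka.example.com:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:  # micro-batch threshold
        # Land the batch as a Parquet file under a path the
        # lakehouse table points at (local path used here for brevity).
        table = pa.Table.from_pylist(batch)
        pq.write_table(table, f"orders_{int(time.time())}.parquet")
        batch.clear()
```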

7. Data lakehouse governance

Going beyond cybersecurity defense, lakehouses can become the primary method for enforcing internal and external data governance policies. For instance, fine-grained access controls and rich metadata simplify the implementation of regional data privacy regulations in a company’s globally-accessible data infrastructure. Starburst’s governance layer, Starburst Gravity, can work across regions and clusters to ensure appropriate access.

8. Performance and optimization for the data lakehouse

Data lakehouses include performance and management features most often associated with data warehousing. Automating common tasks like compaction, retention, and vacuuming further alleviates the data team’s daily workloads. Query engines like Trino support automated query planning, pushdown queries, dynamic filters, and other performance-enhancing features. Starburst’s data lakehouse platform provides cached views and workload acceleration for further performance optimization.
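As one hedged example, those maintenance tasks can be scripted against an Iceberg table through Trino table procedures; the procedure names and parameters below depend on the connector version, and every identifier is a placeholder.

```python
import trino

# Hypothetical connection; assumes Trino's Iceberg connector.
conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="data_engineer", catalog="iceberg", schema="sales")
cur = conn.cursor()

# Compaction: rewrite many small files into fewer, larger ones.
cur.execute("ALTER TABLE orders EXECUTE optimize")
cur.fetchall()

# Retention: expire snapshots older than the retention window.
cur.execute("ALTER TABLE orders EXECUTE expire_snapshots(retention_threshold => '7d')")
cur.fetchall()

# Vacuuming: remove files no longer referenced by any snapshot.
cur.execute("ALTER TABLE orders EXECUTE remove_orphan_files(retention_threshold => '7d')")
cur.fetchall()
```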

Terminology related to cloud data lakehouse

The world of enterprise data has more than its share of buzzwords. Here’s a quick comparison of the cloud data lakehouse concept to other popular data concepts.

Data lakehouse vs data mart

Data marts are curated data warehouses, subsets of the enterprise data warehouse, built around a business unit’s needs. Although simpler for business users, they add to the data team’s maintenance workloads and create risk through data duplication.

A better approach is to build data products on top of a data lakehouse. Easier-to-maintain ELT pipelines transform the raw data on demand to fulfill the business unit’s requirements.
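A minimal sketch of that pattern, assuming a Trino cluster, a catalog that supports views, and hypothetical schema and table names: publish a curated view over the raw tables instead of copying data into a separate mart.

```python
import trino

# Hypothetical connection, catalog, and table names.
conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="data_engineer", catalog="iceberg", schema="finance")
cur = conn.cursor()

# Consumers query the view like any other table, and the logic lives
# in one place rather than in a duplicated data mart.
cur.execute("""
    CREATE OR REPLACE VIEW monthly_revenue AS
    SELECT date_trunc('month', order_date) AS month,
           region,
           sum(revenue) AS revenue
    FROM sales.orders_raw
    GROUP BY 1, 2
""")
cur.fetchall()  # fetch to ensure the statement completes
```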

Data lakehouse vs data modeling

A significant drawback of the data warehouse paradigm is the need for structure. Architects must model everything about the data — its sources, transformations, storage, and use — before creating the warehouse. Later changes to schema become difficult and expensive.

Since data lakehouses store raw data, there’s no need for upfront data modeling. Each new business application and data science project can model data uniquely without impacting other data users.

Data lakehouse vs data virtualization

Data virtualization is part of what distinguishes a data lakehouse from a data lake. The latter can store huge amounts of data but cannot contain data from every source. A cloud data lakehouse platform like Starburst has dozens of connectors that virtualize enterprise data sources, making them all accessible from a single point of access.

Related reading: Data virtualization will become a core component of data lakehouses

Data lakehouse vs data fabric

Data fabrics leverage multiple AI-mediated data virtualization and ingestion techniques to create a unified data architecture over a data warehouse or data lake that operational or analytical systems can then use. A modern data lakehouse solution like Starburst provides this virtualized integration within its advanced analytics platform.

Related reading: Data mesh vs data fabric