Combining AWS services with Apache Iceberg tables lets companies build powerful, cost-effective data lakes

This guide covers the benefits of building AWS data lakes for big data analytics and how to integrate this architecture with Starburst’s open data lakehouse analytics solution.
Cindy Ng, Sr. Manager, Content, Starburst

Why choose AWS for data lakes and analytics?

In many respects, AWS is a one-stop shop for building an enterprise-class big data analytics architecture. Companies can combine AWS services and AWS Partner-provided services to create a data lake optimized for their specific business context.

Six benefits of building a data lake on AWS are:

1. Data portability and optionality

AWS supports standards-based file formats, including ORC and Parquet. Companies can build their data lake while retaining the ability to move their data to another service.

2. Scalability and resilience

Amazon Simple Storage Service (Amazon S3) is a massive, globally distributed object storage platform that scales with customer demand. Each AWS Region has at least three Availability Zones, each made up of one or more data centers, which together deliver extreme durability.

3. AWS Performance

As with Amazon S3’s storage scalability, Amazon Elastic Compute Cloud (Amazon EC2) scales dynamically to meet a company’s compute workloads. Amazon EC2 lets you choose instances optimized for compute-intensive or memory-intensive operations as well as complex deep learning projects.

4. Amazon Security

Few companies can match Amazon’s security investment. Besides fully encrypting customer data at rest and in transit, Amazon’s machine learning-powered security systems automatically monitor for and report any unusual access activity.

5. S3 Affordability

Amazon S3 can reduce its customers’ storage costs through the S3 Intelligent-Tiering storage class, which automatically moves objects between access tiers based on access patterns.

6. Advanced services

Companies can leverage the machine learning APIs and services Amazon developed for its own e-commerce and supply chain operations to build advanced AI-powered applications.
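
To make the storage-tiering point in benefit 5 more concrete, here is a minimal sketch of a lifecycle rule that moves objects into S3's Intelligent-Tiering storage class. It only builds the configuration as a Python dictionary. The helper name, the rule ID, the raw/ prefix, and the bucket name are hypothetical, and applying the rule would require the boto3 package and AWS credentials, so that call appears only as a comment.

```python
import json


def intelligent_tiering_rule(prefix: str, rule_id: str = "tier-by-access") -> dict:
    """Build a lifecycle configuration that transitions objects under
    `prefix` to the INTELLIGENT_TIERING storage class."""
    return {
        "Rules": [
            {
                "ID": rule_id,                      # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},       # only objects under this prefix
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


config = intelligent_tiering_rule("raw/")
print(json.dumps(config, indent=2))

# Applying the rule (assumes boto3 is installed and credentials are configured):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```

Scoping the rule to a prefix is a design choice for illustration only; a real data lake might tier the whole bucket or use different rules per zone.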

Build your data lakehouse (Open table formats, native security, reporting structure)

Amazon S3 + Starburst query engine = data lakehouse analytics

Open table formats: Hive, Iceberg, Delta Lake, Hudi

The most compelling reason for building an AWS data lake is the ability it gives companies to leverage the principles of open source. Rather than getting locked into a traditional data warehouse vendor’s proprietary data format, companies can use open table formats like Hive, Iceberg, Delta Lake, and Hudi. These table formats add layers of enhanced metadata that compute engines need to deliver warehouse-like query performance.

Apache Hive

Hive was developed to make analytics on Hadoop-based architectures more accessible. The open-source project combines a table format and a centralized metadata service (the Hive Metastore) with a query translation system that converts SQL statements into Hadoop’s less accessible MapReduce programming model.

Related reading: Hive vs Iceberg: Migrate your Hive tables to Iceberg

Apache Iceberg

Now an Apache open-source project, the Iceberg table format originated at Netflix, where Hive’s performance and management challenges couldn’t keep up with the streaming service’s exponentially growing datasets. Iceberg makes queries faster and more efficient through features like snapshot isolation, partitioning, and schema evolution. In addition, Iceberg tables enable use cases like time travel and ACID transactions.
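
The snapshot and time travel ideas can be illustrated with a toy model. The sketch below is not the Iceberg API or its on-disk format; the class names, file names, and integer timestamps are all hypothetical. It only shows the core idea that every commit produces a new immutable snapshot, so a reader can ask for the table as it looked at an earlier point in time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int   # commit time as a simple integer tick
    data_files: tuple   # immutable set of data files visible in this snapshot


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def commit(self, committed_at, added=(), removed=()):
        """Append a new snapshot; earlier snapshots are never modified.
        Assumes commits arrive in increasing time order."""
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snapshot = Snapshot(len(self.snapshots) + 1, committed_at, files)
        self.snapshots.append(snapshot)
        return snapshot

    def scan(self, as_of=None):
        """Return the files visible at time `as_of` (latest snapshot if omitted)."""
        if as_of is None:
            return self.snapshots[-1].data_files if self.snapshots else ()
        visible = [s for s in self.snapshots if s.committed_at <= as_of]
        return visible[-1].data_files if visible else ()


table = ToyTable()
table.commit(100, added=["a.parquet"])
table.commit(200, added=["b.parquet"])
table.commit(300, added=["c.parquet"], removed=["a.parquet"])

print(table.scan())            # current state of the table
print(table.scan(as_of=250))   # "time travel" to the state after the second commit
```

Because readers always work from one complete snapshot, a query that starts before the third commit keeps seeing a consistent view while the writer finishes.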

Delta Lake

Not part of the Apache Software Foundation, Delta Lake began as the proprietary table format of the Databricks analytics platform. It is now community-developed, although largely by Databricks contributors. Delta Lake delivers modern features, including ACID transactions, time travel, and schema evolution. However, Delta Lake only supports the Parquet open data file format, which may limit its usefulness for companies that rely on other file formats.

Apache Hudi

Uber developed Hudi to add warehouse-style analytics to its Hadoop-based architecture. Now an open-source project, Apache Hudi combines incremental processing, ACID compliance, and a merge-on-read metadata table to bring near-real-time analytics to a data lake.

Related reading: Iceberg vs Hive vs Hudi vs Delta Lake

| Feature / Specification | Apache Iceberg | Delta Lake | Apache Hudi | Apache Hive |
| --- | --- | --- | --- | --- |
| Transaction support (ACID) | Yes | Yes | Yes | Limited |
| File format | Parquet, ORC, Avro | Parquet | Parquet, ORC, Avro | Parquet, ORC, Avro, and more |
| Schema evolution | Full | Partial | Full | Partial |
| Partition evolution | Yes | No | No | No |
| Data versioning | Yes | Yes | Yes | No |
| Time travel queries | Yes | Yes | Yes | No |
| Concurrency control | Optimistic locking | Optimistic locking | Optimistic locking | Pessimistic locking |
| Object store cost optimization | Yes | Yes | Yes | No |
| Community and ecosystem | Growing | Growing | Growing | Established |

Open compute engines: Hive, Trino, Spark

The third element of a modern data lake, in addition to open file and table formats, is the open compute engine. Unlike the original Hadoop framework, these systems are optimized for processing large datasets in massively parallelized cloud environments. Frameworks like Hive, Trino, and Spark provide robust data management capabilities while also making data warehouse-like analytics more accessible on a data lake.

Apache Hive

As mentioned earlier, Hive combines analytics capabilities with an open table format. Hadoop’s MapReduce framework relies on a Java-based programming model that few analysts or data scientists know how to use. Hive’s query language, HiveQL, gives users an SQL-like interface for writing queries. The Hive runtime converts these SQL statements into MapReduce jobs and returns the results. Although Hive simplifies Hadoop analytics, it does so at the expense of higher compute overhead and increased latency.
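
As a rough illustration of that translation, the sketch below hand-writes the map, shuffle, and reduce steps that correspond to a query like SELECT region, COUNT(*) FROM orders GROUP BY region. It is a single-process toy, not what Hive actually generates; the orders rows and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows from an "orders" table.
orders = [
    {"region": "emea", "amount": 10},
    {"region": "apac", "amount": 25},
    {"region": "emea", "amount": 7},
]


def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["region"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into one count."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_region(rows):
    return reduce_phase(shuffle(map_phase(rows)))


print(count_by_region(orders))
```

In a real cluster each phase runs across many machines and writes intermediate results to disk, which is where much of the overhead and latency mentioned above comes from.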

Trino

A fork of Presto, Trino is a massively parallel query engine that can virtualize a company’s storage infrastructure. Not limited to Hadoop, Trino offers more than fifty connectors to federate enterprise data sources into a single access layer. Thanks to its use of ANSI-standard SQL and integrations with traditional business intelligence tools like Tableau and Power BI, Trino helps to democratize data access.

Apache Spark

Apache Spark emerged from an academic project to enhance big data analytics on Hadoop architectures. Spark replaced MapReduce’s disk-based data processing with an in-memory approach that boosts performance while reducing compute expenses. Spark excels at reliable processing and transformations of data, particularly when used with machine learning.

Related reading: Trino vs Spark

What AWS services support Iceberg?

Building enterprise cloud compute infrastructure on AWS affords many benefits, not least of which is minimizing data movement in and out of the company’s AWS account. After ingestion into Amazon S3’s object storage, companies can process, analyze, and archive data without incurring expensive export fees.

An AWS data lake can take advantage of Iceberg’s performance and efficiency savings with an analytics platform consisting of Amazon Athena, Amazon EMR Trino, and AWS Glue.

Amazon Athena (serverless)

AWS built Amazon Athena on the Presto query engine so that customers can run ad hoc analytics with standard SQL on data stored in Amazon S3. Because Athena is serverless, a pay-as-you-go model replaces costly infrastructure management overhead. Besides querying an AWS data lake, Athena can query data stored in Amazon Redshift data warehouses.

Related reading: How To Migrate Queries From Amazon Athena To Starburst Galaxy

Amazon EMR Trino

Amazon EMR lets companies build data lake analytics platforms that combine an AWS-native data management framework with an open compute engine like Trino. EMR handles all aspects of cluster management, from node provisioning to cluster tuning, letting data teams launch Trino clusters quickly. Interactive Trino queries can use EMR’s scaling and optimization features to run large, complex queries.

AWS Glue

AWS Glue is a serverless data integration service whose Data Catalog acts as Amazon Athena’s primary data catalog. The AWS Glue Data Catalog is a searchable metadata store for the datasets in your S3 buckets. In addition to powering Athena queries, AWS Glue lets engineers build extract, transform, and load (ETL) or extract, load, and transform (ELT) data pipelines for data preparation.

Related reading: Using Apache Iceberg, AWS S3, and AWS Glue to manage a data lakehouse architecture

How do you integrate Apache Iceberg with AWS services?

The Spark and Flink runtimes include the iceberg-aws module, but not the AWS clients themselves. Iceberg depends on the AWS SDK for Java v2, which you must provide, either with the AWS SDK bundle or with individual AWS client packages such as Glue or DynamoDB.

Iceberg supports catalogs based on AWS Glue, DynamoDB, or RDS JDBC. Pairing Iceberg with Glue provides the most straightforward integration with AWS’s big data analytics services. Glue also integrates with IAM resources for better access control.
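
As a sketch of what pairing Iceberg with Glue can look like for Spark, the helper below assembles the catalog properties documented for the iceberg-aws module and renders them as spark-submit flags. The catalog name, warehouse path, and helper names are hypothetical; the property values assume the Iceberg Spark runtime and the AWS SDK are already on the classpath, so check them against the Iceberg version you run.

```python
def glue_catalog_conf(catalog: str, warehouse: str) -> dict:
    """Spark properties that register an Iceberg catalog backed by AWS Glue,
    with table data and metadata stored in S3."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def as_spark_submit_flags(conf: dict) -> list:
    """Render the properties as `--conf key=value` arguments."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


# Hypothetical catalog name and bucket path.
conf = glue_catalog_conf("lake", "s3://example-bucket/warehouse")
for flag in as_spark_submit_flags(conf):
    print(flag)
```

With a catalog registered this way, tables would be addressed in Spark SQL as lake.database.table, and Glue would hold the pointer to each table's current metadata file.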

Amazon Athena, AWS Glue, Amazon S3 and Iceberg Demo

Iceberg’s performance and efficiency advantages let enterprises build modern data lakehouses that deliver the analytics capabilities of a data warehouse with open technologies that support structured and unstructured data types. More capable than Hive in today’s dynamic data environment, Iceberg has seen rapid adoption across industries.

In this video, Starburst Director of Customer Solutions Tom Nats gives a quick tutorial on how Iceberg, Amazon S3, AWS Glue, Amazon Athena, and Starburst’s enhanced Trino implementation work together, combining the best aspects of each technology. Tom uses Starburst Galaxy to create tables based on Iceberg and Parquet within an S3 bucket.

He then walks through how Iceberg creates new JSON, stats, and Avro metadata files for each snapshot and how Glue collects metadata for the Iceberg table, along with key-value pairs for the current and previous snapshot metadata locations.
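
Those key-value pairs can be inspected programmatically. The sketch below reads them from a dictionary shaped like the response of Glue's GetTable call. The sample response, the S3 paths, and the helper name are made up for illustration; the parameter names table_type, metadata_location, and previous_metadata_location are the ones Iceberg's Glue catalog is understood to write, which is worth confirming against your own tables.

```python
def iceberg_pointers(glue_table: dict) -> dict:
    """Extract the current and previous Iceberg metadata file locations
    from a Glue GetTable-style response."""
    params = glue_table.get("Table", {}).get("Parameters", {})
    if params.get("table_type", "").upper() != "ICEBERG":
        raise ValueError("table is not registered as an Iceberg table")
    return {
        "current": params.get("metadata_location"),
        "previous": params.get("previous_metadata_location"),
    }


# Made-up response for illustration. A real one would come from, e.g.:
#   boto3.client("glue").get_table(DatabaseName="demo", Name="orders")
sample_response = {
    "Table": {
        "Name": "orders",
        "DatabaseName": "demo",
        "Parameters": {
            "table_type": "ICEBERG",
            "metadata_location": "s3://example-bucket/demo/orders/metadata/00002-bbbb.metadata.json",
            "previous_metadata_location": "s3://example-bucket/demo/orders/metadata/00001-aaaa.metadata.json",
        },
    }
}

pointers = iceberg_pointers(sample_response)
print(pointers["current"])
print(pointers["previous"])
```

Each commit swaps the current pointer to a new metadata file, which is how any engine that reads the Glue entry finds the latest snapshot.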

Tom concludes by showing how the data lake’s open structure doesn’t lock you into one technology — not even Starburst — by walking through how to access the table through Apache Spark.

Building a data lakehouse, Iceberg architecture, and data products

Starburst Galaxy is a modern data lakehouse analytics platform based on the open-source Trino query engine that federates enterprise data within a single point of access — without data movement. Galaxy’s unified interface and support for ANSI-standard SQL make data products accessible to the least technical users while letting more adept data consumers create complex queries with the SQL they already know. Democratizing access through Galaxy won’t compromise compliance, as role-based and attribute-based access controls let you enforce granular security and governance rules down to the row and column levels.

As an AWS Specialization Partner and AWS Marketplace Seller, Starburst is increasingly becoming a cornerstone of AWS data lakes. Companies leverage Starburst’s capabilities to expand data access, improve efficiency, and accelerate time to insight in their big data analytics.

One way Starburst customers combine AWS services, an Iceberg architecture, and Starburst’s analytics platform is by building data products — curated, reusable datasets targeting specific business problems. A product-based approach to data turns engineers into facilitators who operationalize the objectives of subject-matter experts who understand the data and the domain’s business priorities. Data products also provide usage metrics, data source lineage, and other information that increases users’ confidence in the product’s relevance, accuracy, and currency.

By giving users access to data stored across your AWS infrastructure, Starburst data products help them generate insights faster and make more informed, data-driven decisions.