Why choose AWS for data lakes and analytics?
In many respects, AWS is a one-stop shop for building an enterprise-class big data analytics architecture. Companies can select the combination of AWS and AWS Partner-provided services to create a data lake optimized for their specific business context.
Six benefits of an AWS data lake are:
1. Data portability and optionality
AWS supports standards-based file formats, including ORC and Parquet. Companies can build their data lake while retaining the ability to move their data to another service.
2. Scalability and resilience
Amazon Simple Storage Service (Amazon S3) is a massive, globally distributed object storage platform that scales instantly with customer demand. Each AWS Region comprises at least three Availability Zones, each with one or more data centers, and S3 is designed for 99.999999999% (11 nines) of data durability.
3. AWS Performance
As with Amazon S3’s storage scalability, Amazon Elastic Compute Cloud (Amazon EC2) scales dynamically to meet any company’s compute workloads. Amazon EC2 lets you choose instances optimized for compute- or memory-intensive operations as well as complex deep learning projects.
4. Amazon Security
Few companies can match Amazon’s security investment. Besides fully encrypting customer data at rest and in transit, Amazon’s machine learning-powered security systems automatically monitor for and report any unusual access activity.
5. S3 Affordability
Through features like S3 Intelligent-Tiering, Amazon S3 optimizes its customers’ storage costs by automatically moving data to lower-cost storage tiers based on access patterns.
6. Advanced services
Companies can leverage the machine learning APIs and services Amazon developed for its own e-commerce and supply chain operations to build advanced AI-powered applications.
Build your data lakehouse (Open table formats, native security, reporting structure)
AWS S3 + Starburst query engine = data lakehouse analytics
Open table formats: Hive, Iceberg, Delta Lake, Hudi
The most compelling reason for building an AWS data lake is the ability it gives companies to leverage the principles of open source. Rather than getting locked into a traditional data warehouse vendor’s proprietary data format, companies can use open table formats like Hive, Iceberg, Delta Lake, and Hudi. These table formats add layers of enhanced metadata that compute engines need to deliver warehouse-like query performance.
Apache Hive
Hive was developed to make analytics on Hadoop-based architectures more accessible. The open-source project combines a table format and a centralized metadata repository (the Hive Metastore) with a query translation system that converts SQL statements into Hadoop’s less accessible MapReduce programming model.
Related reading: Hive vs Iceberg: Migrate your Hive tables to Iceberg
Apache Iceberg
Now an Apache open-source project, the Iceberg table format originated at Netflix, where Hive’s performance and management limitations could not keep up with the streaming service’s exponentially growing datasets. Iceberg makes queries faster and more efficient through features like snapshot isolation, hidden partitioning, and schema evolution. In addition, Iceberg tables enable use cases like time travel and ACID transactions.
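As a minimal sketch of these features, the PySpark snippet below assumes a SparkSession already configured with an Iceberg catalog named "lake" (the catalog, schema, table, and column names are hypothetical); it evolves the schema, inspects snapshots, and runs a time-travel query.

```python
# Minimal sketch of Iceberg's schema evolution, snapshots, and time travel.
# Assumes the Iceberg Spark runtime is on the classpath and a catalog named
# "lake" is configured; catalog, schema, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-features").getOrCreate()

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMNS (discount double)")

# Every commit creates an isolated snapshot recorded in a metadata table.
spark.sql(
    "SELECT committed_at, snapshot_id, operation FROM lake.sales.orders.snapshots"
).show()

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT * FROM lake.sales.orders TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```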
Delta Lake
Delta Lake began as the proprietary table format of the Databricks analytics platform. Now an open-source project under the Linux Foundation rather than the Apache Software Foundation, it is community-developed, although largely by Databricks contributors. Delta Lake delivers modern features, including ACID transactions, time travel, and schema evolution. However, Delta Lake supports only the Parquet open data file format, somewhat limiting its usefulness for some companies.
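As a rough illustration, the PySpark sketch below assumes the delta-spark package is installed and the session is configured with Delta’s extensions; the S3 path and data are hypothetical.

```python
# Minimal sketch of Delta Lake's Parquet-backed tables and time travel.
# Assumes a Delta-enabled Spark session (delta-spark package); the bucket
# path and data are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

df = spark.createDataFrame([(1, "laptop"), (2, "monitor")], ["id", "product"])

# Writes Parquet data files plus a _delta_log transaction log (ACID commits).
df.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/products")

# Time travel: read the table as it existed at an earlier version.
spark.read.format("delta").option("versionAsOf", 0) \
    .load("s3://my-bucket/delta/products").show()
```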
Apache Hudi
Uber developed Hudi to add warehouse-style analytics to its Hadoop-based architecture. Now an open-source project, Apache Hudi combines incremental processing, ACID compliance, and merge-on-read storage to bring near-real-time analytics to a data lake.
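The hedged PySpark sketch below illustrates the idea: it upserts records into a hypothetical merge-on-read Hudi table, assuming the Hudi Spark bundle is on the classpath (table name, fields, and path are made up).

```python
# Minimal sketch of writing a merge-on-read Hudi table with upserts.
# Assumes the Hudi Spark bundle is available; table name, fields, and
# S3 path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-example").getOrCreate()

trips = spark.createDataFrame(
    [("t-1", "2024-05-01T10:00:00", 12.5)],
    ["trip_id", "event_time", "fare"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "event_time",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
}

# Upserts are merged incrementally rather than rewriting whole partitions.
trips.write.format("hudi").options(**hudi_options) \
    .mode("append").save("s3://my-bucket/hudi/trips")
```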
Related reading: Iceberg vs Hive vs Hudi vs Delta Lake
| Feature / Specification | Apache Iceberg | Delta Lake | Apache Hudi | Apache Hive |
|---|---|---|---|---|
| Transaction support (ACID) | Yes | Yes | Yes | Limited |
| File format | Parquet, ORC, Avro | Parquet | Parquet, ORC, Avro | Parquet, ORC, Avro, and more |
| Schema evolution | Full | Partial | Full | Partial |
| Partition evolution | Yes | No | No | No |
| Data versioning | Yes | Yes | Yes | No |
| Time travel queries | Yes | Yes | Yes | No |
| Concurrency control | Optimistic locking | Optimistic locking | Optimistic locking | Pessimistic locking |
| Object store cost optimization | Yes | Yes | Yes | No |
| Community and ecosystem | Growing | Growing | Growing | Established |
Open compute engines: Hive, Trino, Spark
The third element of a modern data lake, in addition to open file and table formats, is the open compute engine. Unlike the original Hadoop framework, these systems are optimized for processing large datasets in massively parallelized cloud environments. Frameworks like Hive, Trino, and Spark provide robust data management capabilities while also making data warehouse-like analytics more accessible on a data lake.
Apache Hive
As mentioned earlier, Hive combines analytics capabilities with an open table format. Hadoop’s MapReduce processing model relies on Java programs that few analysts or data scientists know how to write. Hive’s query language, HiveQL, gives users an SQL-like interface for writing queries. The Hive runtime converts these SQL statements into MapReduce jobs and returns the results. Although Hive simplifies Hadoop analytics, it does so at the expense of higher compute overhead and increased latency.
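For example, a HiveQL query can be submitted from Python; the sketch below assumes the PyHive package and a reachable HiveServer2 endpoint (host, credentials, and table names are hypothetical).

```python
# Minimal sketch of running a HiveQL query; Hive compiles the SQL-like
# statement into MapReduce (or Tez) jobs behind the scenes. Assumes the
# PyHive package and a HiveServer2 endpoint; all names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

cursor.execute("""
    SELECT region, count(*) AS orders
    FROM sales.orders
    GROUP BY region
""")

for region, orders in cursor.fetchall():
    print(region, orders)
```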
Related reading: Hive vs Iceberg: Migrate your Hive tables to Iceberg
Trino
A fork of Presto, Trino is a massively parallel query engine that can virtualize a company’s storage infrastructure. Not limited to Hadoop, Trino offers more than fifty connectors to federate enterprise data sources into a single access layer. Thanks to its use of ANSI-standard SQL and integrations with traditional business intelligence tools like Tableau and Power BI, Trino helps to democratize data access.
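A rough sketch of that federation, using the trino Python client (the cluster host, catalogs, schemas, and tables are hypothetical): a single ANSI SQL statement joins a Hive-backed lake table with a PostgreSQL table.

```python
# Minimal sketch of a federated Trino query joining two catalogs.
# Assumes the "trino" Python client package; the host, catalogs,
# schemas, and tables are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",
    port=443,
    user="analyst",
    http_scheme="https",
    catalog="hive",
    schema="sales",
)
cursor = conn.cursor()

# One ANSI SQL statement spans the data lake (hive) and an operational
# database (postgresql) without moving the data first.
cursor.execute("""
    SELECT c.region, sum(o.total) AS revenue
    FROM hive.sales.orders AS o
    JOIN postgresql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")

for row in cursor.fetchall():
    print(row)
```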
Apache Spark
Apache Spark emerged from an academic project to enhance big data analytics on Hadoop architectures. Spark replaced MapReduce’s disk-based data processing with an in-memory approach that boosts performance while reducing compute expenses, and Spark SQL exposes that engine through a familiar SQL interface. Spark excels at reliable processing and transformation of data, particularly when used for machine learning.
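A minimal PySpark sketch of that pattern (bucket paths and column names are hypothetical): the working set is cached in memory once and reused for a Spark SQL aggregation.

```python
# Minimal sketch of Spark's in-memory processing with a Spark SQL query.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

events = spark.read.parquet("s3://my-bucket/raw/events/")
events.cache()  # keep the working set in memory across repeated actions

events.createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT date(event_time) AS day, count(*) AS event_count
    FROM events
    GROUP BY date(event_time)
    ORDER BY day
""")

daily.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_events/")
```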
Related reading: Trino vs Spark
What AWS services support Iceberg?
Building enterprise cloud compute infrastructure on AWS affords many benefits, not least of which is minimizing data movement in and out of the company’s AWS account. After ingestion into Amazon S3’s object storage, companies can process, analyze, and archive data without incurring expensive export fees.
An AWS data lake can take advantage of Iceberg’s performance and efficiency savings with an analytics platform consisting of Amazon Athena, Amazon EMR Trino, and AWS Glue.
Amazon Athena (serverless)
AWS created Amazon Athena by forking the Presto query engine to let customers perform ad-hoc analytics with standard SQL on data stored in Amazon S3. The serverless service replaces costly infrastructure management overhead with a pay-as-you-go model. Besides querying an AWS data lake, Athena can use federated query connectors to reach data stored in Amazon Redshift data warehouses.
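As a sketch of that pay-as-you-go model, the boto3 snippet below submits an ad-hoc SQL query against a hypothetical Glue database and results bucket.

```python
# Minimal sketch of an ad-hoc Athena query via boto3. The database,
# table, region, and results bucket are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT region, count(*) AS orders FROM orders GROUP BY region",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query completes; you pay only for the data scanned.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```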
Related reading: How To Migrate Queries From Amazon Athena To Starburst Galaxy
Amazon EMR Trino
Amazon EMR lets companies build data lake analytics platforms that combine an AWS-native data management framework with an open compute engine like Trino. EMR handles all aspects of cluster management, from node provisioning to cluster tuning, letting data teams launch Trino clusters quickly. Interactive Trino queries can use EMR’s scaling and optimization features to run large, complex queries.
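For illustration, the boto3 sketch below launches a small EMR cluster with Trino installed; the release label, instance types, and IAM role names are assumptions and will vary by account.

```python
# Minimal sketch of launching an EMR cluster with Trino via boto3.
# Release label, instance types, and IAM roles are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

cluster = emr.run_job_flow(
    Name="trino-analytics",
    ReleaseLabel="emr-6.15.0",          # any Trino-capable EMR 6.x release
    Applications=[{"Name": "Trino"}],
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "workers", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ServiceRole="EMR_DefaultRole",       # assumed IAM role names
    JobFlowRole="EMR_EC2_DefaultRole",
)

print("Cluster ID:", cluster["JobFlowId"])
```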
AWS Glue
AWS Glue is a serverless data integration service whose Data Catalog acts as Amazon Athena’s primary metadata store. The AWS Glue Data Catalog is a searchable repository of metadata for the data assets in your S3 buckets. In addition to powering Athena queries, AWS Glue lets engineers build extract, transform, and load (ETL) or extract, load, and transform (ELT) pipelines for data preparation.
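A quick sketch of browsing that catalog with boto3 (the database name and region are hypothetical):

```python
# Minimal sketch of listing tables in the AWS Glue Data Catalog.
# The database name and region are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Each table entry records the schema and the S3 location engines like
# Athena query against.
for table in glue.get_tables(DatabaseName="sales")["TableList"]:
    print(table["Name"], "->", table["StorageDescriptor"]["Location"])
```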
Related reading: Using Apache Iceberg, AWS S3, and AWS Glue to manage a data lakehouse architecture
How do you integrate Apache Iceberg with AWS services?
Iceberg depends on the AWS SDK for Java v2, which you must provide yourself, either as the full AWS SDK bundle or as individual AWS client packages such as those for Glue or DynamoDB. The Spark and Flink engine runtimes already include the iceberg-aws module.
Iceberg supports catalogs backed by AWS Glue, DynamoDB, or a JDBC database such as Amazon RDS. Pairing Iceberg with Glue provides the most straightforward integration with AWS’s big data analytics services, and Glue also integrates with IAM for finer-grained access control.
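A hedged configuration sketch in PySpark, assuming the Iceberg Spark runtime and AWS SDK v2 bundle are supplied at submit time (for example via --packages); the catalog name and warehouse bucket are hypothetical.

```python
# Minimal sketch of registering an Iceberg catalog backed by AWS Glue.
# Assumes the Iceberg Spark runtime and AWS SDK v2 bundle are supplied
# at submit time; the catalog name and bucket are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-glue")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Tables created here appear in the Glue Data Catalog and are queryable
# from Athena and EMR as well.
spark.sql(
    "CREATE TABLE IF NOT EXISTS glue_catalog.sales.orders "
    "(id bigint, total double) USING iceberg"
)
```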
Amazon Athena, AWS Glue, Amazon S3 and Iceberg Demo
Iceberg’s performance and efficiency advantages let enterprises build modern data lakehouses that deliver the analytics capabilities of a data warehouse with open technologies that support structured and unstructured data types. More capable than Hive in today’s dynamic data environment, Iceberg has seen rapid adoption across industries.
In this video, Starburst Director of Customer Solutions Tom Nats does a quick tutorial on how Iceberg, Amazon S3, AWS Glue, Amazon Athena, and Starburst’s enhanced Trino implementation combine the best aspects of each technology. Tom uses Starburst Galaxy to create tables based on Iceberg and Parquet within an S3 bucket.
He then walks through how Iceberg creates new JSON, stats, and Avro metadata files for each snapshot and how Glue collects metadata for the Iceberg table, along with key-value pairs for the current and most recent snapshot metadata locations.
Tom concludes by showing how the data lake’s open structure doesn’t lock you into one technology — not even Starburst — by walking through how to access the table through Apache Spark.
Building a data lakehouse, Iceberg architecture, and data products
Starburst Galaxy is a modern data lakehouse analytics platform based on the open-source Trino query engine that federates enterprise data within a single point of access — without data movement. Galaxy’s unified interface and support for ANSI-standard SQL make data products accessible to the least technical users while letting more adept data consumers create complex queries with the SQL they already know. Democratizing access through Galaxy won’t compromise compliance, as role-based and attribute-based access controls let you enforce granular security and governance rules down to the row and column levels.
As an AWS Specialization Partner and AWS Marketplace Seller, Starburst is increasingly becoming a cornerstone of AWS data lakes. Companies leverage Starburst’s capabilities to expand data access, improve efficiency, and accelerate time to insight in their big data analytics.
One way Starburst customers combine AWS services, an Iceberg architecture, and Starburst’s analytics platform is by building data products — curated, reusable datasets targeting specific business problems. A product-based approach to data turns engineers into facilitators who operationalize the objectives of subject-matter experts who understand the data and the domain’s business priorities. Data products also provide usage metrics, data source lineage, and other information that increases users’ confidence in the product’s relevance, accuracy, and currency.
By accessing data stored across your AWS infrastructure, Starburst data products let users generate insights faster and make more informed, data-driven decisions.
What are some next steps you can take?
Below are three ways you can continue your journey to accelerate data access at your company:
1.
2. Automate the Icehouse: Our fully-managed open lakehouse platform
3. Follow us on YouTube, LinkedIn, and X (Twitter).