Open Table Formats

An open table format provides a layer of abstraction on top of a data lake. This allows data to be managed and optimized more efficiently. At the same time, the increased structure allows for additional features.

To begin to understand open table formats, you begin your journey with Apache Hive. Apache Hive was the first data lake table format. However, this format had a number of functionality limitations owing to its architecture. In recent years, new table formats have pushed data lakes in new directions, increasing both functionality and performance.

Importantly, these table formats allow data lakes to achieve some of the efficiencies and compliance standards more typically associated with data warehouses or databases, while retaining the versatility of a data lake.

This includes enhanced ACID compliance, the ability to record transactional data efficiently, improved scalability, and the ability to update or delete records. These advancements are so significant that it is often said that data lakes using these technologies become more like data lakehouses, mixing the versatility of data lakes to handle raw and semi-structured data with the ability to process transactional workloads like a data warehouse.

Let’s explore how newer table formats have improved the versatility of data lakes by adding additional functionality. Particular attention will be paid to improvements in transactional data collection within data lakes.

Features of modern open table formats

While each table format varies, they all extend the features of a data lake in similar ways.

These include:

Full CRUD operations

Data lakes use either HDFS or object storage. Both hold data in an  immutable format. These have not typically provided an easy way to update files incrementally. If you consider that a database typically includes the ability to Create, Read, Update, or Delete (CRUD), a data lake has often only included the first two. Modern table formats help fix this by allowing the ability to update and delete records.

Improved performance and scalability

Data lakes tend to easily grow in size, and many are very, very large. With this, comes the need to scale their analytic capabilities to match. Newer table formats allow increased scalability when compared to Hive by introducing a new way of recording data at the file level. This is a marked improvement over Hive’s record keeping approach, which organized data by folder. This means that if a query requires data on a specific subset of data, it can search the specific files containing the information rather than searching the whole folder. This vastly improves performance and efficiency.

Transactional support and ACID compliance

With the inclusion of ACID capabilities in table formats like Iceberg and Delta Lake, users can now achieve a level of transactional awareness within a data lake. This does not necessarily mean that a data lake would be a replacement for an OLTP system. However, it does ensure that groups of updates transactionally complete together or are rolled back if they cannot be completed. This is helpful with some of today’s evolving ETL pipelines.

3 Types of modern open table formats

We take a closer look at modern open table formats: Apache Iceberg, Delta Lake, and Apache Hudi below.

1. Apache Iceberg

Apache Iceberg is an open source table format used to structure the data held in data lakes. Like the other table formats listed, it was developed to solve the challenges of performance, data modification, and CRUD operations in the data lake. It can be used with either HDFS or any object-based cloud platform, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.

Notably, Iceberg also offers schema evolution, schema partitioning, and time travel. This allows users to apply and update schemas, apply and update partitions, and enact version control to roll back changes to a system to a previous state. All of these adaptations push the data lake to a new level of functionality and create new use-cases for data lakes.

Demo: Iceberg and Trino

In this exciting exploration, we’re delving into the powerful combination of Apache Iceberg and Trino, two dynamic tools reshaping the landscape of big data. To do this, we’ll use Starburst Galaxy and compare its use to AWS Athena.

2. Delta Lake

Delta Lake is an open source framework developed by Databricks. Like other modern table formats, it employs file-level listings, improving the speed of queries considerably compared to the  directory-level listing of Hive.

Like Iceberg, Delta Lake offers enhanced CRUD operations, including the ability to update and delete records in a data lake which would previously have been immutable. It is ACID compliant and often used in transactional systems. This use-case makes data lakes a viable replacement for traditional transactional databases, while retaining the cost and storage benefits of other data lakes.

Delta Lake can be used with Starburst via the Delta Lake connector.

3. Apache Hudi

Apache Hudi is another table format, which is used less often than Iceberg or Delta Lake. It addresses some of the same problems discussed above.

Open table format architecture

Metadata captures changes in state

Architecturally, modern table formats are composed of a set of hierarchically structured metadata files. These files capture changes in the state of the data in the data lake. You can think of a table format as a kind of database transaction log, outlining all of the changes over the life of the data lake. This metadata  is stored in a structured, readily accessible format.

How does this work? Let’s explore how Iceberg uses enhanced metadata collection to deliver additional functionality.

The image below shows how metadata tracks the changes to the dataset. The files held in the Data layer are captured by the metadata files held in the Metadata layer. As the files change, the metadata files attached to them track these changes.

Record snapshots

To achieve this, modern table formats create records pointing to individual metadata file locations. This file is known as a manifest file, and includes metadata containing information about the table at a given point in time. This acts as a kind of snapshot of the table, detailing the points at which a change is made. Multiple manifest files are stored in Manifest lists.

In the image below, changes in the Data layer have been detected in the Metadata layer. A new Manifest file and corresponding Manifest list has been created to capture these changes.

Create up-to-date record of changes

Manifest files and Manifest lists provide the ability to record an accurate, up-to-date account of the changes that have occurred over time. This includes inserts, deletions, updates, schema migrations, and partition changes. The changes themselves are stored in Metadata files known as Snapshots. Each snapshot is like a slice in time, allowing the dataset to be queried as it was multiple instances or rolled back to a previous state.

How does this work?

The image below shows how the changes in the Data layer have created a new Snapshot file, Snapshot 1. The original Snapshot file, Snapshot 0, is also retained. This creates a series of snapshots, each mapping changes to the data and recording those changes in the Metadata layer.

Open table formats vs Open file formats

Table and file formats are different open-source elements of an open data lakehouse. Columnar open file formats like Parquet and ORC ensure data within an object gets written in ways that optimize query performance; while open table formats like Iceberg sit above the files and objects, providing a layer of rich metadata to enable analytics on the underlying data lake.

Open table format feature Apache Iceberg Delta Lake Apache Hudi Apache Hive
Transaction support (ACID) Yes Yes Yes Limited 
File format Parquet, ORC, Avro Parquet Parquet, ORC, Avro Parquet, ORC, Avro, and more
Schema evolution Full Partial Full Partial 
Partition evolution Yes No No No
Data versioning Yes Yes Yes No
Time travel queries Yes Yes Yes No
Concurrency control Optimistic locking Optimistic locking Optimistic locking Pessimistic locking
Object store cost optimization Yes Yes Yes No
Community and ecosystem Growing  Growing Growing Established

Choosing the right open table formats based on the vendor

The data industry has woken up to the reality that direct access to the underlying files that store their data should be open. Storing data in open formats, specifically Apache Iceberg, in an object storage lake, has enabled this. This is ultimately good news for their customers and helps to reduce vendor lock-in.

Starburst was founded on the idea that data should always belong to the customer. With the announcements — Databricks acquires Tabular, founded by Apache Iceberg creators; Snowflake unveils Polaris, an open-source implementation of the Iceberg REST catalog — there is now a huge potential to improve price-performance by being able to choose the best engine for the job. Databricks and Snowflake have finally recognized this basic customer need.

But, Starburst believes in optionality, only made possible with a truly open data lakehouse, which is also known as the Icehouse. An Icehouse architecture is based on Trino for high-performance and scale SQL querying (read and write) and Iceberg for storage; it is complimentary to your Snowflake + Iceberg solution and can help you significantly lower your operating costs.

This cheat sheet will help you easily compare open table formats in one view, so that you can make the most informed decision on which table format is right for your business.