Hive vs Iceberg: Choosing the best table format for your analytics workload

Hive to Iceberg Migration

May 16, 2023

Yusuf Cattaneo

Solutions Architect

Starburst

Emma Lullo

Senior Product Marketing Manager, Starburst Galaxy

Starburst




This post is part of the Iceberg blog series. Read the entire series:

Apache Hive has long been a popular choice for storing and processing large amounts of data in Hadoop environments. However, as data engineering requirements have evolved, new technologies have emerged that offer improved performance, flexibility, and workload capabilities.

In this blog post, we’ll walk through the differences between Hive and Iceberg, the use cases for both formats, and how to start planning your migration strategy.

What is Apache Hive?

Apache Hive is an open-source data warehouse project built on top of Apache Hadoop that provides data query and analysis capabilities through a SQL-like interface. Hive supports storage on the Hadoop Distributed File System (HDFS) and, through Hadoop-compatible file system connectors, on object stores such as AWS S3, ADLS, and GCS. With Hive, non-programmers familiar with SQL can read, write, and manage petabytes of data.

Apache Hive architecture

There are four main components of Apache Hive:

Driver – The component that receives queries
Compiler – The component that parses queries
Metastore – The component that stores all the structure information of the various tables and partitions
Execution Engine – The component that executes the execution plan created by the compiler

For the purposes of comparison with Apache Iceberg, we will focus strictly on the Hive data model and the Hive Metastore.

Data in Hive is organized into tables, similar to a relational database, and the data for each table is stored in a directory in HDFS. The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions that operates independently of Apache Hive. The HMS has become a building block for data lakes, providing critical data abstraction and data discovery capabilities.

Challenges of Apache Hive

The majority of the challenges associated with Apache Hive stem from the fact that data in tables is tracked at the folder level. This leads to several challenges including:

Slow file list operations: Each time you query data in Hive, the engine must list the files in every directory the query touches, which gets expensive and slow on datasets with many partitions.
Inefficient DML: If you’re updating and deleting data frequently, you may experience high latency because Hive requires you to replace the entire file.
Costly schema changes: If you change your schema in Hive, you have to rewrite the entire data set, which is costly and time-intensive.
No transaction support: Hive is not ACID-compliant by default, so it cannot guarantee data consistency and integrity for transactions. While Hive offers optional Hive ACID transactional (version-less) tables, they are not universally or consistently supported by major SQL engines.
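To make the folder-level tracking concrete, here is a minimal Python sketch of the listing cost. It is a toy model, not Hive code: the `metastore` dictionary, the `day=NN` partition names, and all function names are assumptions made for illustration. The point is only that when the catalog tracks directories rather than files, planning a scan needs one directory listing per partition.

```python
import os
import tempfile


def build_partitioned_table(root, num_partitions, files_per_partition):
    """Create a toy directory-per-partition layout.

    Returns a metastore-like mapping of partition name -> directory.
    Note that only directories are tracked, never individual files.
    """
    metastore = {}
    for p in range(num_partitions):
        name = f"day={p:02d}"  # hypothetical partition naming
        part_dir = os.path.join(root, name)
        os.makedirs(part_dir)
        for f in range(files_per_partition):
            # Empty placeholder files stand in for Parquet/ORC data files.
            open(os.path.join(part_dir, f"part-{f}.parquet"), "w").close()
        metastore[name] = part_dir
    return metastore


def plan_by_listing(metastore):
    """Find data files by listing every partition directory.

    Returns (files, list_calls) so the listing cost is visible.
    """
    files, list_calls = [], 0
    for name in sorted(metastore):
        part_dir = metastore[name]
        list_calls += 1  # one listing per partition directory
        for fname in sorted(os.listdir(part_dir)):
            files.append(os.path.join(part_dir, fname))
    return files, list_calls


def demo_listing(num_partitions=5, files_per_partition=3):
    """Build a throwaway table and report (file_count, list_calls)."""
    with tempfile.TemporaryDirectory() as root:
        metastore = build_partitioned_table(root, num_partitions, files_per_partition)
        files, list_calls = plan_by_listing(metastore)
        return len(files), list_calls
```

In this toy, the number of listings grows with the number of partitions a query touches. On a local disk that is cheap, but on object storage each listing is a remote call, which is where the slowdown described above comes from.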

What is Apache Iceberg?

Apache Iceberg is an open table format that was designed with modern cloud infrastructure in mind. It was created at Netflix to overcome the limitations of Apache Hive and includes key features like efficient updates and deletes, snapshot isolation, and partitioning.

Check out Ryan Blue’s talk on creating the Apache Iceberg table format at Netflix.

Apache Iceberg architecture

As Tom Nats mentions in his “Introduction to Apache Iceberg in Trino” blog, Apache Iceberg is made up of three layers:

The Iceberg catalog
The metadata layer
The data layer

In this layout, Iceberg defines the data in the table at the file level, rather than having a table point to a directory or a set of directories.
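The three layers can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Iceberg implementation: `ToyCatalog`, `TableMetadata`, and `Snapshot` are hypothetical names, and real Iceberg metadata also involves metadata files, manifest lists, and manifests that are collapsed here into a plain tuple of file paths.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Metadata layer: the complete list of data files at one point in time."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass(frozen=True)
class TableMetadata:
    """Metadata layer: every snapshot the table has had, oldest first."""
    snapshots: Tuple[Snapshot, ...] = ()


class ToyCatalog:
    """Catalog layer: maps a table name to its current metadata."""

    def __init__(self) -> None:
        self._pointers: Dict[str, TableMetadata] = {}

    def create_table(self, name: str) -> None:
        self._pointers[name] = TableMetadata()

    def commit_append(self, name: str, new_files: List[str]) -> int:
        """Add data files by writing new metadata and swapping the pointer."""
        old = self._pointers[name]
        current = old.snapshots[-1].data_files if old.snapshots else ()
        snap = Snapshot(len(old.snapshots) + 1, current + tuple(new_files))
        self._pointers[name] = TableMetadata(old.snapshots + (snap,))
        return snap.snapshot_id

    def plan_scan(self, name: str, snapshot_id: Optional[int] = None) -> List[str]:
        """Return the data files to read, taken from metadata alone.

        No directory is listed. Passing an older snapshot_id reads the
        table as it was at that snapshot.
        """
        meta = self._pointers[name]
        if not meta.snapshots:
            return []
        if snapshot_id is None:
            return list(meta.snapshots[-1].data_files)
        for snap in meta.snapshots:
            if snap.snapshot_id == snapshot_id:
                return list(snap.data_files)
        raise KeyError(f"unknown snapshot {snapshot_id} for table {name}")
```

Because every snapshot names its files explicitly, the same lookup that plans a normal query can also plan a query against an earlier snapshot, which is the basis for the table versioning and time travel discussed later in this post.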

Advantages of Apache Iceberg

Apache Iceberg brings new capabilities to the data lake – including warehouse-like DML capabilities and data consistency. Specifically, Apache Iceberg offers the following advantages:

Fast snapshots: Snapshots eliminate costly and slow directory listings by allowing the engine to read straight from the metadata.
Efficient DML: Iceberg allows for full DML support on cloud storage.
In-Place Schema Changes: Iceberg supports in-place schema evolution, meaning you can evolve the table schema without costly rewrites.
Transactions: Iceberg provides ACID-compliant versioning, so data consistency and integrity are ensured for all transactions, and it behaves consistently across SQL engines.
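One way to see why schema changes can happen in place is that Iceberg tracks columns by stable field IDs rather than by name. The Python sketch below is a toy model of that idea, not Iceberg's API: the function names and the dictionary-based "data files" are assumptions for illustration only.

```python
from typing import Dict, List, Optional

# A toy "data file": each record stores values keyed by field ID, not by name.
DataFile = List[Dict[int, object]]


def read_rows(schema: Dict[int, str],
              data_files: List[DataFile]) -> List[Dict[str, Optional[object]]]:
    """Project records through the current schema (field ID -> column name).

    Columns that a file was written without are returned as None.
    """
    rows = []
    for data_file in data_files:
        for record in data_file:
            rows.append({name: record.get(field_id)
                         for field_id, name in schema.items()})
    return rows


def rename_column(schema: Dict[int, str], old_name: str, new_name: str) -> Dict[int, str]:
    """Rename a column by changing metadata only; the field ID stays the same."""
    return {fid: (new_name if name == old_name else name)
            for fid, name in schema.items()}


def add_column(schema: Dict[int, str], name: str) -> Dict[int, str]:
    """Add a column under a fresh field ID.

    This toy never drops columns, so max + 1 is always unused here; a real
    table would need to remember the last assigned ID so IDs are never reused.
    """
    new_id = max(schema, default=0) + 1
    return {**schema, new_id: name}
```

For example, starting from the schema `{1: "id", 2: "amount"}`, renaming `amount` to `total` and adding `currency` changes only the schema mapping. Files written before the change are still readable, and the new column simply reads as missing for old records.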


What is the difference between Hive and Iceberg?

Now that we have looked at the architecture for Hive and Iceberg, we understand that both are efficient technologies for querying large datasets. The choice depends on the requirements of your use case.

Let’s look at how the capabilities of the two compare:

Capability	Hive Tables	Iceberg Tables
Open source	Yes	Yes
Read object storage using SQL	Yes	Yes
File format	Parquet, ORC, Avro	Parquet, ORC, Avro
Performant at scale	Yes	Yes
ACID transactions	No	Yes
Table versioning	No	Yes
Time travel	No	Yes
Schema evolution	No	Yes
Partition evolution	No	Yes
Partition pruning	Yes	Yes

As you can see, Iceberg unlocks traditional data warehousing capabilities on cost-effective cloud storage.

When to Migrate to Apache Iceberg

I’m often asked by my customers and prospects “when should I consider migrating to Apache Iceberg?” While I do prefer working with Apache Iceberg, the decision to migrate is not one that should be taken lightly. It oftentimes requires several months of experimenting to structure your Iceberg tables correctly and optimize your queries.

That’s why I tend to take a use-case-based approach when considering the value of migration. If you are looking to run more use cases directly from your object storage (including one of the following), I would highly recommend exploring the migration, and Starburst can help.

4 Common use cases for migrating to Apache Iceberg

1. Data Applications

Organizations building latency-sensitive data applications on top of cloud object storage

2. Collaborative Workflows

Iceberg enables collaborative data workflows by providing a shared and consistent data representation at all times

3. Root Cause Analysis

Organizations performing historical analysis or root cause analysis by using time travel

4. Compliance

Organizations that require data in object storage to be easily modifiable for GDPR compliance

In summary

Iceberg can help leave situations like the below in the past (via its snapshot and time-travel capabilities):

If you’re exploring a potential migration, check out our migration tutorial or get free migration guidance from our experts.
