Last Updated: 2024-03-22
Apache Hive has long been a popular choice for storing and processing large amounts of data in Hadoop environments. However, as data engineering requirements have evolved, new technologies have emerged that offer better performance, greater flexibility, and support for a wider range of workloads. One such technology is Apache Iceberg, an open table format.
In this tutorial, we'll walk through the rationale for migrating from Hive to Iceberg, the steps needed to complete a successful migration, and some of the key benefits of doing so. We'll demonstrate two migration methods and discuss the advantages of each.
Upon successful completion of this tutorial, you will be able to:
- Compare the Hive and Iceberg table formats.
- Migrate a Hive table to Iceberg using the shadow migration method.
- Migrate a Hive table to Iceberg using the in-place migration method.
- Choose the migration method that best fits your use case.
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
Before diving into the steps in this tutorial, let's review the basic definitions of Hive and Iceberg and compare their features and capabilities.
Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets stored in the Hadoop Distributed File System (HDFS), one component of the larger Hadoop ecosystem. Hive was originally created as a way to query HDFS with SQL instead of MapReduce.
Today, Hive can be used to query any type of cloud object storage and is useful for running infrequent, batch SQL queries. Hive can handle large datasets with billions of rows, but query latencies can be high because planning a query requires listing the files in each partition's directory.
Apache Iceberg is a high-performance open table format that provides fast and scalable analytics on large datasets. Iceberg's design enables faster queries and more efficient data processing through features like row-level updates and deletes, snapshot isolation, and hidden partitioning.
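For example, once a table uses the Iceberg format, individual rows can be modified directly with SQL. The statements below are a minimal sketch against a hypothetical Iceberg table named orders:
-- Row-level operations on a hypothetical Iceberg table named "orders"
UPDATE orders SET order_status = 'shipped' WHERE order_id = 1234;
DELETE FROM orders WHERE order_status = 'cancelled';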
Both Iceberg and Hive are efficient technologies for querying large datasets, but the choice depends on the requirements of your use case.
Let's compare their features:
- Metadata: Hive tracks tables at the directory level, so query planning requires listing files; Iceberg tracks individual data files in table metadata, making planning faster.
- Partitioning: Hive uses explicit, directory-based partitioning; Iceberg supports hidden partitioning, so queries don't need to reference partition columns directly.
- Updates and deletes: Hive tables are difficult to modify at the row level; Iceberg supports efficient row-level updates and deletes.
- Transactions: Hive provides limited transactional guarantees; Iceberg provides ACID transactions with snapshot isolation.
- Time travel: Hive keeps no snapshot history; Iceberg snapshots let you query a table as of an earlier point in time.
There are two potential methods we can use to migrate data from a Hive table to an Iceberg table:
- Shadow migration, which restates the data into a brand-new Iceberg table.
- In-place migration, which converts the existing table without rewriting its data files.
Let's explore the definition and benefits of each method.
This approach creates a second Iceberg table from the original Hive table. By leveraging a "shadow" process, we are afforded the following benefits:
- You can audit and validate the data before putting the new table into use (a sample validation query appears later in this tutorial).
- You can restructure the schema or partition layout as part of the migration.
- The original Hive table remains untouched, so rolling back is as simple as dropping the new table.
The in-place data migration method avoids rewriting the data. Instead, new Iceberg metadata is written over the existing data files in your S3 bucket. By leveraging an "in-place" process, we are afforded the following benefits:
- The migration is fast, because no data files need to be rewritten.
- Storage is not duplicated, since the existing files are reused.
- There is no risk of errors being introduced while restating the data.
Ultimately, the method you choose will depend on many factors. The good news is that you can use Starburst Galaxy to migrate from Hive to Iceberg regardless of which method you choose.
It's time to build and populate the Hive tables you'll be using in this tutorial. You'll create two tables – one for each migration method. The tables will be built in the catalog and schema that you created as a prerequisite to this tutorial.
Note: In practice, you would use existing Hive tables to migrate to Iceberg.
Creating and modifying tables requires a role with advanced permissions. In this tutorial, we will utilize the accountadmin role.
In Starburst Galaxy, SQL is run from the Query editor.
The first table you create will use JSON files to store its data. You will use a CTAS (CREATE TABLE AS SELECT) statement to create the table and pull data from the tpch catalog to populate it.
For example, your session context in the query editor might be:
- Cluster: aws-us-east-1-free
- Catalog: tmp_cat
- Schema: tmp_erin_rosas_02152
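If you prefer SQL to the drop-down menus, you can set the same catalog and schema with a USE statement. This sketch uses the example names above; substitute your own:
USE tmp_cat.tmp_erin_rosas_02152;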
-- Create a Hive table stored as JSON, populated from the TPC-H sample data
CREATE TABLE cust_json
WITH (type='hive', format='json')
AS SELECT * FROM tpch.sf1.customer;

-- Confirm the table definition, including its format
SHOW CREATE TABLE cust_json;
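As a quick sanity check, you can verify that all of the source rows arrived; the TPC-H sf1 customer table contains 150,000 rows:
SELECT count(*) FROM cust_json;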
This table will be almost identical to the one you just created. The difference is that it will use ORC files to store its data.
-- Create the same table again, this time stored as ORC
CREATE TABLE cust_orc
WITH (type='hive', format='orc')
AS SELECT * FROM tpch.sf1.customer;

SHOW CREATE TABLE cust_orc;
Let's explore the shadow migration process first. We will use the table called cust_json for testing.
To convert this table to Iceberg format, we will leverage a CTAS statement with a single parameter added in the WITH clause to identify that the new table will be created using the Iceberg table format.
It's time to run the SQL to migrate your table from Hive to Iceberg via the shadow migration method.
-- Shadow migration: restate the Hive data into a brand-new Iceberg table
CREATE TABLE cust_iceberg_shadow
WITH (type='iceberg')
AS SELECT * FROM cust_json;

SHOW CREATE TABLE cust_iceberg_shadow;
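Because the shadow approach leaves both tables in place, you can validate the migration before switching any workloads over. For example, this query compares row counts between the original Hive table and its Iceberg shadow:
SELECT
  (SELECT count(*) FROM cust_json) AS hive_rows,
  (SELECT count(*) FROM cust_iceberg_shadow) AS iceberg_rows;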
Now it's time to try the in-place method for Hive to Iceberg migration. This method requires an ALTER TABLE statement with the SET PROPERTIES parameter identifying a change to the Iceberg table format, as detailed in our documentation.
You'll run the ALTER TABLE statement from the Starburst Galaxy query editor again for this step.
-- In-place migration: convert the existing Hive table to Iceberg without rewriting its data files
ALTER TABLE cust_orc
SET PROPERTIES type = 'ICEBERG';

SHOW CREATE TABLE cust_orc;
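Once the conversion completes, Iceberg features are available on the table. For instance, you can inspect its snapshot history through the $snapshots metadata table exposed by the Iceberg connector (a minimal sketch; your snapshot IDs and timestamps will differ):
SELECT snapshot_id, committed_at, operation
FROM "cust_orc$snapshots";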
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've completed this tutorial, you should have a better understanding of how and when to use each of the Hive to Iceberg migration methods.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!