Last Updated: 2023-12-14
Storing data in a data lake and accessing it directly from the lake can help your organization reduce costs by avoiding expensive data storage solutions. However, the files landed in a data lake must be registered to a metastore before they can be queried. As your organization lands files from many different sources into its data lake on an hourly, daily, or weekly basis, data managers may struggle to keep up with the file registration needed to access that data.
This is especially burdensome with the Azure Blob and Google Cloud Storage object stores, which don't include an event manager to manage files as they land. When the data managers in an organization aren't able to quickly register files, data consumers have to wait for the newest data or work with stale data.
Starburst Galaxy's schema discovery feature helps your organization solve this challenge. Data managers run schema discovery on a data lake catalog to find and register new files with the metastore of their choice. This feature will decrease the amount of time it takes for the most up-to-date data to get into the hands of data consumers.
In this tutorial, you will configure a catalog in Starburst Galaxy that connects to an Amazon S3 object store. You will then run schema discovery on this S3 data lake to discover existing schemas and tables.
If you are a data engineer, this tutorial will show you how easy it is to use schema discovery to facilitate data lake file registration.
You need a Starburst Galaxy account to complete this tutorial. Please be sure to complete the tutorial titled Starburst Galaxy: Getting started before attempting this tutorial.
Upon successful completion of this tutorial, you will be able to:
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
Burst Bank relies on Amazon S3 object storage to house a significant portion of its data. As the bank expands, the volume of data pouring into this data lake increases daily.
Unfortunately, the data engineering team has been struggling to keep up with the task of registering new files to the metastore. This process is essential for enabling data consumers to utilize the data effectively for analytics.
Fortunately, Burst Bank utilizes Starburst Galaxy, a solution that offers schema discovery to address their current challenge. Your job is to help the data engineers at Burst Bank by showing them the schema discovery feature.
Schema discovery is a process in data management that involves automatically identifying and understanding the structure of a database, data warehouse, or data lake.
In Starburst Galaxy, schema discovery works on a data lake by searching your object store to find the metadata corresponding to the schemas, tables, and partitions. Once the schemas have been discovered, a preview of the tables and columns is generated. Schemas can then be added easily, and queried in the normal way.
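To make the idea concrete, here is a minimal sketch of how schemas, tables, and partitions can be inferred from object paths alone. It is not Starburst Galaxy's implementation. It assumes a Hive-style layout of schema/table/column=value/file, and the function name infer_layout and the sample object keys are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def infer_layout(keys: List[str]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each (schema, table) pair to the partition columns seen in its paths.

    Assumption: keys follow 'schema/table/[column=value/...]file'.
    """
    layout: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for key in keys:
        parts = key.strip("/").split("/")
        if len(parts) < 3:
            continue  # need at least schema/table/file
        # Drop the file name, then split into schema, table, and the rest.
        schema, table, *middle = parts[:-1]
        columns = layout[(schema, table)]  # registers unpartitioned tables too
        for segment in middle:
            if "=" in segment:
                columns.add(segment.split("=", 1)[0])
    return dict(layout)


# Hypothetical object keys, for illustration only.
sample_keys = [
    "burst_bank/account/data-0001.parquet",
    "burst_bank/transactions/year=2023/month=12/data-0001.parquet",
]
for (schema, table), columns in infer_layout(sample_keys).items():
    print(schema, table, sorted(columns))
```

A real discovery run also reads file metadata to work out column names and types, which this sketch leaves out.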
Starburst Galaxy also tracks the schemas that have been added with schema discovery. You can use the Starburst Galaxy Web UI to see the changes made to a catalog's schema. The audit log will also display any errors associated with the columns in the current or previous schema.
The following video walks through all the steps in this tutorial.
You can choose to watch the video and follow along using your own account. Alternatively, if you prefer, you can skip the video and proceed directly to the step-by-step instructions provided later in the tutorial.
You're going to begin by signing in to Starburst Galaxy and setting your role.
For this tutorial, we have set up a shared Amazon S3 training bucket containing sample data. All connection credentials will be provided in the steps below.
This is a quick step, but an important one.
Sign in to Starburst Galaxy in the usual way. If you have not already set up an account, you can do that here.
Starburst Galaxy separates users by role. Configuring a new catalog will require access to a role with appropriate privileges. Today, you'll be using the accountadmin role.
Your current role is listed in the top right-hand corner of the screen.
Adding a new Amazon S3 catalog follows the same process as adding other data sources in Starburst Galaxy. This is one of the main ways that Starburst Galaxy is used to connect to data lakes.
The steps below will show you how to start the process of configuring a new catalog.
Create a new catalog for your Amazon S3 data source.
Starburst Galaxy allows the creation of catalogs for a number of different data sources. In this case, you are going to create a new catalog in the Amazon S3 category.
The catalog needs both a name and description. This ensures that you can find it later.
Use schema_discovery as the catalog name.
When you connect Starburst Galaxy to a new data source, you must authenticate. This helps ensure that you are connecting the right data source and that you have the appropriate permissions.
Starburst Galaxy allows you to configure several different authentication methods when creating a new catalog. This lets you connect to data sources of different types.
For this tutorial, you're going to choose the AWS access key method. It uses an access key and secret key pairing, which we will provide for this tutorial.
AKIAYUW62MUVVQ2OP34U
qu7NRVypctO7/86OmBLgHJa64ij3k/mVuuZD2y1U
Starburst Galaxy uses a metastore to keep track of the location of your data when it is added to the data lake, in this case to Amazon S3.
You can use three different types of metastore with Amazon S3:
For this tutorial, you will use the Galaxy Metastore, which removes the need to configure and manage a separate Hive Metastore Service.
The shared training bucket is query-plan-labs-data-external, and the sample data sits in its burst_bank directory.
Table formats control the way that data is stored. These include modern open table formats such as Iceberg and Delta Lake, as well as the older Hive format.
In this tutorial, you will be using Iceberg, which is the newest and most advanced of the table formats. It is considered best practice to use Iceberg whenever possible with Starburst Galaxy, which is designed to take advantage of its many enhanced features.
Every new catalog connection includes a test before you connect it. This helps to ensure that you have input the correct credentials and allows you to quickly fix any problems before actually connecting.
You're almost there! Time to test the connection and then complete the process of creating your new Amazon S3 catalog.
Starburst Galaxy allows you to grant or restrict read access. This is an important feature in production environments.
At this point, you can either add the new catalog to a cluster, or choose to skip this and connect it later.
In this tutorial, you're going to add the catalog to your cluster right away.
Add the catalog to the aws-us-east-1-free cluster.
Now that you've set up an Amazon S3 catalog, it's time to test schema discovery for Burst Bank using Starburst Galaxy. This is the exciting part where you get to dive deep into the capabilities of this feature.
Let's get started!
Getting started with schema discovery is easy. In fact, we're so sure you'll need it often that Starburst Galaxy automatically offers to run schema discovery when you add a new catalog to a cluster.
You just added the schema_discovery Amazon S3 catalog, so it's time to run schema discovery on that catalog.
Now it's time to configure your schema discovery search by providing a Catalog location URL and Default Schema.
Catalog location URL: s3://query-plan-labs-data-external/burst_bank/
Default Schema: burst_bank
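The location URL combines the bucket and directory used when the catalog was created. As a quick illustration, the sketch below splits the URL into those two parts with Python's standard library. The helper name split_s3_url is hypothetical, and the code does not contact S3 or Starburst Galaxy.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_s3_url(url: str) -> Tuple[str, str]:
    """Return (bucket, prefix) for an s3:// URL, without contacting S3."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    # The bucket is the host part; the prefix is the path without its leading slash.
    return parts.netloc, parts.path.lstrip("/")


bucket, prefix = split_s3_url("s3://query-plan-labs-data-external/burst_bank/")
print(bucket)  # query-plan-labs-data-external
print(prefix)  # burst_bank/
```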
While you wait, review the following important information regarding schema naming.
Your schema discovery process should now be complete!
Nine tables were discovered during the schema discovery. Let's inspect them one by one and then create each of these tables in the schema_discovery.burst_bank schema.
Select the burst_bank schema field to view the nine tables that were discovered.
Now that you've created the new tables, it's time to test them out using queries. To do this, you'll need to go to the query editor.
Now that you're in the query editor, it's time to select one of the new tables to query.
You could choose any of them, but let's go with the account table.
Expand the schema_discovery catalog.
Expand the burst_bank schema.
Beside the account table, click the ellipsis to display the query suggestion menu.
Enter SELECT * FROM account LIMIT 10
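The query above uses the bare table name account, which relies on the query editor already pointing at the schema_discovery catalog and burst_bank schema. If you run the query from another context, a fully qualified catalog.schema.table name avoids that dependency. The sketch below only builds the query text. The helper names are hypothetical, and it does not connect to Starburst Galaxy.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def preview_query(catalog: str, schema: str, table: str, limit: int = 10) -> str:
    """Build a SELECT that previews a table by its fully qualified name."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    qualified = ".".join(quote_identifier(p) for p in (catalog, schema, table))
    return f"SELECT * FROM {qualified} LIMIT {limit}"


print(preview_query("schema_discovery", "burst_bank", "account"))
# SELECT * FROM "schema_discovery"."burst_bank"."account" LIMIT 10
```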
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use schema discovery in Starburst Galaxy.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!