Last Updated: 2024-01-04
Google Cloud Storage (GCS) is a cloud-based storage solution provided by Google under the Google Cloud Platform (GCP).
It is designed to house and manage large amounts of data in a scalable and secure manner. It is one of the main underlying technologies used to create cloud object storage data lakes.
Scope of tutorial
In this tutorial, you will learn how to configure a catalog in Starburst Galaxy that connects to Google Cloud Storage (GCS).
Once you've completed this tutorial, you will be able to create and configure a GCS catalog in Starburst Galaxy.
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
Starburst Galaxy separates users by role. Configuring a new catalog requires a role with the appropriate privileges. In this tutorial, you'll use the accountadmin role.
This is a quick step, but an important one.
Your current role is listed in the top right-hand corner of the screen.
Adding a new GCS bucket follows the same process as adding other data sources in Starburst Galaxy. This is one of the main ways that Starburst Galaxy is used to connect to data lakes.
The steps below will show you how to start the process of configuring a new catalog.
Create a new catalog for your GCS data source.
Starburst Galaxy allows the creation of catalogs for a number of different data sources. In this case, you are going to create a new catalog in the Google Cloud Storage category.
The catalog needs both a name and description. This ensures that you can find it later.
When you connect Starburst Galaxy to a new data source, you must authenticate. This helps ensure that you are connecting to the right data source and that you have the appropriate permissions.
Starburst Galaxy supports authentication with GCS using a JSON key. This is the only method of authentication available.
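Before pasting a JSON key into the catalog form, it can help to check that the file looks like a service account key. The following is a minimal sketch, not part of Starburst Galaxy: the function name `check_service_account_key` is hypothetical, and the list of required fields is an assumption based on the usual layout of a Google Cloud service account key file.

```python
import json

# Assumption: fields normally present in a Google Cloud service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key_id", "private_key", "client_email")


def check_service_account_key(raw_text):
    """Return a list of problems found in a JSON key (empty list if none)."""
    try:
        key = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    # Report every required field that is absent or empty.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]
    # A key of another credential type (for example a user credential) is flagged.
    if key.get("type") and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']}")
    return problems


if __name__ == "__main__":
    # Placeholder values only; a real key comes from your own GCP project.
    sample = json.dumps({
        "type": "service_account",
        "project_id": "example-project",
        "private_key_id": "placeholder",
        "private_key": "placeholder",
        "client_email": "example@example-project.iam.gserviceaccount.com",
    })
    print(check_service_account_key(sample))
```

An empty list means the key has the expected shape. It does not prove that the key is valid or that the service account can read the bucket; the connection test later in the tutorial covers that.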
Starburst Galaxy uses a metastore to keep track of the location of your data when it is added to the data lake, in this case to GCS.
You have two options when choosing a metastore. Take some time to consider your options, then proceed with the steps corresponding to the metastore of your choice.
Now it's time to select the metastore. Starburst Galaxy allows you to use two different types of metastore with GCS: the built-in Starburst Galaxy metastore and the Hive Metastore.
The steps required to set up each metastore differ.
Starburst Galaxy includes its own metastore, which can be used to easily store metadata. Using this option is often the simplest metadata management solution.
The choice of metastore is completely decoupled from the choice of storage option, allowing you to mix and match.
This will allow you to create external tables outside of the default GCS bucket.
This will allow you to write data into external tables outside of the default GCS bucket.
Starburst Galaxy also allows you to use the Hive Metastore. This can be a good option if you already run a Hive Metastore and want to keep using it.
The choice of metastore is completely decoupled from the choice of storage option, allowing you to mix and match.
This will allow you to create external tables outside of the default GCS bucket.
This will allow you to write data into external tables outside of the default GCS bucket.
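To make "outside of the default GCS bucket" concrete: a table location is a `gs://` URI, and it counts as external to the default bucket when its bucket name differs from the one configured for the catalog. The sketch below illustrates that comparison. The function name and the bucket names are hypothetical, and the code is not part of Starburst Galaxy.

```python
from urllib.parse import urlparse


def is_outside_default_bucket(location, default_bucket):
    """Return True when a gs:// location points at a bucket other than the default one."""
    parsed = urlparse(location)
    # For gs://bucket/path, the bucket name is parsed as the network location.
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// location: {location}")
    return parsed.netloc != default_bucket


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    print(is_outside_default_bucket("gs://analytics-raw/events/", "galaxy-default"))
    print(is_outside_default_bucket("gs://galaxy-default/sales/", "galaxy-default"))
```

Locations for which this returns True are the ones that depend on the two external table settings described above.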
Table formats control the way that data is stored. They include modern open table formats such as Iceberg and Delta Lake, as well as the older Hive table format.
Choose the default table format that fits your use case. In many cases, the best option is Iceberg, and Starburst Galaxy is designed to take advantage of its many enhanced features.
Use the radio buttons to select the default table format. We highly recommend using Iceberg as a default for most users.
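The default table format applies when a statement does not name a format. The sketch below builds a `CREATE TABLE` statement that either relies on the catalog default or names a format explicitly. It assumes that a `type` table property with the values `iceberg`, `delta`, and `hive` is how a statement selects a format in Starburst Galaxy; confirm this against the Starburst Galaxy documentation before relying on it. The function, catalog, schema, and table names are all hypothetical.

```python
# Assumption: the table formats named in this tutorial, as values of a `type` property.
SUPPORTED_FORMATS = ("iceberg", "delta", "hive")


def create_table_sql(catalog, schema, table, columns, table_format=None):
    """Build a CREATE TABLE statement, optionally naming the table format."""
    if table_format is not None and table_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported table format: {table_format}")
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    statement = f"CREATE TABLE {catalog}.{schema}.{table} ({column_list})"
    if table_format is not None:
        # Without this clause, the catalog's default table format is used.
        statement += f" WITH (type = '{table_format}')"
    return statement


if __name__ == "__main__":
    columns = [("id", "bigint"), ("total", "double")]
    print(create_table_sql("gcs_catalog", "demo", "orders", columns))
    print(create_table_sql("gcs_catalog", "demo", "orders", columns, "iceberg"))
```

If Iceberg is the catalog default, the first statement would create an Iceberg table without any `WITH` clause, which is why choosing a sensible default saves effort later.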
Every new catalog connection includes a test before you connect it. This helps to ensure that you have entered the correct credentials and allows you to quickly fix any problems before actually connecting.
You're almost there! Time to test the connection and then complete the process of creating your new GCS catalog.
Starburst Galaxy allows you to configure your catalog in a number of ways regarding access controls. The most important of these involves granting write access or restricting the catalog to read-only access.
Take some time to consider whether you require write access, or whether read-only access will be sufficient.
Select the appropriate access level for your situation: read-only or read and write.
At this point, you can either add the new catalog to a cluster, or choose to skip this and connect it later.
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
You're all set! Now you can query the data in your GCS data lake.
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!