Configure an Amazon S3 catalog


1. Tutorial overview

Last Updated: 2024-01-03

Background

Amazon S3 is a cloud-based storage solution provided by Amazon Web Services (AWS).

It is designed to store and manage large amounts of data in a scalable and secure manner. It is one of the main underlying technologies used to create data lakes based on cloud object storage.

Scope of tutorial

In this tutorial, you will learn how to configure a catalog in Starburst Galaxy that connects to Amazon S3 object storage.

Learning objectives

Once you've completed this tutorial, you will be able to:

  • Configure an Amazon S3 catalog using Starburst Galaxy.
  • Choose the metastore and table format that are best for your S3 catalog.
  • Access your own S3 data source using Starburst Galaxy.

Prerequisites

  • You need a Starburst Galaxy account to complete this tutorial. Please see Starburst Galaxy: Getting started for instructions on setting up a free account.
  • This tutorial comes with a bring your own storage requirement. Before proceeding with this lesson, you must already have an Amazon S3 data lake set up.

    If this is not the case, please set this up first then return to this tutorial.

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

2. Sign in to Starburst Galaxy and set Admin role

Background

You're going to begin by signing in to Starburst Galaxy and setting your role. This prepares your account for connecting your Amazon S3 data source.

This is a quick step, but an important one.

Step 1: Sign in to Starburst Galaxy

Sign in to Starburst Galaxy in the usual way. If you have not already set up an account, see Starburst Galaxy: Getting started for instructions.

Step 2: Set your role

Your current role is listed in the top right-hand corner of the screen.

  • Check that your role is set to accountadmin.
  • If it is set to anything else, use the drop-down menu to select the correct role.

3. Create new Amazon S3 catalog

Background

Adding a new Amazon S3 catalog follows the same process as adding other data sources in Starburst Galaxy. This is one of the main ways that Starburst Galaxy is used to connect to data lakes.

The steps below will show you how to start the process of configuring a new catalog.

Step 1: Create a new catalog

Create a new catalog for your Amazon S3 data source.

  • In the left-hand navigation bar, click Data > Catalogs.
  • Click the Create catalog button.

Step 2: Select Amazon S3 data source

Starburst Galaxy allows the creation of catalogs for a number of different data sources. In this case, you are going to create a new catalog in the Amazon S3 category.

  • Click the Amazon S3 tile.

Step 3: Input name and description

The catalog needs both a name and description. This ensures that you can find it later.

  • In the Catalog name field, enter the name of the new catalog.
  • In the Description field, input a description. This can be anything you want, so make it meaningful for you.
  • Scroll down to continue the configuration process.

4. Amazon S3 Authentication

Background

When you connect Starburst Galaxy to a new data source, it is necessary to undergo an authentication process. This helps ensure that you are connecting the right data source and that you have the appropriate permissions.

Step 1: Choosing an authentication method

Starburst Galaxy allows you to configure several different authentication methods when creating a new catalog. This lets you connect to data sources of different types.

  • Choose your authentication method. Starburst Galaxy supports two methods of authentication with AWS: a cross account IAM role or an AWS access key. Follow the option below that matches your choice.

Step 2 (Option 1): Cross account IAM role

Use this option if you have already worked with your cloud security engineer to create an IAM cross account role configuration with Starburst Galaxy.

  • If you chose to authenticate with a Cross account IAM role, select your role from the drop-down list.

Step 2 (Option 2): AWS access key

Use this option if your cloud security engineer has given you an AWS access key and secret key pair to use for authentication.

  • If you chose to authenticate with an AWS access key, enter your AWS access key and AWS secret key.

5. Connect to Metastore

Background

Starburst Galaxy uses a metastore to keep track of the location of your data when it is added to the data lake, in this case to Amazon S3.

You have three options when choosing a metastore. Take some time to consider which is best for you then proceed with the steps corresponding to the metastore of your choice.

When setting up a Galaxy catalog to work with either the Starburst Galaxy or AWS Glue metastores, you will need to provide either an AWS AccessKey/SecretKey pair from an IAM User or an AWS Cross-account IAM role. Both of these AWS IAM identities acquire their privileges through the assignment of an IAM policy. The actual privileges granted are defined within the IAM policy configuration.

For your reference, this section provides the specific privileges that must be included in the IAM policies assigned to IAM Users or Roles used for configuring a Galaxy catalog.

Step 1: Select the Metastore

Starburst Galaxy allows you to use three different types of metastore with Amazon S3:

  • Galaxy Metastore
  • AWS Glue Metastore
  • Hive Metastore

The steps required to set up each metastore differ.

  • For the next part of this tutorial, follow the instructions corresponding to the metastore of your choice.

Step 2 (Option 1): Using the Starburst Galaxy Metastore

Starburst Galaxy includes its own metastore, which can be used to easily store metadata. Using this option is often the simplest metadata management solution.

The choice of metastore is completely decoupled from the choice of storage option, allowing you to mix and match.

  • Select Starburst Galaxy.
  • Enter the Default S3 bucket name.
  • Enter the Default directory name.
  • If desired, select Allow creating external tables.

This will allow you to create external tables outside of the default S3 bucket.

  • If desired, select Allow writing to external tables.

This will allow you to write data into external tables outside of the default S3 bucket.

Permissions required when using Starburst Galaxy metastore

If you choose to use the Starburst Galaxy Metastore, only S3 privileges need to be granted.

The following two permissions are required for read-only access:

  • s3:GetObject
  • s3:ListBucket

An AWS Cloud Security engineer can use the following JSON to grant the permissions listed above. If desired, you can extend the "Resource" list in this policy to include other S3 buckets as well.

{
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<s3-bucket-name>",
                "arn:aws:s3:::<s3-bucket-name>/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

The following four permissions are required for read/write access:

  • s3:GetObject
  • s3:ListBucket
  • s3:PutObject
  • s3:DeleteObject

An AWS Cloud Security engineer can use the following JSON to grant the permissions listed above. If desired, you can extend the "Resource" list in this policy to include other S3 buckets as well.

{
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<s3-bucket-name>",
                "arn:aws:s3:::<s3-bucket-name>/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}
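The two policies above differ only in their action list, and both can be extended to more buckets by adding two ARNs per bucket. The following is a minimal Python sketch of that pattern, using only the standard library. The function name build_s3_policy and the bucket names are hypothetical and chosen for illustration; review the generated JSON with your cloud security engineer before attaching it to an IAM user or role.

```python
import json

# Actions taken from the read-only and read/write lists in this tutorial.
READ_ACTIONS = ["s3:GetObject", "s3:ListBucket"]
WRITE_ACTIONS = ["s3:PutObject", "s3:DeleteObject"]


def build_s3_policy(bucket_names, read_write=False):
    """Return an IAM policy dict that covers every bucket in bucket_names."""
    actions = READ_ACTIONS + (WRITE_ACTIONS if read_write else [])
    resources = []
    for name in bucket_names:
        # Each bucket needs two ARNs: the bucket itself (for s3:ListBucket)
        # and the objects inside it (for the object-level actions).
        resources.append(f"arn:aws:s3:::{name}")
        resources.append(f"arn:aws:s3:::{name}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": resources}
        ],
    }


if __name__ == "__main__":
    # "example-bucket-a" and "example-bucket-b" are placeholder bucket names.
    policy = build_s3_policy(
        ["example-bucket-a", "example-bucket-b"], read_write=True
    )
    print(json.dumps(policy, indent=4))
```

Calling build_s3_policy with read_write=False produces the read-only variant for the same buckets.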

Step 2 (Option 2): Using the AWS Glue Metastore

Starburst Galaxy also allows you to use AWS Glue as a metastore. This can be a good option if your table metadata is already managed in AWS Glue.

The choice of metastore is completely decoupled from the choice of storage option, allowing you to mix and match.

  • Select AWS Glue.
  • Select your AWS Glue and S3 bucket region.
  • Enter the Default S3 bucket name.
  • In the Default directory name field, input a directory name of your choice.
  • If desired, select Allow creating external tables.

This will allow you to create external tables outside of the default S3 bucket.

  • If desired, select Allow writing to external tables.

This will allow you to write data into external tables outside of the default S3 bucket.

  • Select Use authentication details configured for S3 access. This will allow you to reuse your previous credentials to establish Glue access privileges.

Permissions required when using AWS Glue metastore

If you choose to use the AWS Glue metastore, both Glue and S3 privileges need to be granted.

The following nine permissions are required for read-only access:

  • glue:BatchGetPartition
  • glue:GetDatabase
  • glue:GetDatabases
  • glue:GetPartition
  • glue:GetPartitions
  • glue:GetTable
  • glue:GetTables
  • s3:GetObject
  • s3:ListBucket

An AWS Cloud Security engineer can use the following JSON to grant the permissions listed above.

If desired, the line with "Resource": "*" can be altered to allow access only to specific Glue databases and tables. For more details, see "Identity-based policy examples for AWS Glue" in the AWS Glue documentation.

{
    "Statement": [
        {
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<s3-bucket-name>",
                "arn:aws:s3:::<s3-bucket-name>/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}
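As noted above, "Resource": "*" in the Glue statement can be narrowed to specific databases and tables. The sketch below shows one way to build such a statement in Python. It relies on the Glue ARN shapes for the catalog, a database, and a table; the function names, region, account ID, and database name are hypothetical placeholders. Which resource types each Glue action needs can vary, so treat this as a starting point and confirm it against the AWS Glue documentation.

```python
import json

# Read-only Glue actions taken from the list in this tutorial.
GLUE_READ_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:BatchGetPartition",
]


def glue_resources(region, account_id, database, table="*"):
    """Return Glue ARNs for the catalog, one database, and its tables."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return [
        f"{prefix}:catalog",
        f"{prefix}:database/{database}",
        f"{prefix}:table/{database}/{table}",
    ]


def scoped_glue_statement(region, account_id, database):
    """Build a read-only Glue statement limited to a single database."""
    return {
        "Effect": "Allow",
        "Action": GLUE_READ_ACTIONS,
        "Resource": glue_resources(region, account_id, database),
    }


if __name__ == "__main__":
    # Region, account ID, and database name below are placeholders.
    statement = scoped_glue_statement("us-east-1", "123456789012", "sales")
    print(json.dumps(statement, indent=4))
```

The resulting statement would replace the first statement in the policy above, leaving the S3 statement unchanged.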

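The following nineteen Glue permissions are required for read/write access, in addition to the four S3 read/write permissions (s3:GetObject, s3:ListBucket, s3:PutObject, and s3:DeleteObject):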

  • glue:GetDatabase
  • glue:GetDatabases
  • glue:GetTable
  • glue:GetTables
  • glue:GetPartition
  • glue:GetPartitions
  • glue:BatchGetPartition
  • glue:CreateDatabase
  • glue:UpdateDatabase
  • glue:DeleteDatabase
  • glue:CreateTable
  • glue:UpdateTable
  • glue:DeleteTable
  • glue:CreatePartition
  • glue:UpdatePartition
  • glue:DeletePartition
  • glue:BatchCreatePartition
  • glue:BatchUpdatePartition
  • glue:BatchDeletePartition

An AWS Cloud Security engineer can use the following JSON to grant the Glue permissions together with read/write access to a single S3 bucket. If desired, you can extend the "Resource" list in the S3 statement to include other S3 buckets as well.

{
    "Statement": [
        {
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition",
                "glue:CreateDatabase",
                "glue:UpdateDatabase",
                "glue:DeleteDatabase",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:CreatePartition",
                "glue:UpdatePartition",
                "glue:DeletePartition",
                "glue:BatchCreatePartition",
                "glue:BatchUpdatePartition",
                "glue:BatchDeletePartition"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<s3-bucket-name>",
                "arn:aws:s3:::<s3-bucket-name>/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}
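Before attaching a policy, it can help to check that it actually contains every action listed above. The following Python sketch compares a policy document against a list of required actions. The helper names are hypothetical, and the check is deliberately simplified: it only looks at "Allow" statements and action names (including * and ? wildcards), and it ignores "Deny" statements, "Resource" scoping, and conditions, so it is not a substitute for testing the connection in Starburst Galaxy.

```python
from fnmatch import fnmatchcase


def allowed_action_patterns(policy):
    """Collect lowercased action patterns from every Allow statement."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # IAM accepts a single statement object as well as a list.
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        patterns.extend(action.lower() for action in actions)
    return patterns


def missing_actions(policy, required):
    """Return the required actions that no Allow statement matches."""
    patterns = allowed_action_patterns(policy)
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    ]


if __name__ == "__main__":
    # A deliberately incomplete example policy.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": "*",
            }
        ],
    }
    required = ["s3:GetObject", "s3:ListBucket", "s3:PutObject", "s3:DeleteObject"]
    print(missing_actions(policy, required))
```

An empty result means every required action name appears in at least one "Allow" statement.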

Step 2 (Option 3): Using the Hive Metastore

Starburst Galaxy also allows you to use the Hive Metastore. This can be a good option if you already run your own Hive Metastore service.

The choice of metastore is completely decoupled from the choice of storage option, allowing you to mix and match.

  • Select Hive Metastore.
  • Select the Connection type you want to use, either a direct connection or a connection through an SSH tunnel.
  • If you are connecting through an SSH tunnel, select Connect via SSH tunnel from the drop-down menu.
  • In the Hive Metastore host field, input the IP address or DNS name of your Hive Metastore.
  • In the Port field, input your Hive Metastore port number.
  • If desired, select Allow creating external tables.

This will allow you to create external tables outside of the default S3 bucket.

  • If desired, select Allow writing to external tables.

This will allow you to write data into external tables outside of the default S3 bucket.
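If the connection fails later, one quick check is whether the host and port you entered accept TCP connections at all. The following Python sketch does that with the standard library. The function name can_reach and the host name are hypothetical, and 9083 is used only because it is the common default port for the Hive Metastore Thrift service. Note that this tests reachability from the machine where you run it, not from Starburst Galaxy, so a successful result does not guarantee that Galaxy can connect.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and tries to connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # "metastore.example.com" is a placeholder host name.
    host, port = "metastore.example.com", 9083
    status = "reachable" if can_reach(host, port) else "not reachable"
    print(f"{host}:{port} is {status}")
```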

6. Select table format

Background

Table formats control the way that data is stored. These include popular modern, open table formats like Iceberg or Delta Lake, or older table formats like Hive.

Choose the default table format that fits your use case. In many cases, the best option is Iceberg, and Starburst Galaxy is designed to take advantage of its many enhanced features.

Step 1: Select the default table format

Use the radio buttons to select the default table format. For most users, we recommend using Iceberg.

  • Select Iceberg, Hive, or Delta Lake.

7. Test connection and connect catalog

Background

Every new catalog connection includes a test before you connect it. This helps to ensure that you have input the correct credentials and allows you to quickly fix any problems before actually connecting.

Step 1: Test and Connect

You're almost there! Time to test the connection and then complete the process of creating your new Amazon S3 catalog.

  • Click the Test connection button.
  • Confirm that you see the message "Hooray! You can now add this catalog to a cluster".
  • Click the Connect catalog button.

8. Configure access controls

Background

Starburst Galaxy allows you to configure your catalog in a number of ways regarding access controls. The most important of these involves granting write access or restricting the catalog to read-only access.

Take some time to consider whether you require write access, or whether read-only access will be sufficient.

Step 1: Select access level

Select the appropriate access level for your situation.

  • If you want to restrict the catalog to read-only access, select the read-only catalog button.
  • If you want to grant write access, deselect the read-only catalog button.
  • Click the Save access controls button.

Step 2: Add catalog to cluster or skip

At this point, you can either add the new catalog to a cluster, or choose to skip this and connect it later.

  • If you want to add the catalog to a cluster later, click Skip.
  • If you want to add the catalog to a cluster now, select the cluster name from the drop-down menu and click Add to cluster.

9. Tutorial wrap-up

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

You're all set! Now you can query the data in your Amazon S3 data lake.

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
