Building a federated data lakehouse with Starburst Galaxy

Share

We are eleven days into the new year, and I have spent the past two weeks exerting unreasonable amounts of effort trying to make this year different from the past and actually keep all my resolutions. Sleep well! Exercise! Practice self care! Wash your face! Complete all your daily tasks! The overwhelming nature of trying to form multiple lasting habits at the same time has been conflicting and exhausting. I have been arbitrarily working toward the success of each individual goal without understanding the consequences, and sometimes my methods were inevitably working against each other. It’s a lot easier to hit the snooze button and skip that morning workout when you don’t get enough sleep because you are spending your time the night before trying to tick off all the tasks from yesterday’s to-do list. Not fully understanding the ramifications of each individual action, I have failed to step backward, analyze the entire picture, and evaluate how the ecosystem of my life weaves together. This year, eleven days in, I’ve decided to pivot. Instead of declaring seventeen New Year’s resolutions I will inevitably break by March, I have resolved to group the appropriate resolutions as one mega resolution: practicing a more balanced and optimized lifestyle.

I’m not the only one creating resolutions for the New Year, both personally and professionally.  Organizations and individuals alike are striving to be their best self this year, which in the current economic climate looks much different than our previous sense of normal. The data community is following suit with 2023 trends seemingly pointing to a similar sentiment of optimization. Instead of supporting individual efforts that waste time, money, and energy, data engineers will spend this year attempting to be more efficient in all of their endeavors by hopefully optimizing their processes, systems, and architectures to do more with less. With fewer resources, the innovative but flaky custom shell script workaround for that leaky data pipeline may have a myriad of new consequences. Now more than ever, data teams are looking to implement solutions that are simple, efficient, and effective.

I immediately think about the time and energy I would have saved myself had I provided my business partners with an ANSI SQL interface that allowed them to query multiple databases at once. I also know that we all would have benefited from a simple way to achieve faster data lake analytics at a petabyte scale. In accordance with my long standing belief that the only way to prove value is to show, not tell, I’ve created a tutorial addressing both of these principles using Starburst Galaxy. Using Starburst Galaxy itself is already adding a level of simplicity and efficiency you would not get using Trino alone, but it also includes lots of additional beneficial features that help consolidate your data architecture such as role based access control, autoscaling, catalog insights, and much more that can help you be more efficient at your job. Nevertheless, let us dive into the tutorial together so that you can see for yourself if this is the right choice for you. This article is intended to support the github repository which has all the instructions contained within the readme. If you have any questions, you know where to find me. I’d love to help.

ds

 

Tutorial architecture

The premise of this example is that we have Pokémon Go data being ingested into S3, which contains each Pokémon’s encounter information. This includes the geo-location data of where each Pokémon spawned, and how long the Pokémon could be found at that location. What we don’t have is any information on that Pokemon’s abilities. That information is contained in the Pokédex stored in MongoDB which for our purposes I’ve cleverly nicknamed PokéMongoDB. It includes data about all the Pokémon including type, legendary status, catch rate, and more. To create meaningful insights from our data, we need to combine the incoming geo-location data with the static dimension CSV table located in MongoDB.

To do this, we will first build out a reporting structure in the data lake using Starburst Galaxy. The first step is to read the raw data stored in the land layer, then clean and optimize that data into more performant ORC files in the structure layer. Finally, we will join the spawn data and Pokédex data together into a single table that is cleaned and ready to be utilized by a data consumer. We can then create some visualizations to analyze which Pokémon are common to spawn in the San Francisco area.

In the video and the tutorial, I walk through all the setup required to put this data lakehouse architecture into action including creating my catalogs, cluster, schemas, and tables. After incorporating open table formats, applying native security, and building out a reporting structure, I have confidence that my data lakehouse is built to last, and end up with some really cool final Pokémon graphs.

Prerequisites

You will need the following prerequisites to complete this tutorial:

Set up your MongoDB and AWS sources

To get started, we must set the stage. Assuming you have a MongoDB connection set up, and an AWS account at your disposal, review the source files and upload each file to their respective location.  Make sure both sources are set up in the same region. Create an S3 bucket for your Amazon source (ex: s3://pokemon-demo-<identifier>/), and within that bucket create a subfolder to hold the Pokémon spawns CSV file (ex: s3://pokemon-demo-mm/pokemon-spawns-csv/pokemon-spawns.csv)Next, create an AWS access key that can authenticate the connection between S3 and Starburst Galaxy.

Configure Starburst Galaxy

After this section is complete, you will have two catalogs and one cluster configured in Starburst Galaxy.  Naming is important for your future querying so keep in mind your selected names for each catalog and cluster. 

First, start by creating an Amazon S3 Iceberg catalog named aws_pokemon. Authenticate to S3 through your previously created access key.  Select Starburst Galaxy as your metastore, and enter the name of your S3 bucket. For your default directory name, enter the name of your favorite Pokémon. Enable both creating and writing to external tables, and connect the catalog. Save the account admin access controls after connecting.

Next, create a MongoDB catalog named mongo_pokedex. Authenticate to MongoDB using either a direct connection or via SSH tunnel. I personally connected using my connection URL that I found extremely easily from my MongoDB Compass setup. Save the account admin access controls after connecting.

Once both catalogs are created, create your cluster named pokemon-cluster. Attach both previously created catalogs, aws_pokemon and mongo_pokedex to this cluster. Select your cluster size, cluster type, and cloud provider region. Then, create your cluster.

Finally, configure role-based access control for your object storage location.  Navigate to the Access control tab and select the Roles and privileges dropdown. Click into the highlighted account admin role, and add a new location privilege. Enter the S3 URI followed by a /* for your newly created S3 bucket. For example: s3://pokemon-demo-mm/*

Create your land layer

To create the land layer, you will first create a schema. Then we will create the raw table from the CSV in S3. Replace your AWS bucket location in the below SQL statement.

CREATE SCHEMA hive WITH (location='s3://AWS BUCKET/pokemon-hive/');

Create the Pokémon spawns table. Make sure to add in your external location to the S3 location of your Pokémon spawns CSV.

 CREATE TABLE aws_pokemon.hive.pokemon_spawns(
 "s2_id" VARCHAR,
 "s2_token" VARCHAR,
 "num" VARCHAR,
 "name" VARCHAR,
 "lat" VARCHAR,
 "long" VARCHAR,
 "encounter_ms" VARCHAR,
 "disappear_ms" VARCHAR
)
WITH (
 type = 'HIVE',
 format = 'CSV',
 external_location = 's3://S3 BUCKET NAME/CSV LOCATION/',
 skip_header_line_count=1
); 

Create your structure layer

We will create the structure tables in Iceberg format. To create the structure tables, we will first create a schema to hold all the Iceberg tables. Replace the AWS bucket name with your own.

CREATE SCHEMA hive WITH (location='s3://AWS BUCKET/pokemon-iceberg/');

Create the structure table for the Pokémon Go spawn data stored in S3. If you named your catalog differently than the example, replace appropriately in the SQL statement.

CREATE TABLE aws_pokemon.iceberg.pokemon_spawns
WITH (
  format = 'ORC'
) AS
SELECT
  CAST(num AS INTEGER) AS number,
  name,
  CAST(lat AS DOUBLE) AS lat,
  CAST(long AS DOUBLE) AS long,
  CAST(encounter_ms AS BIGINT) AS encounter_ms,
  CAST(disappear_ms AS BIGINT) AS disappear_ms
FROM aws_pokemon.hive.pokemon_spawns;

Create the structure table for the Pokédex data stored in MongoDB. Make sure you appropriately replace your FROM location with the starburst-galaxy-catalog.mongodb-database.mongodb-collection appropriate for your setup.

CREATE TABLE aws_pokemon.iceberg.pokemon_pokedex
WITH (
  format = 'ORC'
) AS
SELECT
  CAST(number AS INTEGER) AS number,
  name,
  "Type 1" AS type1,
  "Type 2" AS type2,
  CAST(json_parse(replace(replace(Abilities, '''s', 's'), '''', '"')) AS ARRAY(VARCHAR)) AS abilities,
  CAST(hp AS INTEGER) AS hp,
  CAST(att AS INTEGER) AS att,
  CAST(def AS INTEGER) AS def,
  CAST(spa AS INTEGER) AS spa,
  CAST(spd AS INTEGER) AS spd,
  CAST(spe AS INTEGER) AS spe,
  CAST(bst AS INTEGER) AS bst,
  CAST(mean AS DOUBLE) AS mean,
  CAST("Standard Deviation" AS DOUBLE) AS std_dev,
  CAST(generation AS DOUBLE) AS generation,
  "Experience type" AS experience_type,
  CAST("Experience to level 100" AS BIGINT) AS experience_to_lvl_100,
  CAST(CAST("Final Evolution" AS DOUBLE) AS BOOLEAN) AS final_evolution,
  CAST("Catch Rate" AS INTEGER) AS catch_rate,
  CAST(CAST("Legendary" AS DOUBLE) AS BOOLEAN) AS legendary,
  CAST(CAST("Mega Evolution" AS DOUBLE) AS BOOLEAN) AS mega_evolution,
  CAST(CAST("Alolan Form" AS DOUBLE) AS BOOLEAN) AS alolan_form,
  CAST(CAST("Galarian Form" AS DOUBLE) AS BOOLEAN) AS galarian_form,
  CAST("Against Normal" AS DOUBLE) AS against_normal,
  CAST("Against Fire" AS DOUBLE) AS against_fire,
  CAST("Against Water" AS DOUBLE) AS against_water,
  CAST("Against Electric" AS DOUBLE) AS against_electric,
  CAST("Against Grass" AS DOUBLE) AS against_grass,
  CAST("Against Ice" AS DOUBLE) AS against_ice,
  CAST("Against Fighting" AS DOUBLE) AS against_fighting,
  CAST("Against Poison" AS DOUBLE) AS against_poison,
  CAST("Against Ground" AS DOUBLE) AS against_ground,
  CAST("Against Flying" AS DOUBLE) AS against_flying,
  CAST("Against Psychic" AS DOUBLE) AS against_psychic,
  CAST("Against Bug" AS DOUBLE) AS against_bug,
  CAST("Against Rock" AS DOUBLE) AS against_rock,
  CAST("Against Ghost" AS DOUBLE) AS against_ghost,
  CAST("Against Dragon" AS DOUBLE) AS against_dragon,
  CAST("Against Dark" AS DOUBLE) AS against_dark,
  CAST("Against Steel" AS DOUBLE) AS against_steel,
  CAST("Against Fairy" AS DOUBLE) AS against_fairy,
  CAST("Height" AS DOUBLE) AS height,
  CAST("Weight" AS DOUBLE) AS weight,
  CAST("BMI" AS DOUBLE) AS bmi
FROM mongo_pokedex.pokemongo.pokedex;

Create your consume layer

Create the consume table. This will combine the Type 1, Type 2, and Mega Evolution status from the Pokédex for each Pokémon found in the Pokémon spawn data. We are only looking for Pokémon within the San Fransisco Bay Area, so we will restrain the latitude and longitude to do so. Make sure if you have a different naming convention for your catalogs and schemas, you change the SQL statement to appropriately match it.

CREATE TABLE aws_pokemon.iceberg.pokemon_spawns_by_type AS
SELECT s.*, p.type1, p.type2, p.legendary
FROM  aws_ketchum.iceberg_pokemon.pokemon_pokedex p 
 JOIN aws_ketchum.iceberg_pokemon.pokemon_spawns s 
 ON p.number = s.number and p.mega_evolution = FALSE
 WHERE lat >= 37.62 and lat <= 37.86 AND long >= -122.51 and long <= -122.12;

Finishing touches

I’ve downloaded the code to make this liveboard, it’s in the github repository under the ThoughtSpot folder. If you want to recreate this yourself and are having trouble, please reach out as I’m happy to help.

I hope you made it to the bottom of this page, even if you are power scrolling. As a reward, I am telling you about the best data conference ever, Datanova, our FREE and VIRTUAL conference February 8th and 9th, 2023. Sign up now, you won’t regret it.

Register for Datanova Now

Sign up today!