Build a Data Lakehouse Reporting Structure with dbt and Starburst Galaxy

Share

Since my first introduction to dbt, I was intrigued to say the least.

Working as a data engineer, I was attempting to manage complicated and comprehensive data pipelines without any real method of version control and fretting over the data quality in the output. None of our homegrown solutions to these challenges could compare to the sophisticated software principles we would then be able to integrate into our existing data ecosystem once we incorporated dbt.

I have a skeptic’s soul, and my entire existence I’ve shied away from jumping on any bandwagon until I feel the fanfare is justifiable – especially when it requires hours and hours of rework rewriting already existing and (somewhat) functioning code. Naturally, I firmly believed that dbt must actually earn the hype that it was receiving, and after experimenting myself and of course trying the beloved jaffle shop introduction, I was all in.

There was a problem. Our data was scattered between multiple storage locations so it was impossible to actually begin using dbt in our pipelines until we executed an extensive and time consuming data migration just to relocate all the data into one centralized location.

Believe what you will about the strategy to create one centralized location that stores all of your data, I personally think the single source of truth seems a little unrealistic in the current data climate. Nevertheless, we developed a two year plan to migrate a majority of our data into one storage location.

To recap, I was converted from a doubting cynic to becoming an evangelist of what I believed to be the beginning of new standards for the modern data engineer, yet I was required to wait 9-12 months before I could even start the journey.

dbt + Trino

It’s taken me six months, but I finally got my revenge. I’m now an evangelist of a new solution: dbt + Trino together.

This passion project is quite personal to me, for the reasons now revealed to you above, and I have now been able to successfully federate multiple sources together. I built models from S3, Snowflake, and the TCP-H dataset sources and landed each new Iceberg table in S3 to build out a data lakehouse reporting structure.

I chose Starburst Galaxy since it is the fastest and easiest way to get up and running with Trino today.

Integrating Starburst Galaxy with dbt sets up two game changing use cases to enable new functionality:

1. Access to new data sources that haven’t or can’t be added into your data lake or data warehouse.

Without Trino/Starburst Galaxy, dbt cannot access data sources that are not already landed in your centralized data location, which leads to incomplete data analysis and lower quality insights.

2. Multiple data platforms in use have no way of connecting between one another.

If you have your AI/ML workloads configured within one data platform and your BI/reporting workloads in another platform, the two have no way of communicating with each other.

      This example will tackle the first use case as we create an open data lakehouse architecture.

      Open data lakehouse architecture

      Inspired by Jeremy Cohen’s Cinco de Trino demo, the goal of this tutorial is to demonstrate the power of implementing dbt + Starburst Galaxy together to create a data lakehouse.  For more information about the Starburst data lakehouse, visit this blog.

      In this example, we are going to build our lakehouse reporting structure in S3 using Iceberg open table format and use slightly different naming conventions from the traditional Land, Structure, and Consume nomenclature to accommodate for dbt standards.

      Lakehouse architecture key:

      • Land = Staging
      • Structure = Intermediate
      • Consume = Aggregate

      Sources:

      • AWS COVID-19 data lake
      • Snowflake public COVID19 database
      • TPC-H standard datasets

      Why are we using so many data sources? Well, for this data lakehouse tutorial we will take you through all the steps of creating a reporting structure, which includes gathering your sources into your land layer in S3. Since dbt specializes as a transform tool and can only be utilized after the data is in one storage solution, Starburst Galaxy’s superpower with dbt is being able to federate data from multiple different sources into one dbt repository.

      Lineage graph:

      The lineage graph illustrates all the models within this project. Each source, highlighted in green and prefaced by the data location they originate from, will be built upon to create our three data lakehouse layers within S3 using Iceberg. You should notice an almost identical structure between the ideal architecture diagram displayed above and the resulting data lineage graph that displays the models throughout each layer of the reporting structure.

      For consistency’s sake, I only did joins between homogeneous source systems in my intermediate models. Since my two aggregate models follow the same logical structure just at varying rollup granularities, one by nation and one by region, I thought it best to keep the sources organized until that final aggregate output. I did appreciate the flexibility within the project repository itself, since each project has different constraints, and by having that ability I was able to structure my repository the way that worked the best for my environment.

      Tutorial Information

      The beauty of this tutorial is that I have pre-built all the models for you. This way you can quickly expose yourself to a working example and then try building a new model for yourself within an already developed project structure. It can be overwhelming to start any new project from scratch, even with the instructions laid out for you. My hope is that this type of tutorial allows you to quickly identify why combining these two techs together is so cool, and hopefully encourages you to experiment within this project structure yourself. I’ve added in this section a high level overview of the tutorial steps that can be used to supplement the repository readme instructions.

      Prerequisites

      You will need the following prerequisites to complete this tutorial:

      • Sign up for a Starburst Galaxy account. Starburst Galaxy will be our query engine. 
      • AWS account credentials. S3 will act as both a source and also as the target since we are implementing a data lakehouse architecture.
      • Snowflake account credentials. You don’t need the snowflake source to see the value, but the current tutorial is set up to incorporate tables from the public COVID19 snowflake database.
      • A dbt installation of your choosing. I used a virtual environment on my M1 mac because that was the most recommended. I’ll add the steps below in this readme. Review the other dbt core installation information to pick what works best for you.

      Getting started with AWS

      Create a new S3 bucket in the Ohio region (us-east-2) with some distinct naming convention such as dbt-aws-<username> (ex: dbt-aws-monica). You must specify this region because this is where the AWS COVID-19 public data lake exists. The S3 bucket will store both source data ingested from the COVID-19 data lake, while also acting as the target storage solution for the data lakehouse. Each purpose will be separated from the other by creating individual catalogs in Starburst Galaxy. Next, create an AWS access key that can authenticate the connection between S3 and Starburst Galaxy.

      Getting started with Starburst Galaxy

      After this section is complete, you will have three catalogs and one cluster ready in Starburst Galaxy. 

      First, start by creating an S3 Hive catalog with the credentials you created above. Name the catalog dbt_aws_src. This is the location where you will create a source table from the public AWS COVID-19 data lake.  Next, create a cluster named dbt-aws. Attach the dbt_aws_src catalog as well as the tpch catalog, which was already configured for you. Then, create another S3 catalog named dbt_aws_tgt, but this time use the open table format Iceberg. Add this catalog also to the existing dbt-aws cluster.

      Out of our three source systems, the only source table you need to configure yourself is the enigma table from the AWS COVID-19 data lake. Follow the Starburst Galaxy data lake tutorial to create this enigma_jhu table within the now existing S3 source catalog. 

          1. Configure role-based access control two object storage locations.
            • The public AWS COVID-19 data lake: s3://covid19-lake/*
            • Your newly created S3 bucket: s3://dbt-aws-<username>/*
          2. Create a schema in the source catalog using the object location you created earlier.
            create schema aws_covid_data_lake with (location='s3://dbt-aws-<username>/');
          3. Create the enigma_jhu table.
        CREATE TABLE enigma_jhu (
           fips VARCHAR,
           admin2 VARCHAR,
           province_state VARCHAR,
           country_region VARCHAR,
           last_update VARCHAR,
           latitude DOUBLE,
           longitude DOUBLE,
           confirmed INTEGER,
           deaths INTEGER,
           recovered INTEGER,
           active INTEGER,
           combined_key VARCHAR
        )
        WITH (
           format = 'json',
           EXTERNAL_LOCATION = 's3://covid19-lake/enigma-jhu/json/');
        

        You should now be able to see three catalogs in your cluster.

        Getting started with Snowflake

        Follow the instructions to create and configure a snowflake catalog from your snowflake credentials. Connect to the COVID19 database, and give your catalog a descriptive name such as dbt-snow. Connect this catalog to the existing dbt-aws cluster.

        After completing this step, you should have four catalogs connected to one cluster.

        Getting started with dbt

        Install the dbt-trino adapter plugin, which allows you to use dbt together with Trino and Starburst Galaxy. You may want to do this inside a python virtual environment. Visit the project repository or other helpful links for more instructions on installing dbt.

        After installing dbt, clone the Github repository.  Properly configure your ~/.dbt/profiles.yml file with the accurate information necessary to connect to Starburst Galaxy. Specify the catalog as dbt_aws_tgt and enter a meaningful schema name such as target. This schema is where your models will be created.

        Run the proper validations to verify the connection between dbt and Starburst Galaxy.

        1. dbt debug
        2. dbt deps

        Create your models

        After your connection is validated, but before you create your models using dbt run, you should see a similar cluster explorer as displayed below. All the sources are available to be queried, however the target tables have yet to be created.

        Execute the dbt run command, and watch the models execute.

        Now all the models for the entire data lakehouse have been created and executed. You can query each table individually and explore every reporting structure layer which works together to create the two final aggregated reports.  

        View the final aggregated nation and region reports.

        Execute the tests against each model. Then, create your own tests. Finally, try some other dbt commands to create documentation from the models in the repository.

        Closing Time

        I hope this brought you some clarity in regards to the benefits that occur when you combine two of my favorite technologies in the data space. Try this tutorial out for yourself using Starburst Galaxy, and tell me about what you think.

        Schedule a call with an expert

        Book time