Building a near real-time data lake with Onehouse and Starburst

  • Kyle Weller, Head of Product, Onehouse

  • Matt Fuller, VP of Product, Starburst


As your organization becomes more data driven, the speed of your analytics can make a large difference in the impact you drive for your business. Whether you are sending a retargeting ad to a user who abandoned their cart within minutes instead of hours, catching fraudulent online transactions within seconds, or recalculating supply chain routes as conditions change in real time, these time-sensitive activities and many more can be optimized with real-time analytics.

Unfortunately, the question becomes: how do you easily build and scale a real-time analytics data stack that won’t break the bank? As your business applications gain traction, your data platform will evolve through some common stages. First you explore the data with analytics directly on your operational database, such as Postgres or MongoDB, but soon you start to see signs of strain: queries take longer and bottlenecks appear. Next, you might create a read-only replica of your operational store, but dividing traffic only goes so far before the same limits resurface. At this point you realize you need a proper analytics strategy, and you may consider investing in a DIY data lakehouse build.

What if there was another way? With Onehouse and Starburst, you can get the near real-time experience of an ingestion tool and a data warehouse at a fraction of the cost. Onehouse provides the fastest way to bring data into your lake, and Starburst provides the fastest way to iterate on insights in your data lake.

Throughout this blog, we will guide you through the process of designing and implementing a near real-time data lake with Onehouse and Starburst.

So what can you build with Onehouse and Starburst together?

Let’s say you are building an online ecommerce retail business. The business launches a new product line and the application developers wire up app telemetry from the shopping cart and purchase orders into an operational Postgres database. After the launch your VP of Marketing asks to see this new data reflected in some customer 360 dashboards so they can make quick decisions on how much budget to continue allocating to active campaigns. 

If you try to build a CDC ingestion pipeline on your own it could take weeks to get prioritized for your data engineers and another 4-6 weeks to build the pipelines. For faster time to insights we will show you how any data engineer in your organization can use Onehouse and Starburst together to get this job done in minutes.

First, open your Onehouse console and navigate to the Sources page. Click Add New Source, select Postgres, and provide the connection details for your Postgres database.

After providing some basic connection details to your database you can create a Stream Capture in Onehouse:

Onehouse will deploy Debezium, Kafka, and fully managed Spark jobs in your private networks to ingest and ETL your data from Postgres into a data lakehouse on your S3 buckets.
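The database-side prerequisites are covered during Onehouse source setup, but as a rough sketch, Debezium-style CDC from Postgres relies on logical replication, so the source database typically needs something along these lines (the role, publication, and table names below are hypothetical):

    -- Enable logical replication so change events can be streamed out
    -- (requires a restart; on RDS/Aurora this is set via a parameter group).
    ALTER SYSTEM SET wal_level = logical;

    -- Create a replication user for the ingestion service.
    CREATE ROLE onehouse_cdc WITH LOGIN REPLICATION PASSWORD '...';
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO onehouse_cdc;

    -- Publish the tables to capture (used by the pgoutput plugin).
    CREATE PUBLICATION onehouse_pub FOR TABLE public.shopping_cart_events, public.purchase_orders;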

Under the hood, Onehouse leverages Apache Hudi for incremental processing and to automate away tedious data chores such as clustering, file sizing, indexing, and more. While born from Hudi roots, you’re not tied to only Hudi. Onehouse recently launched a new project called Onetable that allows you to seamlessly interoperate your data between Hudi, Iceberg, and Delta Lake. Pick the format you would like to use, or even expose your tables in multiple formats with zero copying of data, giving your lake maximum optionality. Onetable will soon be released as a standalone open source project with diverse contributors and owners. If you would like early access to the project, reach out to info@onehouse.ai.

The integration between Onehouse and Starburst is handled through a catalog, which makes the data Onehouse ingests into the lake directly available in Starburst. Onehouse syncs your table metadata to the metastore catalogs of your choice, whether that is AWS Glue, an external Hive Metastore (HMS), or others. For this example, let’s use Starburst with an AWS Glue metastore catalog. Sign in to your Starburst account and create an S3 catalog. Follow the documentation to choose a catalog name, set your authentication settings, and choose AWS Glue as the metastore:

The “default table format” setting chooses the format in which data will be written. While Starburst does not yet have write support for Apache Hudi, read support for Hudi works out of the box regardless of which write format you choose below. This Starburst feature, called “Great Lakes” connectivity, automatically infers the table format whether the data is in Hudi, Iceberg, Delta Lake, or legacy Hive-style tables. The default write format simply means you don’t have to specify the intended table format every time you create a table; once Hudi write support lands, Hudi will become an option for the default write format as well. At the time of writing, Hudi write support in Trino is being discussed here:

https://github.com/trinodb/trino/issues/9877
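To make the Great Lakes behavior concrete, here is a rough sketch in SQL; the catalog, schema, table, and column names are hypothetical placeholders for your own setup:

    -- Reads: Great Lakes connectivity detects the table format automatically,
    -- so the Hudi tables Onehouse ingested can be queried like any other table.
    SELECT * FROM s3_lake.onehouse_db.purchase_orders LIMIT 10;

    -- Writes: a table created without an explicit format uses the catalog's
    -- default table format chosen above (for example, Iceberg).
    CREATE TABLE s3_lake.onehouse_db.daily_order_summary AS
    SELECT order_date, count(*) AS total_orders
    FROM s3_lake.onehouse_db.purchase_orders
    GROUP BY order_date;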

Now that the catalog and its permissions are set up, let’s create a Starburst cluster if one doesn’t already exist and choose the S3 catalog you configured in the step above:

Your Onehouse data is now automatically available in Starburst. All future tables and schemas you ingest from any supported Onehouse source will be automatically registered in the Starburst catalog you created. You will see the Onehouse databases and tables in the query editor and in the Starburst Catalog Explorer:
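For example, a quick way to confirm the Onehouse tables have been registered (the catalog and schema names below are hypothetical placeholders for whatever you configured):

    -- List the schemas Onehouse synced into the Glue-backed catalog.
    SHOW SCHEMAS FROM s3_lake;

    -- List the ingested tables and inspect one of them.
    SHOW TABLES FROM s3_lake.onehouse_db;
    DESCRIBE s3_lake.onehouse_db.purchase_orders;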

Now for the real reason we’re here: let’s write some SQL!
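As an illustrative sketch for the marketing scenario above (table and column names are hypothetical; substitute whatever your Stream Captures produced), a query like this surfaces which campaigns convert and which only fill carts:

    -- Carts vs. completed orders per campaign over the last day.
    SELECT
        c.campaign_id,
        count(DISTINCT c.cart_id)                              AS carts_created,
        count(DISTINCT o.order_id)                             AS orders_completed,
        count(DISTINCT c.cart_id) - count(DISTINCT o.order_id) AS carts_abandoned
    FROM s3_lake.onehouse_db.shopping_cart_events AS c
    LEFT JOIN s3_lake.onehouse_db.purchase_orders AS o
        ON o.cart_id = c.cart_id
    WHERE c.event_time > current_timestamp - INTERVAL '1' DAY
    GROUP BY c.campaign_id
    ORDER BY carts_abandoned DESC;

Because Onehouse keeps these tables continuously up to date, re-running the query reflects data ingested minutes ago rather than yesterday’s batch.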

Conclusion

Now that you have seen the steps end to end, let’s recap how simple it is to get started.

  1. Onehouse – Create source connection to Postgres
  2. Onehouse – Create a Stream Capture ingestion pipeline
  3. Starburst – Configure S3 catalog with Glue metastore
  4. Starburst – Create cluster
  5. Starburst – Have a SQL analytics party on your data

These five steps take only minutes to set up, and now you have a continuously updating stream of data from your Postgres database at your fingertips for analytics in Starburst. Onehouse handles all of the advanced data optimizations and table maintenance behind the scenes, allowing you to effortlessly scale your pipelines from gigabytes to petabytes. With Starburst Galaxy in the cloud you get elastically scaling Trino deployments and can also take advantage of fault-tolerant execution and the new Warp Speed clusters.

While this demo showcased ingestion from Postgres, Onehouse supports many more sources, including stream ingestion from Kafka and full CDC replication from popular databases like MySQL and MongoDB. With Onehouse you can ingest your data into centrally managed S3 data lake tables in Hudi, Delta Lake, or Iceberg format, making all data in your organization readily accessible for analytics in Starburst.