Last Updated: 2024-08-28

Background

Starburst is a powerful data lake analytics platform that lets you discover, govern, and query your data whether it resides in the lake or not. The new streaming ingest feature in Starburst Galaxy unlocks your data from behind slow, complex pipeline configurations, providing a simple user experience for setting up a streaming pipeline.

Today, more than ever, organizations know it is critical to be able to use all of their data when making business decisions. All too often, data is left out of the equation due to the complexity of harnessing it. Data engineers know that setting up and maintaining a streaming pipeline is not only complicated but also brittle because of all the components that must be strung together. This problem is exacerbated by the fact that more and more streaming sources are coming online every day.

With this issue in mind, our team at Starburst knew that it would be essential to add a streaming ingest feature to our platform. In fact, it is considered one of the core pillars of an Icehouse architecture, as outlined in the Icehouse Manifesto.

By leveraging Starburst Galaxy as your data platform for streaming ingest, you give your data engineers the tool they need to ensure no data is left behind!

Scope of tutorial

In this tutorial, you will take on the role of a Data Engineer to build out a streaming ingest pipeline.

You will complete the following steps:

  1. Use Confluent Cloud to configure a Kafka topic and connector to generate some fictional stock trading data.
  2. Configure a connection to the topic in Galaxy.
  3. Configure a live table to stream your data into.
  4. Complete a simple SQL statement on the live table data.

Prerequisites

To complete this tutorial, you will need the following:

  • A Starburst Galaxy account.
  • An existing catalog in Starburst Galaxy where the ingested tables will be created.
  • A Confluent Cloud account.

Learning outcomes

Once you've completed this tutorial, you will be able to:

  • Configure a Kafka topic and a sample-data connector in Confluent Cloud.
  • Connect Starburst Galaxy to a Kafka topic and create a live table.
  • Query continuously ingested data using SQL.

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

Background

You'll begin your work in Confluent Cloud by setting up a cluster. As written in the Confluent documentation, "A Kafka cluster is a group of interconnected Kafka brokers that manage and distribute real-time data streaming, processing, and storage as if they are a single system." A basic cluster will be sufficient for use in this tutorial.

Step 1: Add cluster

When you log in to Confluent Cloud, you should be on the home page, which has a shortcut to create a cluster.

Step 2: Select region

For this tutorial, you must use AWS as the cloud provider and N. Virginia (us-east-1) as the region.

Step 3: Provide payment information (if needed)

If this is your first time using Confluent Cloud, you'll be asked to provide payment information to be used after your free credits expire. The account used for this tutorial was given the option to bypass this step, but that may not be the case for your account.

Step 4: Review and launch

Background

A Kafka topic is a user-defined category that acts as a repository for event messages. Producers write data to topics, and consumers read data from them.

In this section, you will create a topic that Galaxy will later use to continuously read messages and write the parsed results into an Iceberg table.

Step 1: Create new topic

At the end of the previous section, you should have been directed to your new cluster. We'll start from there.

Step 2: Provide name and partitions

Step 3: Skip schema creation

You're going to skip the next section, as there is no need to create a schema for this tutorial. The streaming ingestion feature does not currently support Schema Registry. If you accidentally create a schema in Confluent, the streaming ingestion feature will not work.

Background

A Kafka Connector is a component of Kafka Connect, a framework in Apache Kafka that enables the integration of Kafka with external systems. Kafka Connect simplifies the process of streaming data between Kafka topics and other data systems, such as databases, message queues, or file systems, without the need for custom code.

In this section, you will create a connector that generates synthetic data to publish to the Kafka topic you just created. This synthetic data is what will be ingested into the Iceberg table as part of the streaming ingestion you set up later in Starburst Galaxy.

Step 1: Navigate to Connector plugins

Step 2: Select the topic

In this step, you'll select the topic you just created as part of your connector configuration.

Step 3: Create a Service Account

You're now ready to create a Service Account that your connector will use to securely communicate with your Kafka cluster.

Step 4: Configure connector

In this step, you'll select the desired output record value format and schema.

Step 5: Review connector sizing

Step 6: Complete connector configuration

Background

In this section, you will create an API key that Starburst Galaxy can use to authenticate to your Kafka cluster. When you configure streaming ingestion later in this tutorial, you will input this API key as part of the configuration.

Step 1: Create new key

Step 2: Select service account

The API key you create will be specifically for the service account you created in the previous section of this tutorial.

Step 3: Provide description and copy key

It's very important to provide a meaningful description so that you know the purpose of the key. In addition, it is important to copy and save the key and secret in a safe place, as this will be your only chance to do so.

Background

You're almost finished with the Kafka configuration. You just need to add a few more permissions to your topic to allow Starburst Galaxy to interact with it.

Step 1: Select API key

Step 2: Add first topic ACL

Step 3: Add second topic ACL

Step 4: Review cluster settings

This step simply shows you where to find the bootstrap server information if you need it in the future.

Background

It's time to switch over to Starburst Galaxy, where you'll remain for the rest of this tutorial. In this section, you'll begin by creating a schema to hold your ingest tables. Then, you'll connect to the Kafka topic you created earlier and create a live table for the data.

Step 1: Create a schema

When you create your live table, you'll be required to select a catalog and schema for the table. You should already have the catalog set up. This step will show you how to create a schema within that catalog.

-- Replace <your-catalog> with the name of your existing catalog
CREATE SCHEMA <your-catalog>.streams;
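
If you'd like to confirm that the schema was created, you can list the schemas in your catalog from the Query editor using standard Trino syntax:

SHOW SCHEMAS FROM <your-catalog>;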

Step 2: Begin data ingest connection

Step 3: Select stream source

Note that you have three options. More connection options are planned in future Starburst Galaxy updates.

Step 4: Add source details

It's time to input the API key you created in Confluent Cloud. Make sure to have it handy.

Step 5: Create live table

Now you're going to connect to the stock_trades topic you created in Confluent Cloud to continuously ingest the data into a managed Iceberg table.

Step 6: Map to columns

Starburst Galaxy automatically suggests a table schema by inferring it from the Kafka messages on the topic. You can edit these columns if you'd like, but for this tutorial you can leave them as they are.
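
Once the live table has been created, you can check the inferred schema yourself from the Query editor. This is only a sketch: it assumes you kept the Stock Trades sample data and named the live table stock_trades, so the exact column names may differ in your setup.

DESCRIBE <your-catalog>.streams.stock_trades;
-- With the Stock Trades sample data, expect columns along the lines of
-- side, quantity, symbol, price, account, and userid.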

Background

Now that you've completed your streaming ingest setup, it's time to query the data! Note that it can take a few minutes before the data initially lands. New data will be continuously written to the table every 1-2 minutes.
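
For example, a simple row count is an easy way to confirm that data has started to arrive. The query below assumes the streams schema created earlier and a live table named stock_trades; adjust the names to match your configuration.

SELECT count(*) FROM <your-catalog>.streams.stock_trades;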

Step 1: Explore new table and view

It's time to head back to the Query editor. We'll begin by expanding the streams schema to see not just a new table but a view as well.

Step 2: Learn about the raw table

The raw table includes the parsed message along with additional metadata. This metadata is useful for debugging, whether by the customer or by Starburst support.

The following is a breakdown of the raw table columns and their descriptions (an example query against the raw table follows the list):

  • $system: A unique ID for each row plus the ingested timestamp.
  • raw_key: The Kafka message key, in varbinary.
  • raw_value: The Kafka message value, in varbinary.
  • headers: The Kafka record headers; a single record can carry multiple header fields.
  • timestamp: The timestamp of the Kafka record (different from the ingested timestamp in the $system column). This can be empty.
  • timestamp_type: Denotes the type of the timestamp above. Possible values are no timestamp, create_time, or log_append_time.
  • leader_epoch: The leader epoch of the record; a Kafka offset is a combination of the actual offset number and the leader epoch field.
  • topic: The topic name.
  • partition: The partition index the message came from.
  • offset: The Kafka offset the message came from.
  • parsed: The parsed message (this is a row type that is expanded into the view).
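
If you'd like to inspect this metadata yourself, you can query the raw table directly. The following is only a sketch; it assumes the raw table for a live table named stock_trades is called stock_trades__raw, so check the exact name in the Query editor's object explorer before running it.

SELECT topic, "partition", "offset", "timestamp"
FROM <your-catalog>.streams.stock_trades__raw
ORDER BY "offset" DESC
LIMIT 10;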

Step 3: Learn about the view

The view has the table name you chose during the configuration. It sits on top of the raw table and hides the metadata columns. In the future, this will be an actual table instead of a view.

The following is a breakdown of the view columns:

$system: A unique ID for each row plus the ingested timestamp.

The remaining columns come from the parsed messages. They are derived from the parsed column in the raw table.

Step 4: Query the view

Now it's time to see how easy it is to query the ingested data.
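
For example, assuming your live table is named stock_trades and includes symbol and quantity columns from the sample data, a simple aggregation might look like this:

SELECT symbol, count(*) AS trade_count, avg(quantity) AS avg_quantity
FROM <your-catalog>.streams.stock_trades
GROUP BY symbol
ORDER BY trade_count DESC;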

Step 5: Explore dead letter table

The streaming ingest feature also includes a dead letter table, where messages that cannot be parsed using the table schema are written. You can query this table like any other table to understand why messages could not be ingested. The table name is of the format __raw$errors and must be enclosed in double quotes when querying (see the example query after the column breakdown below).

The messages in this tutorial should always parse successfully, so you won't find any data in this table. In a real production scenario, however, bad messages can make their way onto a topic and end up here.

The following is a breakdown of the dead letter table columns and their descriptions:

  • $system: A unique ID for each row plus the ingested timestamp.
  • error: The error message.
  • topic: The topic name.
  • partition: The partition index the message came from.
  • offset: The Kafka offset the message came from.
  • timestamp: The timestamp of the Kafka record (different from the ingested timestamp in the $system column). This can be empty.
  • timestamp_type: Denotes the type of the timestamp above. Possible values are no timestamp, create_time, or log_append_time.
  • leader_epoch: The leader epoch of the record; a Kafka offset is a combination of the actual offset number and the leader epoch field.
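
As a sketch, the following query assumes a live table named stock_trades, giving a dead letter table named stock_trades__raw$errors. Note the double quotes around the table name, which are required because of the $ character.

SELECT error, topic, "partition", "offset"
FROM <your-catalog>.streams."stock_trades__raw$errors"
LIMIT 10;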

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use streaming ingest in Starburst Galaxy. Stay tuned for more tutorials using this feature!

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
