Last Updated: 2024-08-28
Starburst is a powerful data lake analytics platform that lets you discover, govern, and query your data whether it resides in the lake or not. The new streaming ingest feature in Starburst Galaxy unlocks your data from behind slow, complex pipeline configurations, providing a simple user experience for setting up a streaming pipeline.
Today, more than ever, organizations know it is essential to use all of their data when making critical business decisions. All too often, data is left out of the equation due to the complexity of harnessing it. Data engineers know that setting up and maintaining a streaming pipeline is not only complicated but also brittle, because of all the components that must be strung together. This problem is exacerbated by the fact that more and more streaming sources come online every day.
With this issue in mind, our team at Starburst knew that it would be essential to add a streaming ingest feature to our platform. In fact, it is considered one of the core pillars of an Icehouse architecture, as outlined in the Icehouse Manifesto.
By leveraging Starburst Galaxy as your data platform for streaming ingest, you give your data engineers the tool they need to ensure no data is left behind!
In this tutorial, you will take on the role of a Data Engineer to build out a streaming ingest pipeline.
You will complete the following steps: create a Kafka cluster, topic, and data-generating connector in Confluent Cloud; create a service account and API key for Starburst Galaxy; configure streaming ingestion in Starburst Galaxy; and query the ingested data in an Iceberg table. Once you've completed this tutorial, you will be able to set up a streaming ingest pipeline from a Kafka topic into a managed Iceberg table using Starburst Galaxy.
Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.
As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.
You'll begin your work in Confluent Cloud by setting up a cluster. As written in the Confluent documentation, "A Kafka cluster is a group of interconnected Kafka brokers that manage and distribute real-time data streaming, processing, and storage as if they are a single system." A basic cluster will be sufficient for use in this tutorial.
When you log in to Confluent Cloud, you should be on the home page, which has a shortcut to create a cluster.
For this tutorial, you must use AWS as the cloud provider and N. Virginia (us-east-1) as the region.
If this is your first time using Confluent Cloud, you'll be asked to provide payment information to be used after your free credits expire. The account used for this tutorial was given the option to bypass this step, but that may not be the case for your account.
A Kafka topic is a user-defined category that acts as a repository for event messages. Topics allow producers to send data and consumers to read data from them.
In this section, you will create a topic that Galaxy will later read messages from continuously, writing the parsed messages into an Iceberg table.
At the end of the previous section, you should have been directed to your new cluster. We'll start from there.
You're going to skip the next section, as there is no need to create a schema for this tutorial. The streaming ingestion feature does not currently support Schema Registry; if you accidentally create a schema in Confluent, streaming ingestion will not work.
A Kafka Connector is a component of Kafka Connect, a framework in Apache Kafka that enables the integration of Kafka with external systems. Kafka Connect simplifies the process of streaming data between Kafka topics and other data systems, such as databases, message queues, or file systems, without the need for custom code.
In this section, you will create a connector that generates synthetic data to publish to the Kafka topic you just created. This synthetic data is what will be ingested into the Iceberg table as part of the streaming ingestion you set up later in Starburst Galaxy.
In this step, you'll select the topic you just created as part of your connector configuration.
You're now ready to create a Service Account that your connector will use to securely communicate with your Kafka cluster.
In this step, you'll select the desired output record value format and schema.
In this section, you will create an API key that Starburst Galaxy can use to authenticate to your Kafka cluster. When you configure streaming ingestion later in this tutorial, you will input this API key as part of the configuration.
The API key you create will be tied specifically to the service account you created in the previous section.
It's very important to provide a meaningful description so that you know the purpose of the key. In addition, it is important to copy and save the key and secret in a safe place, as this will be your only chance to do so.
You're almost finished with the Kafka configuration. You just need to add a few more permissions to your topic to allow Starburst Galaxy to interact with it.
This step is just to show you where you can go for Bootstrap server information in the future.
It's time to switch over to Starburst Galaxy, where you'll remain for the rest of this tutorial. In this section, you'll begin by creating a schema to hold your ingest tables. Then, you'll connect to the Kafka topic you created earlier and create a live table for the data.
When you create your live table, you'll be required to select a catalog and schema for the table. You should already have the catalog set up. This step will show you how to create a schema within that catalog.
In the Query editor, run the following statement, replacing <your-catalog> with your actual catalog name:
CREATE SCHEMA <your-catalog>.streams;
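To confirm the schema was created, you can list the schemas in your catalog. This is a minimal sketch, with <your-catalog> again standing in for your actual catalog name:

-- Lists any schema named streams in the catalog.
SHOW SCHEMAS FROM <your-catalog> LIKE 'streams';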
Note that you have three options. More connection options are planned in future Starburst Galaxy updates.
It's time to input the API key you created in Confluent Cloud. Make sure to have it handy.
Now you're going to connect to the stock_trades topic you created in Confluent Cloud to continuously ingest the data into a managed Iceberg table.
Starburst Galaxy automatically suggests a table schema inferred from the Kafka messages on the topic. You can edit the suggested columns if you'd like; for this tutorial, leave them as they are.
Now that you've completed your streaming ingest setup, it's time to query the data! Note that it can take a few minutes before the data initially lands. New data will be continuously written to the table every 1-2 minutes.
It's time to head back to the Query editor. We'll begin by expanding the streams schema to see not just a new table but a view as well. The table is called stock_trades_raw, while the view is called stock_trades. The raw table includes the parsed message along with additional metadata. This metadata is useful for debugging, either by the customer or by Starburst support; a sample query follows the column list below.
The following is a breakdown of the raw table columns and their descriptions:
$system: A unique ID for each row plus the ingestion timestamp.
raw_key: The Kafka key, in varbinary.
raw_value: The message value, in varbinary.
headers: The Kafka record headers; a Kafka event can carry multiple header fields.
timestamp: The timestamp for the Kafka record (different from the ingestion timestamp in the $system column). This can be empty.
timestamp_type: Denotes the type of the timestamp above. Options are no timestamp, create_time, or log_append_time.
leader_epoch: The leader epoch for the record; a Kafka offset is a combination of the actual offset number and the leader epoch field.
topic: The topic name.
partition: The partition index the message came from.
offset: The Kafka offset the message came from.
parsed: The parsed message (this is a row type and is expanded into the view).
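To illustrate how this metadata can help with debugging, here is a minimal sketch of two queries against the raw table. It assumes the <your-catalog> placeholder and the streams schema used earlier in this tutorial; the column names are quoted only to avoid clashes with SQL keywords.

-- Peek at the Kafka metadata for a few ingested records.
SELECT "topic", "partition", "offset", "timestamp", "timestamp_type"
FROM <your-catalog>.streams.stock_trades_raw
LIMIT 10;

-- See how far ingestion has progressed on each partition.
SELECT "partition", count(*) AS message_count, max("offset") AS latest_offset
FROM <your-catalog>.streams.stock_trades_raw
GROUP BY "partition";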
The view has the table name you chose during configuration. It sits on top of the raw table and hides the metadata columns. In the future, this will be an actual table instead of a view.
The following is a breakdown of the view columns:
$system: A unique ID for each row plus the ingestion timestamp.
The remaining columns come from the parsed messages; they are derived from the parsed column in the raw table.
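If you want to see exactly which columns were derived from the parsed messages, you can describe the view. A minimal sketch, assuming the <your-catalog> placeholder and the streams schema from earlier:

-- Shows the view's column names and types, including those derived from the parsed messages.
DESCRIBE <your-catalog>.streams.stock_trades;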
Now it's time to see how easy it is to query the ingested data.
Run a SELECT * FROM query against the view with LIMIT 10.
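Here is a minimal sketch of that query, again using the <your-catalog> placeholder and the streams schema from earlier in this tutorial:

-- Preview the ingested data through the view.
SELECT *
FROM <your-catalog>.streams.stock_trades
LIMIT 10;

-- Re-run a count after a few minutes to confirm that new rows keep arriving.
SELECT count(*) AS row_count
FROM <your-catalog>.streams.stock_trades;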
The streaming ingest feature also includes a dead letter table. This is where messages that cannot be parsed using the table schema are written. You can query this table like any other table, and it can be used to understand why messages could not be ingested. The table name follows a specific format and must be enclosed in quotation marks when querying (see the example query after the column list below). The messages in this tutorial should always parse, so you won't find any data in this table, but in a real production scenario, bad messages can make their way onto a topic and end up here.
The following is a breakdown of the dead letter table columns and their descriptions:
$system: A unique ID for each row plus the ingestion timestamp.
error: The error message.
topic: The topic name.
partition: The partition index the message came from.
offset: The Kafka offset the message came from.
timestamp: The timestamp for the Kafka record (different from the ingestion timestamp in the $system column). This can be empty.
timestamp_type: Denotes the type of the timestamp above. Options are no timestamp, create_time, or log_append_time.
leader_epoch: The leader epoch for the record; a Kafka offset is a combination of the actual offset number and the leader epoch field.
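If you ever need to inspect a dead letter table, the query looks like any other; the generated table name just has to be wrapped in double quotes. The name below is purely a placeholder, so substitute the dead letter table name that Galaxy created for your live table:

-- "<dead-letter-table-name>" is a placeholder for the generated name; note the double quotes.
SELECT "error", "topic", "partition", "offset"
FROM <your-catalog>.streams."<dead-letter-table-name>"
LIMIT 10;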
Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.
Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use streaming ingest in Starburst Galaxy. Stay tuned for more tutorials using this feature!
At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.
Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.
Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!