In our previous post, we discussed how to build a data ingestion solution for Iceberg tables using Starburst Galaxy together with Apache Flink and AWS Glue. This was the architecture we used at Starburst to build our internal Iceberg data lake of telemetry data generated by Galaxy. While it met our initial ingestion requirements, it became expensive, hard to manage, and failed to provide the exactly-once guarantees we soon realized we needed.

After speaking with customers, we realized that this experience – building and scaling an ingestion solution – was not unique. Many teams have hit the same roadblocks we faced when stitching together separate technologies. This inspired us to rethink the core building blocks entirely and build a fully-managed streaming solution into our Starburst Galaxy platform. 

This blog post is part one of a three-part series introducing our ingestion capability in Starburst Galaxy. In this first post, we provide an overview of the new capability and a walkthrough of how it works in Starburst Galaxy.

As of today, Starburst’s streaming ingest is available for private preview in select regions on AWS. If you’re interested in being a part of the private preview, apply here.

How data ingestion is typically built

At a high level, every company that has architected its own ingest solution has built some variation of the diagram below.

The lower-level details always differ, but the high-level concepts are the same. First, events are collected in a streaming platform such as Kafka. Then, those events are read out and written to object storage using another tool such as Kafka Connect or Flink. Next, the data is transformed into a format that can be queried efficiently, such as Apache Parquet. Spark is often used for this transformation, as well as to catalog the Parquet data in a metastore such as AWS Glue or a legacy Hive metastore.

Parts of these systems are non-transactional and usually provide only at-least-once processing guarantees, which can introduce duplicates into the data lake and put its freshness and accuracy at risk.
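To make the duplicate problem concrete, here is a minimal sketch in Python. It is purely illustrative and is not how Galaxy or any of the tools above are implemented: the `Record` type and both sink functions are hypothetical names. It simulates a consumer that receives the same messages twice after a retry, then compares a naive sink with one that tracks the last offset written per partition.

```python
from typing import Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a Kafka message."""
    partition: int
    offset: int
    value: str


def ingest_at_least_once(deliveries: Iterable[Record]) -> list[Record]:
    """Naive sink: appends every delivery, so redeliveries become duplicates."""
    return list(deliveries)


def ingest_idempotent(deliveries: Iterable[Record]) -> list[Record]:
    """Sink that remembers the highest offset written per partition and skips replays."""
    committed: dict[int, int] = {}  # partition -> last offset written
    table: list[Record] = []
    for rec in deliveries:
        if rec.offset <= committed.get(rec.partition, -1):
            continue  # already written; a retry delivered it again
        table.append(rec)
        committed[rec.partition] = rec.offset
    return table


# Offsets 1 and 2 are delivered twice, as can happen when a consumer
# restarts before its offset commit has landed.
deliveries = [
    Record(0, 0, "a"), Record(0, 1, "b"), Record(0, 2, "c"),
    Record(0, 1, "b"), Record(0, 2, "c"), Record(0, 3, "d"),
]
naive_rows = ingest_at_least_once(deliveries)
idempotent_rows = ingest_idempotent(deliveries)
print(len(naive_rows), len(idempotent_rows))  # 6 4
```

The sketch keeps the offsets and the table in memory. In a real pipeline the hard part is committing the data and the offsets atomically, which is exactly where non-transactional components fall short.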

But that’s not all! To accomplish those tasks, the team has to write a lot of complex code, such as orchestration logic or handling for schema changes. All of these individual pieces are custom and brittle, and they struggle to scale, let alone meet near real-time performance standards. And as these solutions scale, the infrastructure costs can get out of control, not to mention the toll that a chaotic on-call schedule takes on the engineering team.

Finally, a lot of these custom ingest systems were developed prior to the popularity of Apache Iceberg. Even considering modifying the current ingest system to land the data in Iceberg will make the best data engineer grimace, despite all the wonderful advantages of Iceberg.

What is streaming ingest in Starburst Galaxy?

Starburst’s streaming ingestion is a first-of-its-kind solution for continuously ingesting data from a Kafka-compliant topic into the data lake in near real-time while also guaranteeing exactly-once semantics. Starburst’s Kafka ingestion service allows you to: 

  • Effortlessly connect to your Kafka topics and continuously write data into your data lake in Apache Iceberg format at internet-scale
  • Ensure your data is ready for analysis in near real-time with exactly-once guarantees
  • Experience Kafka as part of the Starburst Galaxy platform, with built-in data governance and data observability

As Trino, the engine underpinning Starburst Galaxy, is built for internet-scale data volumes, we built Starburst’s streaming ingest to handle large volumes of event data in a simple, scalable, and cost effective way. This includes event-generated data like clicks, page loads, UX interactions, ad impressions, sensor data, and more. 

And, while we initially architected this for near real-time data lake use cases, early customer conversations are showing the value of a fully-managed ingestion service for anyone looking to ingest Kafka topics into their lake, even if data freshness is not a requirement. Instead of splitting streaming and batch into separate strategies, which adds complexity downstream, customers can now ingest all their data via a single solution. Simply put, this ingestion service is the simplest and most cost-effective way to ingest Kafka topics into your data lake.

How it works

Streaming ingest is represented in Starburst Galaxy as streams in the left-hand navigation. Each stream represents a connection to a topic in Kafka, Confluent, or another Kafka-compliant system.

When a stream is running, it continuously reads messages from that topic, transforms them into a relational structure, and writes them to an Iceberg table. A stream can be paused and resumed as needed. For example, if something unexpected happens with the upstream system, you can pause the ingestion until the issue is resolved.
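The read, transform, write loop with pause and resume can be pictured with a small Python sketch. This is a toy model under our own assumptions, not Galaxy's implementation: the `Stream` class, its `poll` method, and the in-memory list standing in for a topic are all hypothetical, and we assume a resumed stream continues from the offset where it stopped.

```python
import json


class Stream:
    """Toy model of a stream: reads JSON messages from a topic and appends rows to a table."""

    def __init__(self, topic: list[str]):
        self.topic = topic            # stand-in for a single Kafka topic partition
        self.next_offset = 0          # position of the next message to read
        self.running = True
        self.table: list[dict] = []   # stand-in for the target Iceberg table

    def pause(self) -> None:
        self.running = False

    def resume(self) -> None:
        self.running = True

    def poll(self, max_messages: int = 100) -> int:
        """Read up to max_messages, write them as rows, and return how many were written."""
        if not self.running:
            return 0
        batch = self.topic[self.next_offset:self.next_offset + max_messages]
        for raw in batch:
            self.table.append(json.loads(raw))  # transform: JSON text -> row
        self.next_offset += len(batch)
        return len(batch)


topic = ['{"store_id": 1}', '{"store_id": 2}', '{"store_id": 3}']
stream = Stream(topic)
first = stream.poll(max_messages=2)   # reads offsets 0 and 1
stream.pause()
while_paused = stream.poll()          # nothing is read while paused
stream.resume()
after_resume = stream.poll()          # continues from offset 2
print(first, while_paused, after_resume, len(stream.table))  # 2 0 1 3
```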

Creating a stream is a simple, three-step process. You start by clicking the button to create a new stream, which brings you to a workflow where you configure the Kafka source, the data lake table target, and a descriptor of how to transform the Kafka message into the Iceberg table on the lake.

In order to explain our streaming ingestion solution further, let’s walk through connecting to the pizza orders demo topic in Confluent Cloud.

Step 1: Connect to stream source

Here we enter the Kafka connection, authentication, and topic details. Once we test the connection to make sure that we can connect and read messages, we are brought to the next step in the workflow where we specify the details of the target table in the lake.

 

Step 2: Create stream target

Here we enter the schema and table name of the Iceberg table we want the Kafka data to land in. We gave the table the same name as the Kafka topic, but it can be any name you choose. We simply called the schema ingest, but we recommend thinking about how you want to group your ingested tables and organizing your schema names around that pattern.

Step 3: Map to columns

Now that we have specified the details of the Kafka source and Iceberg target, we need to set up the mapping from the Kafka message to the Iceberg table. With our streaming ingest capabilities, we’re able to automatically infer the schema of the Kafka message. Our pizza_orders topic publishes JSON messages in the form of:

{
  "store_id": 1,
  "store_order_id": 2445,
  "coupon_code": 1976,
  "date": 18943,
  "status": "accepted",
  "order_lines": [
    {
      "product_id": 9,
      "category": "dessert",
      "quantity": 4,
      "unit_price": 24.69,
      "net_price": 98.76
    }
  ]
}

Our mapping will transform it into an Iceberg table of six columns, one for each top-level field in the message. The store_id, store_order_id, and coupon_code columns are BIGINT data types, status is a VARCHAR data type, and order_lines is an ARRAY type.
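For intuition, schema inference over a JSON message can be approximated in a few lines of Python. This is a naive sketch and not Galaxy's inference logic: the `infer_type` function is hypothetical, the type names follow Trino's SQL type syntax, and the sketch simply types the integer date field as BIGINT, which may differ from what Galaxy infers.

```python
import json


def infer_type(value) -> str:
    """Map a decoded JSON value to a SQL type name (naive, single-sample inference)."""
    # bool must be checked before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "VARCHAR"
    if isinstance(value, list):
        # Assume homogeneous arrays and look only at the first element
        element = infer_type(value[0]) if value else "VARCHAR"
        return f"ARRAY({element})"
    if isinstance(value, dict):
        fields = ", ".join(f"{k} {infer_type(v)}" for k, v in value.items())
        return f"ROW({fields})"
    raise TypeError(f"unsupported JSON value: {value!r}")


message = json.loads("""
{
  "store_id": 1,
  "store_order_id": 2445,
  "coupon_code": 1976,
  "date": 18943,
  "status": "accepted",
  "order_lines": [
    {"product_id": 9, "category": "dessert", "quantity": 4,
     "unit_price": 24.69, "net_price": 98.76}
  ]
}
""")

inferred = {column: infer_type(value) for column, value in message.items()}
for column, sql_type in inferred.items():
    print(f"{column}: {sql_type}")
```

A single sample is a weak basis for inference. A production system would need to look at many messages and reconcile conflicts, which is one reason manual overrides matter.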

While our schema inference is highly accurate, there are times when a user needs to manually override the inferred schema. A prime example is domain knowledge or user preference that cannot be inferred, such as whether to unnest a message. A user may want to keep the nested message encapsulated in a single column in the Iceberg table, or they may want to unnest it into multiple columns.

The mapping is editable in the UI, so users can specify how they’d like the target Iceberg table to look. You can unnest messages, change the inferred data type, or even drop specific columns if you don’t want them to land in the lake (e.g. PII data in the message).
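The effect of two such overrides, unnesting and dropping a column, can be sketched in Python. The `apply_mapping` function and its parameters are hypothetical and only illustrate the idea; in Galaxy the mapping is configured in the UI. The sketch assumes that unnesting an array produces one output row per element, with the element's fields promoted to prefixed columns, and it uses coupon_code as a stand-in for a field you might not want in the lake.

```python
def apply_mapping(message: dict, unnest: str, drop: set[str]) -> list[dict]:
    """Flatten one nested array field into rows and drop unwanted top-level fields."""
    # Top-level fields that are kept as-is on every output row
    parent = {k: v for k, v in message.items() if k != unnest and k not in drop}
    rows = []
    for element in message.get(unnest, []):
        # Promote each nested field to a column named <array>_<field>
        child = {f"{unnest}_{k}": v for k, v in element.items()}
        rows.append({**parent, **child})
    return rows


message = {
    "store_id": 1,
    "store_order_id": 2445,
    "coupon_code": 1976,
    "date": 18943,
    "status": "accepted",
    "order_lines": [
        {"product_id": 9, "category": "dessert", "quantity": 4,
         "unit_price": 24.69, "net_price": 98.76},
    ],
}

rows = apply_mapping(message, unnest="order_lines", drop={"coupon_code"})
print(sorted(rows[0]))
```

One consequence of this row-per-element choice is that a message with an empty order_lines array produces no rows at all. Keeping the array in a single column avoids that, which is the kind of trade-off the manual override lets you decide.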

In our case, Galaxy’s automated schema inference did a good job and we don’t have to make any manual changes. That’s it. 

Now that the source, target, and mapping details are set up – we’re ready to query the ingested data!

Step 4: Query the ingested data

Querying the ingested table is no different from querying any other Iceberg table. We just need to make sure the catalog containing the data is attached to the Galaxy cluster we’re using. Streaming ingest provides a dedicated catalog in each region. Once this catalog is connected to the cluster, all future topics we configure for ingest will automatically show up and be queryable.

Looking for more information? Watch this demo video to see how it works in real time.

 

Summary

Ok, so we covered a lot! Let’s summarize. 

We discussed how customers are building complex, custom ingestion systems that follow a common set of patterns. These solutions are custom not because they have to be, but because there is no product on the market that meets their needs today.

We discussed how Starburst Galaxy’s streaming ingestion product fills that gap and provides a scalable and cost-effective way to ingest streaming data into Iceberg tables, including for near real-time requirements. You can configure this in three simple steps in Galaxy: the Kafka-compliant source, the target Iceberg table, and the transformation from source to target.

What’s next

As of today, we’re launching our private preview program. If you’re interested in becoming one of our early testers and want to help us shape our streaming roadmap, apply here.

During the private preview stage, we will continue to harden the existing capabilities and make improvements based on feedback from the customers who have tried it. During public preview and after GA, we plan to integrate it with our materialized views system, allowing our customers to declaratively build pipelines that prepare the ingested data for consumption as a data product.

In the next post in this series, we describe how we are dogfooding streaming ingest for our own internal data lake, which we use for product analytics. In the third post, we’ll take a deeper dive into the technology and provide a more detailed step-by-step tutorial on how to get started. Stay tuned!