Build a file ingest pipeline


1. Tutorial overview

Last Updated: 2025-02-10

Background

Starburst is a powerful data lake analytics platform that enables you to discover, govern, and query data, whether it resides in a data lake or elsewhere.

Today, organizations recognize the importance of leveraging all their data to make informed business decisions. JSON is one of the most widely used semi-structured file formats for batch data delivery, but it was not designed with analytics in mind. The Iceberg table format, together with the Parquet file format, was designed for analytics. The process of transforming JSON files into Iceberg tables and maintaining a continuous data pipeline can be complex and time-consuming. While third-party applications exist to address this challenge, they are standalone solutions that require manual integration and can be costly to maintain.

To solve this problem, our team at Starburst introduced a micro-batch file ingest feature as a core part of our platform. This capability is a fundamental pillar of Icehouse architecture, as outlined in the Icehouse Manifesto.

By using Starburst Galaxy for batch file ingestion, data engineers gain a seamless, efficient solution that ensures organizations can realize maximum value from their data.

Scope of tutorial

In this tutorial, you will take on the role of a Data Engineer to build out a batch file ingest pipeline.

You will complete the following steps:

  1. Copy sample NDJSON files to your S3 bucket.
  2. Configure a connection in Galaxy to monitor the S3 bucket for new files.
  3. Set up an Iceberg-format live table to continuously ingest your data.
  4. Run a simple SQL statement on the live table data.

Prerequisites

  • You need a Starburst Galaxy account to complete this tutorial. Please see Starburst Galaxy: Getting started for instructions on setting up a free account.
  • You need an AWS S3 catalog in Starburst Galaxy that connects to an S3 bucket location for which you have read and write access. This catalog must be in the us-east-1 region. It must also be connected via a cross-account IAM role.

Learning outcomes

Once you've completed this tutorial, you will be able to:

  • Configure batch file ingest in Starburst Galaxy.

About Starburst tutorials

Starburst tutorials are designed to get you up and running quickly by providing bite-sized, hands-on educational resources. Each tutorial explores a single feature or topic through a series of guided, step-by-step instructions.

As you navigate through the tutorial you should follow along using your own Starburst Galaxy account. This will help consolidate the learning process by mixing theory and practice.

2. Use Starburst Galaxy to configure file ingest

Working with NDJSON in S3

In this section, you'll start by copying an NDJSON (*.ndjson) file into your S3 bucket.

Newline-delimited JSON (NDJSON) is a format in which each line of a file or stream is a separate, valid JSON object. This structure lets you process large datasets line by line, whereas standard JSON requires parsing the entire file at once, a challenge for very large files.
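
For example, a two-row NDJSON file looks like this (illustrative records only); note that each line parses independently, so a reader can stream the file one record at a time:

{"flight_date": "2024-12-08", "flight_status": "active"}
{"flight_date": "2024-12-09", "flight_status": "scheduled"}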

Why use NDJSON?

APIs, sensors, and other data sources commonly deliver data in JSON format, so converting JSON to NDJSON is a routine data engineering task. If you're new to this conversion, free online tools like Konbert can assist.

Converting JSON to NDJSON with Python

One of the most efficient ways to perform this transformation is with the third-party ndjson package for Python (installed with pip install ndjson). The following minimal script reads a JSON file and outputs an NDJSON file:

# The ndjson package mirrors the standard json API.
import json, ndjson

# Load the whole JSON array, then write one JSON object per line.
with open("flights.json") as f, open("flights.ndjson", "w") as o:
    ndjson.dump(json.load(f), o)

Handling nested JSON in API responses

Many API responses contain a structured JSON payload where the relevant data is nested under a "data" key. To extract only the necessary information, modify the script to include ["data"] when loading the JSON:

import json, ndjson

# Keep only the list under the top-level "data" key before writing NDJSON.
with open("flights.json") as f, open("flights.ndjson", "w") as o:
    ndjson.dump(json.load(f)["data"], o)

This small adjustment ensures that only the essential data is written to the NDJSON file, optimizing it for downstream processing.

Try it yourself

To experiment with this process, use the provided flights.json file in this lesson and test the conversion code on your own system.

Step 1: Create a schema

When you create your live table, you'll be required to select a catalog and schema for the table. You should already have the catalog set up. This step will show you how to create a schema within that catalog.

  • Log in to your Starburst Galaxy account.
  • Click to expand the left-hand navigation menu, and select Query>>Query editor.
  • Run the following SQL to create the schema. Be sure to replace <your-catalog> with your catalog name.
CREATE SCHEMA <your-catalog>.flight_data;
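
To confirm the schema was created, you can list the schemas in your catalog (standard Trino syntax):

SHOW SCHEMAS FROM <your-catalog>;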

Step 2: Copy the flights.ndjson file to your S3 bucket

You can download the flights.ndjson file here.

  • Select or create an S3 bucket in region us-east-1.
  • Create a new directory called file-ingest in your S3 bucket.
  • Create a directory called flight-data within the file-ingest directory.
  • Create a directory called flights within the flight-data directory.
  • Upload the flights.ndjson file into the flights directory, either through the AWS console or with a script like the sketch below.
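
If you'd rather script the upload, here is a minimal sketch using boto3; the bucket name is a hypothetical placeholder, and the key matches the directories created above:

import boto3

# Hypothetical bucket name; use your own us-east-1 bucket.
BUCKET = "my-ingest-bucket"

s3 = boto3.client("s3", region_name="us-east-1")

# S3 has no real directories; writing to this key creates the
# file-ingest/flight-data/flights/ prefix implicitly.
s3.upload_file("flights.ndjson", BUCKET, "file-ingest/flight-data/flights/flights.ndjson")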

Step 3: Begin data ingest connection

  • Return to Starburst Galaxy.
  • From the left-hand navigation menu, select Data>>Data ingest.
  • Click the Connect new source button.

Step 4: Select source

Notice that you have four options for source. More connection options are planned in future Starburst Galaxy updates.

  • Select the Amazon S3 tile.
  • Click Next.

Step 5: Add source details

  • Provide a meaningful Source name (ex. file_ingest_loc_flight_data).
  • Provide a meaningful Description.
  • Enter the name of your S3 bucket.
  • Under S3 file prefix, enter the directory prefix where your flight data is located (ex. file-ingest/flight-data). It is important to ensure you don't include leading or trailing forward slashes. Those are added automatically for you.
  • Notice that the S3 path is being built and displayed under the boxes.
  • Under Authenticate with, select the radio button for the method you wish to use. Remember that you must currently use a cross-account IAM role to use the catalog with file ingest.
  • Click the Test connection button.
  • Confirm you receive the message Hooray! Able to successfully connect.
  • Click Save new source.
  • Click Yes, create a live table.

Step 6: Configure live table

Now you're going to configure a live table to ingest data from the flights/flights.ndjson file.

  • Use the Catalog dropdown menu to select your catalog. Recall that this must be the catalog you chose for the schema you created earlier.
  • Use the Schema dropdown menu to select your schema.
  • Provide a meaningful Table name. We suggest you use the directory name (ex. flights).
  • In the box under Enter source S3 suffix, type the name of the directory that contains your NDJSON file (ex. flights). After you provide the suffix, the prefilled example below the box will update to include the directory suffix.
  • Select your desired polling interval. Galaxy will poll the directory at this interval, looking for new NDJSON files to ingest.
  • Scroll down, and click Test connection.
  • Confirm you receive the message Hooray! Able to successfully connect.
  • Click Map columns.

Step 7: Upload record sample

On the left side of the screen, you'll find the Edit Column Mapping section. As you scroll down, you'll notice that the Data Type field is blank for /aircraft and /live.

Galaxy determines column types by sampling the first row of the NDJSON file. However, like many real-world datasets, the provided file contains NULL values in some columns. When a column has a NULL value in the first row, its data type cannot be automatically determined.

To resolve this, you must provide a Record Sample—an example row where all columns contain actual values. You can create this by copying the first row of your file and replacing each NULL with a realistic value.

This Record Sample is used only for defining data types and will not be included in the ingested data.
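
If you're working with your own large files, a short script can identify which top-level keys are null in the first record; a minimal sketch (you still supply realistic replacement values by hand):

import json

# Read only the first record of the NDJSON file; any top-level key
# that maps to None is a column Galaxy cannot infer a type for.
with open("flights.ndjson") as f:
    record = json.loads(f.readline())

null_keys = [k for k, v in record.items() if v is None]
print("Provide realistic sample values for:", null_keys)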

Below is the Record sample you'll be using in this section.

{"flight_date": "2024-12-08", "flight_status": "active", "departure": {"airport": "Auckland International", "timezone": "Pacific/Auckland", "iata": "AKL", "icao": "NZAA", "terminal": "I", "gate": "10", "delay": 19, "scheduled": "2024-12-08T00:55:00+00:00", "estimated": "2024-12-08T00:55:00+00:00", "actual": "2024-12-08T01:14:00+00:00", "estimated_runway": "2024-12-08T01:14:00+00:00", "actual_runway": "2024-12-08T01:14:00+00:00"}, "arrival": {"airport": "Singapore Changi", "timezone": "Asia/Singapore", "iata": "SIN", "icao": "WSSS", "terminal": "1", "gate": "B4", "baggage": "D06", "delay": 100, "scheduled": "2024-12-08T06:30:00+00:00", "estimated": "2024-12-08T06:30:00+00:00", "actual": "2024-12-08T06:30:00+00:00", "estimated_runway": "2024-12-08T06:30:00+00:00", "actual_runway": "2024-12-08T06:30:00+00:00"}, "airline": {"name": "Singapore Airlines", "iata": "SQ", "icao": "SIA"}, "flight": {"number": "282", "iata": "SQ282", "icao": "SIA282", "codeshared": {"airline_name": "singapore airlines", "airline_iata": "sq", "airline_icao": "sia", "flight_number": "282", "flight_iata": "sq282", "flight_icao": "sia282"}}, "aircraft": {"registration": "9V-SMD", "iata": "A359", "icao": "A359", "icao24": "76CDA4"}, "live": {"updated": "2024-12-07T19:45:21+00:00", "latitude": -5.0164, "longitude": 116.262, "altitude": 12192, "direction": 297, "speed_horizontal": 964.892, "speed_vertical": 0, "is_ground": false}}
  • Copy the JSON Record sample above.
  • Click the upload arrow button to the right of Record sample.
  • Delete all the data in the Record Sample box.
  • Paste what you copied above into the Record Sample box.
  • Click Load sample.
  • Confirm that the data types for /aircraft and /live have been updated to ROW.

Step 8: Map to columns

Starburst Galaxy automatically generates a suggested table schema and column names by inferring them from the JSON keys. You can edit these if needed.

Additionally, you have the option to Unnest columns that contain nested data. The example file includes JSON objects nested within the top-level objects. If you choose to keep these nested, you'll need to use dot notation (ex. column.key) to query the data, which is a valid approach if it suits your needs. In this lesson you'll run SQL on both nested and unnested columns so you can see the differences between them.

For this exercise, we'll guide you through Unnesting one of the columns so you can learn how the process works.

  • Click Unnest under /aircraft.
  • Four new columns will appear: /aircraft/registration, /aircraft/iata, /aircraft/icao, and /aircraft/icao24.
  • Each newly created column is named after the last portion of its original key. If this is the only column you Unnested, the default names would work. However, some of these column names also appear in other nested structures, which could cause naming conflicts.
  • To ensure all column names remain unique, you'll need to prefix each one with "aircraft_". For example, rename registration to aircraft_registration. Apply this same prefix to the remaining three columns.
  • Click Create schema.
  • Wait for the Live table status to transition from Starting to Running.

3. Query live table

Background

Now that you've completed your batch file ingest setup, it's time to query the data! It may take a few minutes before the data initially lands. New data will be continuously written to the table based on the interval you set earlier.

Step 1: Explore new table and view

It's time to head back to the Query editor. We'll begin by expanding the flight_data schema to see both a new table and a new view.

  • Navigate to the Query editor.
  • The catalog explorer is on the left side of the Query editor. Expand the catalog you are using for batch file ingest, then expand the schema that you created for this tutorial.
  • You should see a table and a view listed under the schema. The view is denoted by an eye icon. In our example, the table is called flights__raw, while the view is called flights.
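
Once the first batch has landed, a quick count confirms that rows are flowing into the view (using the example view name flights):

SELECT count(*) AS row_count FROM flights;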

Step 2: Learn about the raw table

The raw table contains the parsed file contents along with additional metadata. This metadata is useful for debugging, whether by you or by Starburst Support.

The following is a breakdown of the raw table columns and their descriptions:

  • $system: A unique ID for each row plus the ingestion timestamp.
  • raw_file_contents: The raw JSON row from the file.
  • source_file_path: The complete S3 URI of the file used to ingest the row.
  • source_file_size: The size of the file at the time the row was ingested.
  • source_file_last_modified: Timestamp for the modified date of the file when the row was ingested.
  • source_file_split_offset: Denotes any offset used to ingest the row.
  • source_file_row_position: The row number from the raw file matching the row in the table. This is the column the view is built over.
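
If you want to inspect this metadata yourself, you can query the raw table directly (using the example table name flights__raw):

SELECT source_file_path, source_file_row_position
FROM flights__raw
LIMIT 5;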

Step 3: Learn about the view

The view takes the table name you chose during configuration. It sits on top of the raw table and hides the metadata columns.

The following is a breakdown of the view columns:

  • $system: A unique ID for each row plus the ingestion timestamp.
  • The remaining columns come from the parsed records; they are derived from the parsed column in the raw table.
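
To see the full column list, you can describe the view (standard Trino syntax, using the example view name flights):

DESCRIBE flights;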

Step 4: Query the view

Now it's time to see how easy it is to query the ingested data.

  • Click the ellipsis (...) next to the view, and select the menu option to SELECT * FROM the view with LIMIT 10 (the generated statement is shown below).
  • Run the query.
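
For reference, the generated statement will look something like this (assuming the view name flights):

SELECT * FROM flights LIMIT 10;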

Step 5: Query the structured row columns

When ingesting this file, we chose not to unnest all columns containing nested data. One such column is departure.

Now, suppose you want to retrieve a list of all departure airports. To do this, you'll need to reference the airport key within the departure column. Fortunately, this is a simple process.

  • Set your cluster, catalog and schema at the top of the SQL editor.
  • Copy the following SQL into the editor and click Run.
SELECT
  DISTINCT departure.airport AS airport
FROM
  flights;

Notice that you can refer to keys within the column by adding a period followed by the key name (ex. .airport).
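
You can also query the columns you unnested earlier. Because the /aircraft fields became ordinary top-level columns, no dot notation is needed; a sketch, assuming you applied the aircraft_ prefix suggested in Step 8 of the previous section:

SELECT DISTINCT
  aircraft_registration,
  aircraft_iata
FROM
  flights;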

4. Tutorial wrap-up

Tutorial complete

Congratulations! You have reached the end of this tutorial, and the end of this stage of your journey.

Now that you've completed this tutorial, you should have a better understanding of just how easy it is to use batch file ingest in Starburst Galaxy. Stay tuned for more tutorials using this feature!

Continuous learning

At Starburst, we believe in continuous learning. This tutorial provides the foundation for further training available on this platform, and you can return to it as many times as you like. Future tutorials will make use of the concepts used here.

Next steps

Starburst has lots of other tutorials to help you get up and running quickly. Each one breaks down an individual problem and guides you to a solution using a step-by-step approach to learning.

Tutorials available

Visit the Tutorials section to view the full list of tutorials and keep moving forward on your journey!
