Introducing support for UNLOAD in Starburst Galaxy

Streamline data management through seamless file writing without table creation in Starburst Galaxy

Share

In today’s data-driven landscape, data professionals are tasked with efficiently managing and processing vast amounts of data. One recurring challenge in this process involves the constraints around outputting query results in formats that seamlessly integrate with downstream applications.

Organizations often grapple with inefficient mechanisms for compressing these query results, leading to inflated storage costs and compromised performance in data processing pipelines. Similarly, data science teams encounter complexities in directly feeding query outputs into machine learning models without time-consuming table creation processes.

Tackling these obstacles head-on, Starburst is excited to introduce the UNLOAD table function—a transformative solution offering flexibility in output formats, compression options, and direct data feeds for downstream consumption.

UNLOAD table function

The UNLOAD table function is a pre-built table function within the system schema, designed to efficiently handle data output operations. By default, the function is deactivated through access control, requiring the assignment of necessary privileges to execute the UNLOAD function and granting location privileges for the output destination.

The input parameter can accept either a table name or a SELECT query. The separator parameter is applicable only when the format argument is set to CSV or TEXTFILE.

See documentation for further details.

SELECT * FROM TABLE(system.unload(
input => TABLE(...) [PARTITION BY col (, ...)], 
location => '?', 
format => '?'
[, compression => '?']
[, separator => '?']
[, header => true|false]
))

Consider writing files as text file format with `|` delimiter and gzip compressed:

This query generates a list of file paths, row counts, and partition columns. The PARTITION BY clause influences the directory structure of the results. When omitted, the function generates files within the designated directory without creating subdirectories. In this example, the file structure mirrors the format of a Hive table, not Iceberg or Delta Lake. 

Decompressing and Examining Results

Let’s decompress a gzip file and examine the results under the orderdate=1992-01-11/orderstatus=F partition. The partition names and values are exclusively present in the directory names. The generated files do not contain the partition names and values.

18275|206
4998|319
34662|760
39648|653
17540|866
44033|1370
20708|1009
10053|1441

Limitations

The UNLOAD table function allows writing in CSV or TEXTFILE formats, directly to all supported storages such as AWS S3, Azure Data Lake Storage, and Google Cloud Storage. It’s important to note that we are currently in the experimental stage with this functionality. Presently, the function produces a solitary file for either an entire table or a specific partition and does not accommodate batch cluster types. Each format has its own set of constraints. For example, the CSV format exclusively supports VARCHAR columns, and Avro files do not permit special characters in the column names.

To learn more about the UNLOAD table function, see our documentation.