Object storage fundamentals
Data storage services like Amazon S3 rely on commodity cloud storage to house vast quantities of data cost-effectively. They store data as objects in a flat structure to maximize storage utilization. Each object carries a unique identifier and descriptive metadata, which let queries locate data without imposing a predefined schema. As a result, object storage can handle any data type or format.
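As a minimal sketch of that model (the bucket name, key, and metadata values below are hypothetical), boto3 stores and retrieves an object by its key along with free-form, user-defined metadata:

```python
# Minimal sketch of the object storage model using boto3.
# The bucket name, key, and metadata values are hypothetical.
import boto3

s3 = boto3.client("s3")

# Store any payload as an object: a unique key plus free-form metadata,
# with no schema imposed on the contents.
s3.put_object(
    Bucket="example-data-lake",
    Key="raw/events/2024-06-01/events.json",
    Body=b'{"event": "page_view", "user_id": 42}',
    Metadata={"source": "web", "ingest-date": "2024-06-01"},
)

# Retrieve the object by its key; the metadata travels with it.
response = s3.get_object(
    Bucket="example-data-lake",
    Key="raw/events/2024-06-01/events.json",
)
print(response["Metadata"])
```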
However, this approach can quickly become unmanageable at scale. Open table and file formats bring the additional structure needed to turn S3’s raw data storage into a data lake for enterprise analytics use cases. Many of these formats are part of the Hadoop ecosystem of data management applications.
File formats
Raw data may arrive at a data warehouse or data lake in a text-based JSON or CSV file format, but that doesn’t mean it has to be stored in that format. More often than not, it shouldn’t. Open file formats are optimized for efficient data storage and retrieval in large, distributed systems.
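As a rough sketch of that conversion step (the file paths below are hypothetical), pyarrow can rewrite a raw CSV landing file into an analytics-friendly Parquet file:

```python
# Convert a raw CSV landing file into Parquet with pyarrow.
# The file paths are hypothetical.
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read the text-based source file into an in-memory columnar table.
table = pv.read_csv("raw/orders_2024-06-01.csv")

# Rewrite it in Parquet, a format built for large-scale analytical reads.
pq.write_table(table, "curated/orders_2024-06-01.parquet")
```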
Column-oriented file formats
A columnar storage format is the best choice for big data analytics platforms where data gets written once but read many times. Structuring files around columns makes processing efficient because analytical queries typically scan only a few columns, such as filtering on a specific date, rather than entire rows.
The Parquet format stores columnar data sequentially, applying the most efficient compression algorithms — such as GZIP, LZO, or Snappy — and encodings — such as run-length encoding (RLE), bit packing, or dictionary encoding — to the data in each column. Parquet also supports complex nested data structures, and queries can read just the fields they need from them, further reducing how much data gets scanned.
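A small pyarrow sketch of both ideas follows (the column names and file path are hypothetical): the writer applies compression and dictionary encoding column by column, and a reader that needs only one column can skip the rest of the file:

```python
# Write a Parquet file with explicit compression and dictionary encoding,
# then read back a single column. Column names and the path are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_date": ["2024-06-01", "2024-06-01", "2024-06-02"],
    "region": ["EMEA", "APAC", "EMEA"],
    "amount": [120.0, 75.5, 310.25],
})

# Compression and dictionary encoding are applied per column.
pq.write_table(table, "orders.parquet", compression="snappy", use_dictionary=True)

# A query that only needs order_date reads just that column's data.
dates = pq.read_table("orders.parquet", columns=["order_date"])
print(dates)
```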
The optimized row columnar (ORC) file format was developed to provide a more efficient alternative to Hadoop’s RCFile format. As such, ORC tends to be found in Hadoop applications that use the Hive table format.
Row-based formats
Operational systems and transactional databases place a higher premium on write performance than analytics platforms do. A format that stores entire rows together in the data file makes these write-heavy systems far more efficient.
Avro is the most common row-based format in the Hadoop ecosystem. This standard stores data in binary format but keeps the file’s schema in easily readable JSON to maximize compatibility. Avro is particularly good at supporting schema evolution to minimize the amount of data that must be rewritten when fields are added, deleted, or changed.
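A rough fastavro sketch of that behavior (the record and field names are hypothetical): records written under an older schema remain readable under a newer schema that adds a field with a default, so existing files do not need to be rewritten:

```python
# Sketch of Avro schema evolution with fastavro.
# The record and field names are hypothetical.
import io
import fastavro

# Version 1 of the schema: two fields.
schema_v1 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

# Version 2 adds an optional field with a default value.
schema_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# Write records under the old schema.
buf = io.BytesIO()
fastavro.writer(buf, schema_v1, [{"id": 1, "name": "Ada"}])

# Read them back under the new schema; the missing field takes its default,
# so the existing data never has to be rewritten.
buf.seek(0)
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'id': 1, 'name': 'Ada', 'email': None}
```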
Starburst and our Great Lakes Connector
Starburst is the enterprise implementation of the Trino query engine. Building upon the open-source project’s core capabilities, Starburst adds connectivity, performance, and cost optimizations, allowing companies to develop an open data lakehouse analytics platform that integrates all enterprise data sources without moving data.
Making object storage architectures transparent
One way Starburst federates enterprise data architectures is through the Great Lakes connectivity feature, which connects Starburst to a storage location rather than a data catalog. Great Lakes automatically reads any of the Iceberg, Delta Lake, Hive, or Hudi table formats it finds in that location, along with the file formats used in those tables:
- Apache Iceberg: Parquet, ORC, and Avro
- Delta Lake: Parquet
- Apache Hive: Parquet, ORC, JSON, CSV, and more
- Apache Hudi: Parquet
Note that Parquet is the one file format shared across the various table formats. That is one reason behind Parquet’s growing popularity.
By abstracting the details about each location’s table and file formats, Starburst lets users access data without knowing the intricacies of the underlying object storage configuration. Users no longer need to choose different catalogs for different table formats. They simply access the data by issuing a SQL select query against the table. Even better, they can easily write queries that join data from different table formats.
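As a rough sketch of what that looks like in practice (the host, catalog, schema, and table names below are hypothetical), the Trino Python client issues the same SELECT and JOIN statements no matter which table format backs each table:

```python
# Sketch of querying lake tables without knowing their table formats.
# The host, catalog, schema, and table names are hypothetical, and
# authentication details are omitted.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",
    port=443,
    http_scheme="https",
    user="analyst",
    catalog="lake",
    schema="sales",
)
cur = conn.cursor()

# A plain SELECT works regardless of the table's underlying format.
cur.execute("SELECT order_id, amount FROM orders LIMIT 10")
print(cur.fetchall())

# Tables stored in different formats (say, Iceberg and Delta Lake) join
# with ordinary SQL.
cur.execute("""
    SELECT o.order_id, c.segment
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")
print(cur.fetchall())
```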
Reading Parquet files faster
The Trino project has continually improved its Parquet reader to increase read throughput. Complex datasets with hundreds of nested fields can slow queries considerably by forcing the query to scan the entire column even though it only needs to see a few fields. This unnecessary work consumes more compute resources while increasing query latency. Trino’s Parquet reader uses dereference pushdown and other techniques to limit reads to the subset of relevant data.
While Trino has made significant performance improvements over the years, Starburst took things a step further by developing a new Parquet reader that increases read speed while reducing compute costs, making average query performance 20% faster.
Starburst delivers additional performance improvements through dynamic filtering. Besides pruning buckets and skipping data segments within the data file, Starburst applies selective filters at the row level to significantly reduce compute and network resource consumption.
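The hedged sketch below shows the query shapes these optimizations target (the connection details, tables, and columns are hypothetical): the first statement touches a single nested field, the case dereference pushdown serves without scanning the whole column, and the second joins a large table to a selectively filtered one, the pattern where dynamic filtering prunes rows early:

```python
# Sketch of queries that benefit from dereference pushdown and dynamic
# filtering. Connection details, tables, and columns are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",
    port=443,
    http_scheme="https",
    user="analyst",
    catalog="lake",
    schema="web",
)
cur = conn.cursor()

# Dereference pushdown: only the city subfield of a nested struct is needed,
# so only that slice of the Parquet column has to be read.
cur.execute("SELECT visitor.address.city FROM page_views LIMIT 100")
print(cur.fetchmany(10))

# Dynamic filtering: the filtered dimension side of the join prunes
# fact-table rows early, cutting compute and network usage.
cur.execute("""
    SELECT v.page_url, s.campaign
    FROM page_views v
    JOIN sessions s ON v.session_id = s.session_id
    WHERE s.campaign = 'spring_launch'
""")
print(cur.fetchmany(10))
```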
Improving ELT data processing
Trino’s original purpose was to make ad hoc and interactive analytical queries faster and more efficient. By enabling faster Parquet read times and unified table and file access, Starburst extends its underlying Trino capabilities to streamline ELT data processing workflows. Great Lakes connectivity and Starburst’s suite of over fifty enterprise storage system connectors let engineers query multiple sources from a single SQL statement. These and other Starburst features make data pipelines faster to develop, easier to maintain, and more performant.
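As a rough illustration of that federation (the catalog, schema, and table names below are hypothetical), a single statement can join a data lake table with a table from an operational database exposed through another connector:

```python
# Sketch of a federated query joining a data lake table with a table from
# another connected source. All names are hypothetical, and authentication
# details are omitted.
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",
    port=443,
    http_scheme="https",
    user="data_engineer",
)
cur = conn.cursor()

# One SQL statement reads from two catalogs: an object storage data lake
# ("lake") and an operational PostgreSQL database ("crm").
cur.execute("""
    SELECT o.order_id, o.amount, c.account_manager
    FROM lake.sales.orders o
    JOIN crm.public.accounts c ON o.account_id = c.account_id
    WHERE o.order_date >= DATE '2024-01-01'
""")
print(cur.fetchmany(10))
```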