Query Acceleration
Query acceleration is essential to optimizing SQL query performance on large, petabyte-scale datasets, controlling compute costs, and meeting analyst expectations for SQL response times measured in minutes, seconds, or milliseconds. However, designing efficient and performant queries requires a particular set of data engineering skills many companies don’t have.
In this article, we will review how Starburst’s query acceleration features automatically balance the competing demands of query performance and cost, freeing engineers to focus on higher-priority tasks.
Types of query acceleration for your data lake analytics
Running queries is more than applying a SQL statement to a dataset. Starburst’s data lake analytics platform uses a six-stage procedure that prepares a query to run quickly and efficiently:
- Parse: Turns the SQL statement into a syntax tree and validates its grammar and logic.
- Analyze: Evaluates the syntax tree to validate SQL semantics against metadata about tables, columns, and types.
- Plan: Creates an execution plan based on the syntax tree that provides a first-pass description of the query processing workflow.
- Optimize: Modifies the initial plan with semantics that optimize the query for cost, performance, and other metrics.
- Schedule: Distributes tasks to worker nodes, balancing workloads to avoid disrupting other operations.
- Execute: Monitors progress as worker nodes execute the query plan, adjusting the plan as needed to ensure reliable execution.
Starburst applies the following query acceleration techniques at various points in these query preparation stages to deliver warehouse-like performance directly on your data lake.
Autonomous indexing
Autonomous index acceleration automatically chooses the index type for a specific column and adds the resulting index to the row group to minimize table scans. Indexing improves over time as the optimizer evaluates query patterns.
Since the system can index any column, this technique improves query performance in use cases like fraud detection and customer 360 analytics that must filter multi-dimensional datasets.
In addition, data teams without deep expertise in query optimization can use this autonomous feature to improve query performance without impacting their other responsibilities.
Text search indexing acceleration
Common text search use cases, such as log analysis or pattern matching, require scanning billions of rows. This is one of the reasons why data warehouse sizes and expenses keep growing. Dedicated text analytics applications are poor solutions to this data management problem. Handling search within a query is a more cost-effective approach — provided you can minimize the amount of data the query must touch.
Starburst’s text search acceleration continuously analyzes and indexes data lake content, using Apache Lucene to index text columns. Full support for LIKE semantics lets engineers write SQL statements to find a value’s exact occurrence in a filtered column.
Data caching
Data cache acceleration copies frequently or recently accessed data from the data lake’s object storage as nanoblocks in high-performance solid-state drives. This cache attaches to the query’s cluster, letting workers access this low-latency source rather than going out to the data lake.
Nanoblocks vs. micro-partitions
Micro-partitioning and traditional indexing are commonly used to improve query performance. Micro-partitioning creates blocks of data based on a partitioning key, allowing the query engine to identify blocks containing relevant data. However, this approach is still read-intensive. Traditional indexing approaches are row-optimized. They must read every row to find each column’s data, imposing a significant hit to storage and compute.
Starburst’s proprietary nanoblock indexing and caching technology creates millions of kilobyte-sized columnar nanoblocks, each containing optimized indexes to a small subset of the total data. Workers can quickly read these nanoblocks with less overhead than traditional indexes and micro-partitions. In addition, Starburst’s caching accelerators can dynamically update these nanoblocks without touching data tables.
Materialized views
Database views present the results of a query without creating a new table or changing the database’s schema. Besides avoiding additional storage costs, views offer more flexibility for simplifying data presentations and controlling access. However, the database runs the view’s associated query to refresh the view’s data every time a user accesses the view. Multiple users repeatedly accessing the view requires additional resources and drives compute costs higher, especially when the underlying query is complex.
Materialized views store their queries’ results, trading slightly higher storage costs for a significant reduction in compute. Starburst’s material views work with any object storage catalog and table format, presenting the results in the Iceberg table format for easy access by business intelligence platforms. These materialized views will refresh their data automatically based on schedules set by the data team.
Query optimization
As Starburst prepares a query for execution, it uses various techniques to optimize the query engine’s performance.
1. Table statistics
A suite of over fifty connectors federate enterprise data sources within Starburst’s virtualized access layer. The specific connector and the source’s table schema will determine which of the following statistics the query planner receives:
- Table statistics:
- Row counts
- Column statistics:
- Data size
- Nulls fraction
- Distinct value count
- Low value
- High value
2. Managed statistics
Insufficient statistics from the data source prevents the query planner from making the best decisions, making queries inefficient and more expensive.
Starburst’s managed statistics feature collects stats from these uncooperative sources and stores them as metadata. The query planner can tap into this metadata to get the statistics it needs to develop an effective query plan that optimizes performance and resource consumption.
Managed statistics also makes data more accessible by giving users the information they need for exploration and discovery.
3. Pushdown
Rather than processing queries within a Starburst cluster, pushing queries down to the data source can boost performance while reducing costs. Trino supports seven types of pushdown queries:
- Predicate: Tells the data source to access specific rows.
- Projection: Tells the data source to access specific columns
- Dereference: Tells the data source to read a single field in data rows.
- Aggregation: The source groups and aggregates data before passing it to the connector.
- Join: Lets a source return a joined table rather than copying multiple tables to the cluster.
- Limit: Returns a small subset of a large unsorted dataset.
- Top-N: returns a small subset of a large sorted dataset.
Cost-based optimizations
Starburst’s cost-based optimizer uses table statistics, query syntax, and filters to automatically balance cost with processor, memory, and network performance. It does this by calculating the cost of each operation across multiple versions of the query execution plan.
The system will choose the variant that best optimizes network and compute resources. The query’s creator can specify the “right” balance between cost and performance within Starburst’s interface.
Distributed query engine
Starburst sits between the data consumption and enterprise storage layers, federating a company’s many data sources and giving consumers a single point of access for their analytics workloads.
Many of the query acceleration techniques discussed above are part of the Trino query engine. This open-source technology allows the massively-parallel processing of SQL queries on petabyte-scale datasets.
Starburst enhances Trino’s performant, cost-effective big data analytics with additional optimizations like Starburst Warp Speed’s Smart Indexing and Caching technologies.
How Starburst helps
Many Starburst customers are reducing their reliance on monolithic data warehouses like Snowflake because a federated architecture solves many of the warehouse model’s weaknesses.
Starburst’s federated approach lets you query data where it lives. By comparison, populating a warehouse’s centralized data repository requires copying vast amounts of data from various sources at considerable expense. Moreover, Starburst lets you query any data type, not just the structured data of a warehouse.
Warehouse vendors like Snowflake couple storage and compute within pricing plans that make scaling expensive. Starburst-based data lakehouses operate on commodity object storage with dynamic, affordable scalability.
Using Starburst to build an open data lakehouse increases optionality — you can choose the best open table and data formats for your use cases — over a warehouse’s proprietary structure. This optionality also eliminates the burden of vendor lock-in.
Customers have seen tenfold improvements in query performance after migrating data to Starburst from Snowflake. At the same time, Starburst’s support for ANSI-standard SQL makes enterprise data more accessible to experienced and non-technical data consumers.