Starburst sql engine doubt?

emanueol · July 4, 2022, 1:28pm

I understood the idea of Starburst as sql engine able to execute sql joins against multiple databases, and theres some compute vs and/or storage own cluster.
I just would like a simple yes/no confirmation in how sql gets executed:

Does it creates execution plan generating individual sql that will be executed against remote databases, get over the network and stored temporary in Starburst/trino each sql returned rows, and respective after joins merg8ng returned data right?
What does Trino/Starburst uses to connect/execute sql against remote databases (ex. Snowflake, Oracle. …) ? Is it plain simple jdbc sql drivers? Or does it have special connectors to for example get cdc data via remote database redo logs or any sort of event trigger syn sending transactions to Trino/Starburst cluster maintaining a synchronized copy of remote database or selected tables in Trino/Starburst cluster ?
Im basically wondering if sql queries executed uding normal sql drivers than that causes pressure on remote database (legacy databases like Oracle or sql server for example) right? If this architectural design understanding of mine is true than I suppose use case for Trino/Starburst applies more to use modern data warehouses that split compute vs storage, so online business application users dont get affected right? That means snowflake, redshift would apply but not so much oracle, sql server and others monolithic legacy databases right ?

I really enjoyed the Starburst webinar, the data governance, lineage, creation of data products, ability to reduce different people/teams duplicate efforts by developing same data products/queries.

I just would like confirmation on my understanding that access to remote databases is plain simple sql drivers (probably jdbc only ? Maybe odbc also?)

Thanks in advance, and sorry if this information already exist somewhere or if posted in incorrect forum.

Please share links to webpages, videos, documentation etc that can help me understanding the basics 8n hos sql engine works, what work done by cluster etc.

Thanks,
Emanuel Oliveira

victorc · July 10, 2022, 8:08am

Hi Emanuel,

To answer some of your questions:

Starburst/Trino will get data needed for the SQL query from the remote databases in the cluster memory and only 1 execution plan is created for the cluster execution, not for remote data sources. For some datasources, sometimes we get data with SQL pushdown to only get minimum data for the compute in cluster memory. About pushdown Pushdown — Starburst Enterprise
Depends of the databases and data sources, sometimes it is via JDBC, sometimes it’s using the native API of the data source. You can find more details on each connector (with differences between Trino and Starburst version) here Starburst connectors overview — Starburst Enterprise
SQL queries won’t put pressure on the remote database as we just read data we need for the query, most of the execution and compute will be done inside the Trino cluster. So you can perfectly use it for oracle, sql server databases as we don’t ask them to do the compute (compared to other legacy data virtualization solution). The idea is to do all the compute in the cluster but also when needed, to ask the data source to do some filters or aggregation (this can be tuned).

Conclusion : the access to remote database is not every times with a JDBC driver (Trino concepts — Starburst Enterprise) , and we don’t put pressures on data sources. And if you want to optimize and put less pressure, we have things like Materialized Views (or Scan redirection) to cache remote databases in a data lake.
Starburst | Materialized views

Hope it helps
Victor

emanueol · July 10, 2022, 9:56am

Yes i understand you do the joins in your cluster after pulling data from source systems.

From the perspective of remote systems pressure, i suppose simple select … where… is only type of sql then ? So no group bys done in remotes ever ?

Example aggregation of 3 tables join with 2 tables (orders table with 100million rows(10 years) + reference table with 10k rowd in oracle and 3rd driving table in sql server with 1 billion rows).

Please confirm if this is what physically happens:

Trino prepares plan based on metadata and the one year select/where order_dt/groupby.
[THREAD(s)/SESSION(s)] executes select needed columns from oracle.order where order_dt>= 2021-01-01 and order_dt<2022-01-01
THIS WILL PULL the 10 MILLION ROWS for one year of data.
When you say theres no pressure on remote database, i agree if no joins, aggregations or window functions executed. But data still have to he pulled via network.
I can imagine either Trino uses 1 single thread/session against Oracle to execute the SELECT … WHERE one year of orders (10 million rows), or… does Trino creates multiple threads/sessions executing the SELECT pulling different sets if rows?
after network pulling of 10 million rows of selected columns on oracle.orders table+ network pull of oracle.reference table (10k) row, then the selected of one billion rows containing columns to be aggregated (sum, count, etc ) if each row have 1k we talking about executing one or more SELECTS TO pull something over the network 1 trillion bytes correct ?

So saying theres no pressure seems a light statement, at least network traffic will be huge and take time, and then depending using one or more database client sessions to pull data from oracle, even without any aggregation or cpu intensive operations still database will surely fill most of its data buffers in order to serve the query even if simple select…

Again im still rolling my eyes on what type of sql gets executes on remote databases and obviously the pressure exists always, theres bo free lunches… at least for me software engineer that understands well oracle internals etc…

Last step its the one that truly doesn’t put pressure anymore on remote databases,after all needed data pulled and stored temporary on Trino cluster, then joins and group-by can be executed on Trino and results piped to trino client (i suppose you have your own jdbc/odbc sql driver allowing to specify databases names on the sql queries ? I already saw you have a webui to manage data products and colaboration and data governance and even data lineage i think.

Am I completely right in all my understanding correct ?

Thanks in advance ~
Emanuel O