I’m reading the article “Starburst Stargate: The Final Frontier in Analytics Anywhere”
and see that it, too, advertises the ability of Trino / Starburst Stargate to join disparate data sources together, providing insights that would otherwise require setting up sync pipelines from those sources into a monolithic data warehouse.
However, at a certain scale, joining data on the fly runs into the limits of network bandwidth. Say two data sources, each holding multiple GB, need to be joined to answer a query. For an ad-hoc query it is acceptable to join them on the fly, but does this approach still hold up for daily usage?
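For concreteness, the kind of on-the-fly federated join I have in mind looks roughly like the sketch below; the catalog, schema, table, and column names are made up for illustration:

```sql
-- Hypothetical Trino federated join: 'postgres' and 'mysql' are
-- assumed catalog names; tables and columns are illustrative only.
SELECT o.order_id, o.total, c.segment
FROM postgres.sales.orders AS o
JOIN mysql.crm.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2021-01-01';
```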
There are a few different ways to tackle this; both Stargate and Starburst Views were created to assist here. First, Stargate will “push” queries down to the remote cluster, reducing the amount of data returned over the network. Second, views allow you to replicate a table from any source into another connector such as Hive (S3, ADLS, etc.) to increase performance and reduce the amount of remote data you are querying.

It is an optimization problem. If you are reading the full data set multiple times, it likely makes sense to pay the cost of storing a second copy, since network bandwidth is expensive. In practice it is often more nuanced: you are repeatedly reading only the recent data in a large table. This is where Starburst Views help, by maintaining a copy of just the relevant data.
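As a rough sketch of that replication pattern in plain SQL (not the exact Starburst Views mechanism; the hive catalog, schema, and table names here are hypothetical), you could materialize the hot slice of a remote table into object storage and point the daily queries at the copy:

```sql
-- Hypothetical: cache the recent slice of a remote table in the
-- 'hive' catalog (backed by S3/ADLS). All names are made up.
CREATE TABLE hive.cache.orders_recent AS
SELECT *
FROM postgres.sales.orders
WHERE order_date >= current_date - INTERVAL '30' DAY;

-- Daily queries then read the local copy instead of pulling the
-- full remote table over the network on every run.
SELECT count(*) FROM hive.cache.orders_recent;
```

Refreshing that copy on a schedule is the trade-off you pay for avoiding the repeated network transfer.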