Considering swapping to Trino from Dremio, some questions

We’re currently using Dremio to fulfill our data lake querying needs and use it to ingest data from CSV and Parquet files on S3.

We’re running into some constraints and thinking about switching to a tool like Trino but had a few questions we wanted answered beforehand.

Issues with Dremio:

  1. The internal database keeps getting full and requires a restart of the whole Dremio deployment to clear. Dremio uses RocksDB internally to track some metadata and logs, and that RocksDB instance is not scalable, so it can quickly bloat and fill a 150 GB+ disk if the logs aren’t cleared in time.

Does Trino have a similar internal store on either the coordinator or the workers that is not configurable and would require a regular restart to clear some kind of metadata or cache that can fill up and doesn’t scale?

  2. Flexible support for reading Parquet files from S3.

In some instances, when we try to read Parquet files into Dremio, we run into type issues and incompatibilities.

  3. Cost savings for a long-running cluster.

We have a long-running Dremio cluster that’s always available to service client requests, and as a result we rack up a large EC2 bill for beefy machines. If there’s any way to get similar performance at a reduced cost, that would be a significant factor in our decision to switch.

Answer:

#1 Trino doesn’t have an embedded internal database like Dremio’s RocksDB store. The coordinator and workers are essentially stateless: table metadata lives in an external catalog (e.g. a Hive Metastore or AWS Glue), so there’s no internal metadata store that grows unbounded and forces a restart. The main way a Trino node writes significant local data is spill-to-disk for large queries, which is off by default and fully configurable.
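To illustrate, spill behavior is controlled by a handful of worker config properties; a minimal sketch (the path and size limits below are assumed values you’d tune for your hardware):

```properties
# Spill is opt-in; if you never enable it, workers use no local disk for query state.
spill-enabled=true
# Hypothetical local path for spill files (use fast local storage):
spiller-spill-path=/mnt/trino-spill
# Caps so spill can never fill the disk, unlike an unbounded log store:
max-spill-per-node=100GB
query-max-spill-per-node=50GB
```

Because the limits are explicit, a runaway query fails with a spill-limit error rather than silently filling the volume.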

#2 While it depends on the specific compatibility issues, in general, Trino has very good connectivity to data lakes.
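As one concrete example, a common source of Parquet type/schema mismatches is resolving columns by position rather than by name; Trino’s Hive connector exposes a property for this. A minimal catalog sketch, assuming a Glue metastore and a catalog name of our choosing:

```properties
# etc/catalog/datalake.properties — "datalake" is a hypothetical catalog name
connector.name=hive
# Assumes AWS Glue as the metastore; a Thrift metastore would use
# hive.metastore.uri=thrift://... instead.
hive.metastore=glue
# Match Parquet columns by name instead of ordinal position, which avoids
# a common class of schema-evolution mismatch errors:
hive.parquet.use-column-names=true
```

Whether this resolves your specific incompatibilities depends on what Dremio was choking on, but name-based column resolution plus Trino’s per-connector type-mapping docs cover most of the usual cases.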

#3 This may be a good opportunity to consider Starburst Galaxy, which can automatically suspend and resume your clusters so you aren’t paying for idle machines. Alternatively, you could consider AWS Athena, which is fully serverless, though Athena’s price-to-performance ratio is generally worse.