We’re currently using Dremio for our data lake querying needs, ingesting data from CSV and Parquet files on S3.
We’re running into some constraints and are considering a switch to a tool like Trino, but have a few questions we’d like answered first.
Issues with Dremio:
- The internal database keeps filling up, and clearing it requires a restart of the whole Dremio deployment. Dremio uses RocksDB internally to track metadata and logs, and that RocksDB instance isn’t scalable, so it can quickly bloat and fill a 150GB+ disk if the logs aren’t cleared in time.
Does Trino have a similar internal store, on either the coordinator or the worker nodes, that isn’t configurable or scalable and would require regular restarts to clear accumulated metadata or cache?
- Inflexible handling of Parquet files read from S3.
In some instances, when we try to parse Parquet files into Dremio, we hit type issues and incompatibilities (see the sketch after this list).
- Cost savings for a long-running cluster.
We have a long-running Dremio cluster that’s always available to service client requests, and as a result we rack up a large EC2 bill for beefy machines. If there’s any way to get similar performance at reduced cost, that would be a significant factor in our decision to switch.
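To make the Parquet issue above concrete, here’s a rough sketch of the kind of schema-drift check we run with pyarrow before ingestion; the file paths and column layout are made up for illustration:

```python
# Rough repro of the schema drift that trips up our Parquet ingestion:
# the same column written with different types by different upstream jobs.
# File paths below are hypothetical.
import pyarrow.parquet as pq

paths = [
    "events/part-0001.parquet",
    "events/part-0002.parquet",
]

# Collect every Arrow type observed for each column name across the files.
types_seen = {}
for path in paths:
    schema = pq.read_schema(path)  # reads only the file footer, so it's cheap
    for field in schema:
        types_seen.setdefault(field.name, set()).add(str(field.type))

# Columns that appear with more than one type are the likely source of
# the type-incompatibility errors we see at ingest time.
for name, types in sorted(types_seen.items()):
    if len(types) > 1:
        print(f"{name}: conflicting types {sorted(types)}")
```

The usual culprit is the same column carrying a different type from file to file. Does Trino’s Parquet reader handle this kind of per-file type drift more gracefully?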