Downtime failover estimates

We are using Starburst High Availability (HA) via CloudFormation. Is there an estimate of the downtime when a Coordinator or Worker fails over during events such as an update or a node failure?

It’s really two things:

  1. Recognizing that one of the nodes is down. This may take a minute or two.
  2. AWS spinning up a new EC2 instance. This can take 1 to 4 minutes, after which the node has to boot, start the software, and register with the coordinator. I’ve seen roughly 5 to 14 minutes of total downtime.
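
As a rough sanity check on those figures, the short Python sketch below subtracts the detection and EC2 launch ranges from the observed total to see how much time is left for booting the node, starting the software, and registering. Subtracting the ranges bound by bound is an assumption made here for illustration, and the variable and function names are hypothetical; only the minute values come from the answer above.

```python
# Figures quoted in the answer, in minutes, as (low, high) ranges.
DETECTION = (1, 2)        # noticing that a node is down
EC2_LAUNCH = (1, 4)       # AWS provisioning the replacement EC2 instance
OBSERVED_TOTAL = (5, 14)  # total downtime seen in practice


def implied_remainder(total, *phases):
    """Subtract the known phases from the observed total, bound by bound.

    Assumption: the low ends line up with each other and so do the high
    ends. Real incidents will not decompose this cleanly.
    """
    low = total[0] - sum(phase[0] for phase in phases)
    high = total[1] - sum(phase[1] for phase in phases)
    return low, high


boot_and_register = implied_remainder(OBSERVED_TOTAL, DETECTION, EC2_LAUNCH)
print(boot_and_register)  # (3, 8)
```

Under that assumption, roughly 3 to 8 minutes of the observed downtime would go to node boot, software startup, and registration, so that phase is likely the largest part of the total.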

Also worth noting: when multiple coordinators are started up front, failover is much faster (but you pay for the EC2 instance that runs as a passive Coordinator). Failure of a Worker does not cause an outage, since the cluster keeps accepting queries. Queries run somewhat slower because the cluster has less capacity until the replacement Worker is provisioned.
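
To get a feel for how much capacity is lost while a Worker is being replaced, here is a minimal Python sketch. It assumes identical Workers and capacity that scales linearly with the Worker count; both are simplifications, and the function name and the cluster size in the example are hypothetical.

```python
def remaining_capacity(workers: int, failed: int = 1) -> float:
    """Fraction of cluster capacity left after `failed` Workers drop out.

    Assumes every Worker contributes an equal share of capacity.
    """
    if workers <= 0 or not 0 <= failed <= workers:
        raise ValueError("need workers > 0 and 0 <= failed <= workers")
    return (workers - failed) / workers


# Hypothetical 4-Worker cluster that loses one Worker.
print(remaining_capacity(4))  # 0.75
```

The larger the cluster, the smaller the relative slowdown from losing a single Worker.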

Some of the fault-tolerant execution work should help with task restarts and reduce the downtime that end users see.