We are using Starburst High Availability (HA) via CloudFormation. Do we have an estimate of the downtime when failover happens on the Coordinator/Worker in events like an update or node failure?
It’s really two things:
- Recognize that one of the nodes is down. This may take a minute or two.
- AWS spins up a new EC2 instance. This can take 1-4 minutes, followed by the node booting, the software starting, and the node registering with the coordinator. I’ve seen 5 to 14 minutes or so of total downtime.
Also something to note: when multiple coordinators are started upfront, the failover is much faster (but you pay for the EC2 machine running as a passive Coordinator). Failure of a Worker does not cause an outage, since the cluster is still accepting queries. Queries run a bit slower because you have less capacity until the replacement Worker is provisioned.
Some of the fault-tolerant execution work should help with task restarts and reduce the downtime that end users see.
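If you want a number for your own cluster rather than my rough range, you could time a failover yourself. Here is a minimal Python sketch that polls the coordinator and reports how long it was unreachable. It assumes the coordinator exposes the Trino-style `/v1/info` endpoint with a `starting` field and that you can reach it without extra authentication; `coordinator_is_ready` and `measure_outage` are names I made up for illustration, and the URL in the usage note is a placeholder.

```python
import http.client
import json
import time
import urllib.request


def coordinator_is_ready(base_url, timeout=5.0):
    """Return True if /v1/info answers and reports that startup has finished.

    Assumption: the endpoint returns JSON containing a boolean "starting" field.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/info", timeout=timeout) as resp:
            info = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON body: treat as down.
        return False
    return info.get("starting") is False


def measure_outage(probe, interval=5.0, clock=time.monotonic, sleep=time.sleep):
    """Wait until probe() fails, then until it succeeds again.

    Returns the outage length in seconds, accurate to roughly one polling interval.
    """
    # Phase 1: the cluster is healthy; keep polling until the first failed probe.
    while probe():
        sleep(interval)
    went_down = clock()
    # Phase 2: the cluster is down; keep polling until the first successful probe.
    while not probe():
        sleep(interval)
    return clock() - went_down
```

To use it, start something like `measure_outage(lambda: coordinator_is_ready("http://my-coordinator:8080"))` before you trigger the update or terminate the instance. Note that this only tells you when the coordinator answers again, not when all Workers have re-registered, so treat it as a lower bound on the user-visible downtime.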