Detecting stuck Trino worker pods (Kubernetes)

Hi,
We are running Trino on vanilla Kubernetes with EC2 spot instances (official Trino image).
I have noticed that sometimes pods get stuck. The queries don’t always fail with internal errors, because they hit our max runtime of 4 minutes first,
but it is clear that the query is hanging at 99%-100% because a single pod didn’t finish one of its stages while the other pods finished in a few seconds.
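
A straggling task like that can usually be spotted from the coordinator, e.g. with something along these lines against the built-in system.runtime.tasks table (the server URL and query ID are just placeholders):

# list the tasks still RUNNING for the stuck query, with the node that owns them
trino --server http://<coordinator>:8080 --execute "SELECT node_id, task_id, state, created, last_heartbeat FROM system.runtime.tasks WHERE query_id = '<query_id>' AND state = 'RUNNING'"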

  1. Is there a good/efficient way to prevent this?
  2. If not, is there a way to detect those pods (not manually) and kill them?

Thanks,
Aviv

Hey @aviv, I’m looking for some answers here. Have you tested on regular EC2 instances and seen the same behavior?

Hey @bitsondatadev,
are you referring to “on demand” vs spot instances?
We have regions where we use on demand due to a lack of spot instances,
but the load on those clusters is much smaller.

The same machine specs are used for both on-demand and spot, and both are regular EC2 instances:
96 vCPUs, 192 GB RAM, 25 Gbps network

This is interesting. Have you been able to manually detect such a failing pod so far? If so, did you check the events section by describing the pod to see if a k8s-related error occurred?
If the query, or the pod where it’s running, is getting stuck, I’d look at the logs for the pod, then the container, and then the Trino worker node itself to gather some more information about the issue.
That should help you figure out a starting point to attack this problem and eventually plan how to prevent it in the future.

Here is an example command to describe a pod:

kubectl describe pod nginx-deployment-1006230814-6winp
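
And once you have a suspect worker, something along these lines should pull its recent events and logs (the pod name is just a placeholder; the official image sends the Trino server log to stdout, so kubectl logs should capture it):

# recent k8s events for the suspect pod
kubectl get events --field-selector involvedObject.name=trino-worker-abc123
# last 500 lines of the worker's stdout (the Trino server log in the official image)
kubectl logs trino-worker-abc123 --tail=500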