Task lease time should be set dynamically according to the current system load #3214
Description
Describe the problem
Task leases are used to determine when a task should be re-executed: if the node that is supposed to execute the task fails to renew its lease in time, another node may re-execute the task. Right now, the task lease time is set to a static value.
This can cause spurious reconstruction, especially when load is imbalanced across the cluster. For instance, if all nodes but one send tasks to a single node (e.g., because that node hosts an actor that every other node talks to), then at a high enough load that node will eventually be unable to acquire or renew leases in time, because of queuing delay either at the communication sockets between raylets or in the node's event loop. Other nodes should be aware of this and adjust how long they wait for a task lease to expire accordingly.
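One possible shape for the adjustment, as a rough sketch only: each node could track how late lease renewals have recently been and widen its wait accordingly. The estimator below borrows the smoothed-mean-plus-deviation approach used for TCP retransmission timeouts; it is not existing Ray code. The class name `AdaptiveLeaseTimeout`, its parameters, and the idea of feeding it observed renewal delays in milliseconds are all assumptions made for illustration.

```python
class AdaptiveLeaseTimeout:
    """Hypothetical helper (not part of Ray): estimates how long to wait
    for a task lease to expire, based on recently observed renewal delays."""

    def __init__(self, base_ms=100.0, min_ms=100.0, max_ms=60_000.0,
                 alpha=0.125, beta=0.25):
        self.base_ms = base_ms    # assumed static lease time used today
        self.min_ms = min_ms      # never wait less than this
        self.max_ms = max_ms      # never wait more than this
        self.alpha = alpha        # smoothing gain for the mean delay
        self.beta = beta          # smoothing gain for the deviation
        self.smoothed_ms = None   # running mean of observed delays
        self.deviation_ms = 0.0   # running mean absolute deviation

    def observe(self, delay_ms):
        """Record how late a lease renewal was observed to be."""
        if self.smoothed_ms is None:
            # First sample: seed the estimator.
            self.smoothed_ms = delay_ms
            self.deviation_ms = delay_ms / 2.0
            return
        error = abs(delay_ms - self.smoothed_ms)
        self.deviation_ms = (1 - self.beta) * self.deviation_ms + self.beta * error
        self.smoothed_ms = (1 - self.alpha) * self.smoothed_ms + self.alpha * delay_ms

    def timeout_ms(self):
        """Lease wait time: static base plus a load-dependent margin, clamped."""
        if self.smoothed_ms is None:
            return self.base_ms
        estimate = self.base_ms + self.smoothed_ms + 4.0 * self.deviation_ms
        return max(self.min_ms, min(self.max_ms, estimate))


if __name__ == "__main__":
    timeout = AdaptiveLeaseTimeout()
    for delay in (10.0, 50.0, 400.0, 800.0):  # renewals arriving later and later
        timeout.observe(delay)
        print(f"observed delay {delay:.0f} ms, wait {timeout.timeout_ms():.1f} ms")
```

With no samples the estimator falls back to the static base value, so behavior on an idle cluster would be unchanged; the upper clamp keeps a genuinely failed node from delaying reconstruction indefinitely. How renewal delays would actually be measured and shared between raylets is left open here.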
Source code / logs
This is likely part of the problem behind #3170.