Skip to content

Task lease time should be set dynamically according to the current system load #3214

@stephanie-wang

Description

@stephanie-wang

Describe the problem

Task leases are used to determine when a task should be re-executed: if the node that is supposed to execute the task fails to renew the lease in time, then someone else may re-execute the task. Right now, the task lease time is set to a static value.

This can cause spurious reconstruction, especially in cases when there is a load imbalance in the cluster. For instance, if all nodes but 1 send tasks to a single node (e.g., because that node has some actor that everyone else talks to), then for some load, the single node will eventually not be able to acquire/renew leases in time, due to queuing delay either at the communication sockets between raylets or in the node's event loop. Other nodes should be aware of this and should adjust how long they wait for a task lease to expire accordingly.

Source code / logs

This is likely part of the problem behind #3170.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething that is supposed to be working; but isn't

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions