Marking a plasma manager as dead does not mark its local scheduler as dead. #569

@robertnishihara

Description

The file monitor-008015.err on the head node looks like this.

WARNING:root:Timed out b'plasma_manager'
WARNING:root:Removed b'plasma_manager', client ID 00fb29d393f227ce044542f05065560325fb72fd
WARNING:root:Marked 1274 objects as lost.

The entry of ray.global_state.client_table() for this node is the following.

'172.31.30.57': [
  {'ClientType': 'plasma_manager',
   'DBClientID': '00fb29d393f227ce044542f05065560325fb72fd',
   'Deleted': True},
  {'AuxAddress': '172.31.30.57:11227',
   'ClientType': 'local_scheduler',
   'DBClientID': '46139b8d82494ce2480dfd37d98b05fea6da1984',
   'Deleted': False,
   'LocalSchedulerSocketName': '/tmp/scheduler40743926',
   'NumCPUs': 8.0,
   'NumGPUs': 0.0}]

So the plasma manager has been marked as dead, but the local scheduler on the same node has not.
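To make the inconsistency concrete, here is a minimal sketch that scans a client table in the layout shown above for nodes in exactly this half-dead state. The helper name `find_half_dead_nodes` is hypothetical and not part of Ray's API; it only assumes the dict shape returned by `ray.global_state.client_table()` as printed in this issue.

```python
def find_half_dead_nodes(client_table):
    """Return node IPs whose plasma_manager is marked Deleted
    while a co-located local_scheduler is not."""
    half_dead = []
    for ip, clients in client_table.items():
        dead_manager = any(
            c["ClientType"] == "plasma_manager" and c.get("Deleted", False)
            for c in clients
        )
        live_scheduler = any(
            c["ClientType"] == "local_scheduler" and not c.get("Deleted", False)
            for c in clients
        )
        if dead_manager and live_scheduler:
            half_dead.append(ip)
    return half_dead


# Using the entry from this issue (IDs abbreviated for clarity):
table = {
    "172.31.30.57": [
        {"ClientType": "plasma_manager",
         "DBClientID": "00fb29d393f227ce044542f05065560325fb72fd",
         "Deleted": True},
        {"ClientType": "local_scheduler",
         "DBClientID": "46139b8d82494ce2480dfd37d98b05fea6da1984",
         "Deleted": False},
    ]
}
print(find_half_dead_nodes(table))  # ['172.31.30.57']
```

If the monitor marked both components of a node dead together, this scan would always return an empty list.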

When I run new workloads, it looks like tasks are scheduled on the node with the "dead" plasma manager. Note that when I run `ps aux | grep "plasma_manager "` on the relevant node, the manager process seems to still be alive.

What is the intended behavior here? If Ray thinks that the manager is dead, then shouldn't we stop assigning work to that node?
