Closed
Description
The file monitor-008015.err on the head node looks like this:

```
WARNING:root:Timed out b'plasma_manager'
WARNING:root:Removed b'plasma_manager', client ID 00fb29d393f227ce044542f05065560325fb72fd
WARNING:root:Marked 1274 objects as lost.
```
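These warnings suggest the monitor uses a heartbeat timeout: a client that misses heartbeats for too long is removed from the client table and its objects are marked lost. A minimal sketch of that kind of check (hypothetical names and threshold, not the actual monitor code):

```python
import time

HEARTBEAT_TIMEOUT_S = 30.0  # hypothetical threshold, not Ray's actual value

def find_timed_out(last_heartbeat, now=None):
    """Return client IDs whose last heartbeat is older than the timeout.

    `last_heartbeat` maps client ID -> timestamp of the last heartbeat seen.
    """
    now = time.time() if now is None else now
    return [cid for cid, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT_S]

beats = {'plasma_manager@172.31.30.57': 0.0,
         'local_scheduler@172.31.30.57': 95.0}
print(find_timed_out(beats, now=100.0))
# -> ['plasma_manager@172.31.30.57']
```

Note that a timed-out heartbeat only means the monitor stopped hearing from the process, not that the process itself has exited.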
The entry in `ray.global_state.client_table()` for this node is the following:

```python
'172.31.30.57': [
    {'ClientType': 'plasma_manager',
     'DBClientID': '00fb29d393f227ce044542f05065560325fb72fd',
     'Deleted': True},
    {'AuxAddress': '172.31.30.57:11227',
     'ClientType': 'local_scheduler',
     'DBClientID': '46139b8d82494ce2480dfd37d98b05fea6da1984',
     'Deleted': False,
     'LocalSchedulerSocketName': '/tmp/scheduler40743926',
     'NumCPUs': 8.0,
     'NumGPUs': 0.0}]
```
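For what it's worth, the inconsistent state can be detected mechanically by looking for nodes where some clients are `Deleted` and others are not. A minimal sketch, using a literal dict shaped like the `client_table()` output above (`inconsistent_nodes` is a hypothetical helper, not a Ray API):

```python
# Hypothetical stand-in for the output of ray.global_state.client_table().
client_table = {
    '172.31.30.57': [
        {'ClientType': 'plasma_manager',
         'DBClientID': '00fb29d393f227ce044542f05065560325fb72fd',
         'Deleted': True},
        {'ClientType': 'local_scheduler',
         'DBClientID': '46139b8d82494ce2480dfd37d98b05fea6da1984',
         'Deleted': False},
    ],
}

def inconsistent_nodes(table):
    """Return (node IP, deleted client types) for nodes where some
    clients are marked Deleted while others on the same node are not."""
    bad = []
    for ip, clients in table.items():
        deleted = {c['ClientType'] for c in clients if c.get('Deleted')}
        alive = {c['ClientType'] for c in clients if not c.get('Deleted')}
        if deleted and alive:
            bad.append((ip, sorted(deleted)))
    return bad

print(inconsistent_nodes(client_table))
# -> [('172.31.30.57', ['plasma_manager'])]
```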
So the plasma manager has been marked as dead, but the local scheduler on the same node has not.
When I run new workloads, tasks appear to be scheduled on the node with the "dead" plasma manager. Note that when I run `ps aux | grep "plasma_manager"` on the relevant node, the manager process seems to still be alive.
What is the intended behavior here? If Ray thinks that the manager is dead, shouldn't we stop assigning work to that node?
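The `ps aux | grep` check above can also be done programmatically, which makes it easy to compare process liveness against what the client table claims. A small sketch using `pgrep -f` (which exits 0 iff at least one process command line matches the pattern); `process_running` is a hypothetical helper:

```python
import subprocess

def process_running(pattern):
    """Return True iff some process command line matches `pattern`.

    pgrep -f matches against the full command line and returns exit
    status 0 when at least one process matched.
    """
    result = subprocess.run(['pgrep', '-f', pattern],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

print(process_running('plasma_manager'))
```

In my case this returns True on the node, even though the client table says the plasma manager is deleted.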