[Core][Bug] ray nodes get into a bad state and actor can't be scheduled #19207
Closed
Labels
P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
Description
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
I have some more data on at least one form of “Ray nodes get into a bad state”. For some reason, my node’s IP resource is being treated as full when the node is in fact idle:
2021-10-06 23:15:51,415 WARNING worker.py:1231 -- The actor or task with ID XXX cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {node:XX: 0.056200}
Available resources on this node: {X/X CPU, X GiB/X GiB memory, X/X GPU, X GiB/X GiB object_store_memory, 1.000000/1.000000 node:XX}
When I call ray.available_resources(), I do see 1.0 available for that node’s IP resource. ray status shows the actors destined for that node as pending.
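To make the mismatch concrete, here is a minimal sketch of the comparison described above: check a required-resources dict against an available-resources dict and report any shortfall. The helper name `unmet_resources` and the IP `10.0.0.1` are hypothetical placeholders, not part of Ray or of the original report; the numbers mirror the warning message.

```python
from typing import Dict


def unmet_resources(
    required: Dict[str, float], available: Dict[str, float]
) -> Dict[str, float]:
    """Return the shortfall for each required resource that `available` cannot cover."""
    shortfall = {}
    for name, amount in required.items():
        have = available.get(name, 0.0)  # a missing key means none is available
        if have < amount:
            shortfall[name] = amount - have
    return shortfall


# Hypothetical values mirroring the warning above; the IP is a placeholder.
required = {"node:10.0.0.1": 0.0562}
available = {"CPU": 8.0, "node:10.0.0.1": 1.0}

print(unmet_resources(required, available))  # prints {}
```

On a live cluster, the `available` dict could instead come from `ray.available_resources()` after connecting with `ray.init(address="auto")`. An empty result while the actor stays pending would match the symptom reported here: the resources appear free, yet the scheduler does not place the actor.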
Reproduction script
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!