[Core][Bug] ray nodes get into a bad state and actor can't be scheduled #19207

@scv119

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Getting some more data on at least one form of “ray nodes get into a bad state”. It looks like, for some reason, my node’s IP resource is being treated as full when the node is in fact idle:

2021-10-06 23:15:51,415	WARNING worker.py:1231 -- The actor or task with ID XXX cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {node:XX: 0.056200}
Available resources on this node: {X/X CPU, X GiB/X GiB memory, X/X GPU, X GiB/X GiB object_store_memory, 1.000000/1.000000 node:XX}

When I call ray.available_resources(), I do see 1.0 available for that node’s IP resource, yet ray status shows the actors destined for that node as pending.
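To make the mismatch concrete, here is a minimal sketch that compares a resource request against a snapshot of available resources. The helper name `unmet_resources` and the tolerance `EPSILON` are hypothetical and not part of Ray. The values are copied from the warning above, with the redacted IP left as "XX".

```python
from typing import Dict, Tuple

# Tolerance for floating-point comparison (an assumption, not a Ray constant).
EPSILON = 1e-9


def unmet_resources(
    required: Dict[str, float], available: Dict[str, float]
) -> Dict[str, Tuple[float, float]]:
    """Return {name: (needed, free)} for every requirement that does not fit.

    Hypothetical helper: a resource missing from `available` counts as 0.0 free.
    """
    unmet = {}
    for name, needed in required.items():
        free = available.get(name, 0.0)
        if needed > free + EPSILON:
            unmet[name] = (needed, free)
    return unmet


# Values from the warning above; "XX" stands in for the redacted node IP.
required = {"node:XX": 0.0562}
available = {"node:XX": 1.0}

# An empty dict means the request fits the reported free resources.
print(unmet_resources(required, available))  # prints {}
```

On a live cluster, the `available` argument could be the dict returned by ray.available_resources(). An empty result while the actor is still pending would match the inconsistency reported here.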

Reproduction script

https://anyscaleteam.slack.com/archives/C027L220V0V/p1633638823184200?thread_ts=1633562366.168000&cid=C027L220V0V

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
