
[Autoscaler][v1] Autoscaler launches extra nodes despite fulfilled resource demand #52864

@nadongjun

Description


What happened + What you expected to happen

Summary:

We have set up Ray with a custom external node provider and are running the autoscaler as a separate process. The goal is for the autoscaler to detect resource demands from a Ray Serve deployment and request a specific node type from the external node provider accordingly.

Configuration:

  • The cluster config defines multiple node types, including ray.worker.4090.standard, ray.worker.4090.highmem, and ray.worker.4090.ultra.
  • A Ray Serve deployment requests the following resources:

autoscaler-config.yaml

available_node_types:
  ray.worker.4090.standard:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}
    node_config: {}

  ray.worker.4090.highmem:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24}
    node_config: {}

  ray.worker.4090.ultra:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 32, "GPU": 1, "memory": 130997290496, "gram": 24}
    node_config: {}
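As a sanity check on the expected behavior, a simplified fit test (an illustration only, not Ray's actual bin-packing code in resource_demand_scheduler) shows that all three node types can hold the demand, but the standard type is the smallest one that does:

```python
def fits(node_resources, demand):
    """True if a single node of this type can satisfy the whole demand."""
    return all(node_resources.get(k, 0) >= v for k, v in demand.items())

# Declared resources per node type, copied from the config above.
node_types = {
    "ray.worker.4090.standard": {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24},
    "ray.worker.4090.highmem": {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24},
    "ray.worker.4090.ultra": {"CPU": 32, "GPU": 1, "memory": 130997290496, "gram": 24},
}

# The deployment's resource demand.
demand = {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}

feasible = [name for name, res in node_types.items() if fits(res, demand)]
# All three types fit; a correct scheduler should pick one standard node.
```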

serve code

# for ray.worker.4090.standard
from ray import serve


@serve.deployment(ray_actor_options={"num_cpus": 16, "num_gpus": 1, "memory": 30107260928, "resources": {"gram": 24}})
def CustomResourceTask(*args):
    return "ray.worker.4090.standard"


serve.run(CustomResourceTask.bind())

print("Requested additional resources...")
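For reference, the ray_actor_options above translate into a single resource demand bundle. The mapping below is hand-computed to illustrate the convention (num_cpus/num_gpus become CPU/GPU, custom resources are flattened in); it is not Ray's internal code:

```python
# The deployment's ray_actor_options, copied from the script above.
actor_options = {
    "num_cpus": 16,
    "num_gpus": 1,
    "memory": 30107260928,
    "resources": {"gram": 24},
}

# Hand-computed demand bundle as the autoscaler would see it.
demand_bundle = {
    "CPU": actor_options["num_cpus"],
    "GPU": actor_options["num_gpus"],
    "memory": actor_options["memory"],
    **actor_options["resources"],  # custom resources flattened in
}
# This bundle exactly equals ray.worker.4090.standard's declared
# resources, so a single standard node should satisfy it.
```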

Expected Behavior:

  • When the Python code is executed, the autoscaler detects the demand and requests a single node of type ray.worker.4090.standard (matching {"num_cpus": 16, "num_gpus": 1, "memory": 30107260928, "resources": {"gram": 24}}) from the node provider.
  • Only that node type should be launched, as it satisfies the resource requirements.

Actual Behavior:

  • The autoscaler does initially request a ray.worker.4090.standard node, as expected.
  • Immediately afterward, however, it also sends a launch request for a higher-spec node type (ray.worker.4090.highmem), even though the demand was already satisfied by the standard node.
  • When the deployment directly requests the highest-spec node type (ultra), the autoscaler requests two nodes.

Observation:

  • This issue does not occur if the appropriate node (ray.worker.4090.standard) already exists in the cluster before the actor is scheduled.
  • In that case, the actor gets scheduled correctly, and no additional autoscaling occurs.

Could this be a race condition between the autoscaler and the node provider, where the updated resource availability has not yet propagated before the scheduler makes a decision?
Or is it a scheduling policy issue, where the autoscaler aggressively launches additional nodes before confirming that the demand has been fulfilled?

Any insights on how to avoid this type of over-provisioning (e.g., through autoscaler delay settings, demand evaluation thresholds, or conservative scheduling options) would be greatly appreciated.
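One concrete set of knobs to experiment with: upscaling_speed and idle_timeout_minutes are both real top-level keys in the Ray cluster YAML. Whether they actually prevent the duplicate launch in this case is untested; this is only a sketch of a more conservative configuration:

```yaml
# Sketch: conservative autoscaler tuning in the cluster config.
# upscaling_speed bounds how many new nodes may be pending relative to
# the current node count; lower values make scale-up more conservative.
upscaling_speed: 0.0
# idle_timeout_minutes controls how quickly an unneeded extra node
# (like the spurious highmem node) is torn down again.
idle_timeout_minutes: 1
```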

Versions / Dependencies

ray 2.45.0

Reproduction script

Actor list (After Serve Deployment)

Stats:
------------------------------
Total: 4

Table:
------------------------------
    ACTOR_ID                          CLASS_NAME                               STATE      JOB_ID  NAME                                                                                               NODE_ID                                                     PID  RAY_NAMESPACE
 0  1cb2b2b79fbc11225b59313d01000000  ServeController                          ALIVE    01000000  SERVE_CONTROLLER_ACTOR                                                                             8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317   1148  serve
 1  5130f9e62717ca880d64c62701000000  ProxyActor                               ALIVE    01000000  SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b  7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b    430  serve
 2  816b5cd4454e58298016f3b501000000  ServeReplica:default:CustomResourceTask  ALIVE    01000000  SERVE_REPLICA::default#CustomResourceTask#nBaiFX                                                   7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b    253  serve
 3  dd8062a92fd4d7eeca1757c701000000  ProxyActor                               ALIVE    01000000  SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317  8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317   1199  serve

Node list (After Serve Deployment)

Stats:
------------------------------
Total: 2

Table:
------------------------------
    NODE_ID                                                   NODE_IP      IS_HEAD_NODE    STATE    NODE_NAME    RESOURCES_TOTAL                 LABELS
 0  7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b  172.28.0.16  False           ALIVE    172.28.0.16  CPU: 16.0                       ray.io/node_id: 7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b
                                                                                                                 GPU: 1.0
                                                                                                                 gram: 24.0
                                                                                                                 memory: 28.040 GiB
                                                                                                                 node:172.28.0.16: 1.0
                                                                                                                 object_store_memory: 4.595 GiB
 1  8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317  172.28.0.10  True            ALIVE    172.28.0.10  CPU: 16.0                       ray.io/node_id: 8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317
                                                                                                                 memory: 9.094 GiB
                                                                                                                 node:172.28.0.10: 1.0
                                                                                                                 node:__internal_head__: 1.0
                                                                                                                 object_store_memory: 4.547 GiB

Issue Severity

High: It blocks me from completing my task.

Metadata

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), community-backlog, core (issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), usability
