[Autoscaler][v1] Autoscaler launches extra nodes despite fulfilled resource demand #52864
What happened + What you expected to happen
Summary:
We have set up Ray with a custom external node provider and are running the autoscaler as a separate process. The goal is for the autoscaler to detect resource demands from a Ray Serve deployment and request a specific node type from the external node provider accordingly.
Configuration:
- The cluster config defines multiple node types, including ray.worker.4090.standard, ray.worker.4090.highmem, and ray.worker.4090.ultra.
- A Ray Serve deployment requests the following resources:
autoscaler-config.yaml:

```yaml
available_node_types:
  ray.worker.4090.standard:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}
    node_config: {}
  ray.worker.4090.highmem:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24}
    node_config: {}
  ray.worker.4090.ultra:
    min_workers: 0
    max_workers: 5
    resources: {"CPU": 32, "GPU": 1, "memory": 130997290496, "gram": 24}
    node_config: {}
```

Serve code:
```python
from ray import serve

# for ray.worker.4090.standard
@serve.deployment(ray_actor_options={
    "num_cpus": 16, "num_gpus": 1,
    "memory": 30107260928, "resources": {"gram": 24},
})
def CustomResourceTask(*args):
    return "ray.worker.4090.standard"

serve.run(CustomResourceTask.bind())
print("Requested additional resources...")
```

Expected Behavior:
- When the Python code is executed, the autoscaler detects the demand and requests a single node of type ray.worker.4090.standard from the node provider, since the requested bundle ({"num_cpus": 16, "num_gpus": 1, "memory": 30107260928, "resources": {"gram": 24}}) exactly matches that node type.
- Only that node type should be launched, as it satisfies the resource requirements.
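The matching we expect the autoscaler to perform can be sketched as a per-bundle feasibility check: pick the first (smallest) node type whose resources cover the requested bundle. This is only an illustration of the expectation, not Ray's actual scheduling code; `fits` and `NODE_TYPES` are hypothetical names:

```python
# Hypothetical sketch of the node-type matching described above;
# not the Ray autoscaler's real implementation.

NODE_TYPES = {
    "ray.worker.4090.standard": {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24},
    "ray.worker.4090.highmem": {"CPU": 16, "GPU": 1, "memory": 62277025792, "gram": 24},
    "ray.worker.4090.ultra": {"CPU": 32, "GPU": 1, "memory": 130997290496, "gram": 24},
}

def fits(node_resources: dict, bundle: dict) -> bool:
    """True if the node type has enough of every requested resource."""
    return all(node_resources.get(k, 0) >= v for k, v in bundle.items())

demand = {"CPU": 16, "GPU": 1, "memory": 30107260928, "gram": 24}

# All three node types are feasible, but only one node should be launched:
# the first (smallest) feasible type.
feasible = [name for name, res in NODE_TYPES.items() if fits(res, demand)]
print(feasible[0])  # ray.worker.4090.standard
```

Under this logic a single ray.worker.4090.standard launch fully covers the demand, which is why the second launch request is surprising.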
Actual Behavior:
- The autoscaler does initially request a ray.worker.4090.standard node as expected.
- However, immediately after, it also sends a launch request for a higher-spec node type (ray.worker.4090.highmem), even though the demand was already satisfied by the standard node.
- When the deployment directly requests the highest-spec node type (e.g., ultra), two nodes are requested by the autoscaler.
Observation:
- This issue does not occur if the appropriate node (ray.worker.4090.standard) already exists in the cluster before the actor is scheduled.
- In that case, the actor gets scheduled correctly, and no additional autoscaling occurs.
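The observation above is consistent with a pending-node accounting gap: if the autoscaler re-evaluates demand before the first launch is reflected in cluster state, the same bundle gets bin-packed a second time. A toy illustration of that hypothesis (assumed behavior, not Ray's code; `nodes_to_launch` is a made-up helper):

```python
def nodes_to_launch(demand_bundles, pending_nodes, count_pending: bool) -> int:
    """Toy bin-packing: launch one node per bundle not covered by pending capacity."""
    uncovered = list(demand_bundles)
    if count_pending:
        # Each pending node absorbs one outstanding bundle.
        uncovered = uncovered[len(pending_nodes):]
    return len(uncovered)

demand = [{"CPU": 16, "GPU": 1}]

# Round 1: nothing pending yet -> one launch, correct either way.
print(nodes_to_launch(demand, pending_nodes=[], count_pending=True))   # 1

# Round 2: one standard node already pending.
print(nodes_to_launch(demand, ["standard"], count_pending=True))   # 0: demand satisfied
print(nodes_to_launch(demand, ["standard"], count_pending=False))  # 1: spurious second launch
```

If pending nodes are counted, the second evaluation round launches nothing; if they are not (or the state has not propagated yet), the same demand triggers a second launch, matching the behavior reported here.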
Could this be a race condition between the autoscaler and the node provider, where the updated resource availability has not yet propagated before the scheduler makes a decision?
Or is it a scheduling policy issue, where the autoscaler aggressively launches additional nodes before confirming that the demand has been fulfilled?
Any insights on how to avoid this type of over-provisioning (e.g., through autoscaler delay settings, demand evaluation thresholds, or conservative scheduling options) would be greatly appreciated.
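If the over-provisioning is driven by aggressive upscaling, the v1 cluster config exposes a couple of knobs that may soften it: `upscaling_speed` limits how many launches may be pending relative to the current cluster size (Ray's docs note a minimum pending-launch floor still applies regardless), and `idle_timeout_minutes` controls how quickly an unneeded node is reclaimed. Whether these prevent this specific double-launch is untested; the values below are illustrative:

```yaml
# Fragment for the same autoscaler-config.yaml; values are illustrative.
upscaling_speed: 0.1       # grow conservatively instead of the 1.0 default
idle_timeout_minutes: 1    # reclaim the surplus highmem node quickly once idle
```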
Versions / Dependencies
ray 2.45.0
Reproduction script
Actor list (After Serve Deployment)
```
Stats:
------------------------------
Total: 4
Table:
------------------------------
    ACTOR_ID                          CLASS_NAME                               STATE  JOB_ID    NAME                                                      NODE_ID                                                   PID   RAY_NAMESPACE
0   1cb2b2b79fbc11225b59313d01000000  ServeController                          ALIVE  01000000  SERVE_CONTROLLER_ACTOR                                    8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317  1148  serve
1   5130f9e62717ca880d64c62701000000  ProxyActor                               ALIVE  01000000  SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b  7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b  430   serve
2   816b5cd4454e58298016f3b501000000  ServeReplica:default:CustomResourceTask  ALIVE  01000000  SERVE_REPLICA::default#CustomResourceTask#nBaiFX          7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b  253   serve
3   dd8062a92fd4d7eeca1757c701000000  ProxyActor                               ALIVE  01000000  SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317  8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317  1199  serve
```
Node list (After Serve Deployment)
```
Stats:
------------------------------
Total: 2
Table:
------------------------------
    NODE_ID                                                   NODE_IP      IS_HEAD_NODE  STATE  NODE_NAME    RESOURCES_TOTAL                 LABELS
0   7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b  172.28.0.16  False         ALIVE  172.28.0.16  CPU: 16.0                       ray.io/node_id: 7fd36d41eb420d38e6bf122401df4fe88d4c29f2a4fd52d6d8783f0b
                                                                                                            GPU: 1.0
                                                                                                            gram: 24.0
                                                                                                            memory: 28.040 GiB
                                                                                                            node:172.28.0.16: 1.0
                                                                                                            object_store_memory: 4.595 GiB
1   8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317  172.28.0.10  True          ALIVE  172.28.0.10  CPU: 16.0                       ray.io/node_id: 8e07ab05c55f81ea3e84d30f12da275848e10d958c8100787eb9e317
                                                                                                            memory: 9.094 GiB
                                                                                                            node:172.28.0.10: 1.0
                                                                                                            node:__internal_head__: 1.0
                                                                                                            object_store_memory: 4.547 GiB
```
Issue Severity
High: It blocks me from completing my task.