[Data] Fix cluster autoscaler v2 utilization calculation when `resource_limits` is set

**Description**

The cluster autoscaler v2 computes utilization as `global_usage / global_limits`. When a user sets `execution_options.resource_limits`, `global_limits` is capped to the user-specified value. However, `global_usage` is calculated from the resources that operators actually use, which can approach or exceed the user's limits. As a result, utilization stays at ~100%, causing unbounded autoscaling even when the cluster has plenty of spare capacity.

The fix is to compute utilization relative to the **total cluster resources** (via a callback) rather than the global limits, ensuring autoscaling decisions reflect actual cluster utilization.

**Background**

The autoscaler v2 ([`DefaultClusterAutoscalerV2`](https://github.com/ray-project/ray/blob/67bfeefa82/python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py#L52-L69)) triggers scale-up when utilization exceeds 75%. The utilization is calculated in [`RollingLogicalUtilizationGauge.observe()`](https://github.com/ray-project/ray/blob/67bfeefa82/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L45-L65):

```python
def observe(self):
    global_usage = self._resource_manager.get_global_usage()
    global_limits = self._resource_manager.get_global_limits()

    cpu_util = save_div(global_usage.cpu, global_limits.cpu)
    # ... same for gpu, object_store_memory
```

`global_limits` is computed in [`ResourceManager.get_global_limits()`](https://github.com/ray-project/ray/blob/67bfeefa82/python/ray/data/_internal/execution/resource_manager.py#L248-L271):

```python
def get_global_limits(self):
    default_limits = self._options.resource_limits   # User-specified limits
    total_resources = self._get_total_resources()    # From AutoscalingCoordinator
    # ...
    self._global_limits = default_limits.min(total_resources).subtract(exclude)
```

When a user sets `resource_limits` (e.g., `resource_limits=ExecutionResources(cpu=8)`), `global_limits` is capped to that value. If the dataset consumes all 8 CPUs, utilization = 8/8 = 100%, which exceeds the 75% threshold and triggers autoscaling indefinitely, even if the cluster has 100+ CPUs available.
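
The arithmetic can be reproduced without Ray. The sketch below is illustrative only: `Resources` is a hypothetical CPU-only stand-in for `ExecutionResources`, the `utilization` helper and its zero-denominator handling are assumptions, and the `exclude` subtraction from `get_global_limits()` is left out for brevity.

```python
from dataclasses import dataclass

SCALE_UP_THRESHOLD = 0.75  # scale-up threshold cited in this issue


@dataclass(frozen=True)
class Resources:
    """Hypothetical CPU-only stand-in for ExecutionResources."""

    cpu: float

    def min(self, other: "Resources") -> "Resources":
        # Element-wise minimum, mirroring default_limits.min(total_resources).
        return Resources(cpu=min(self.cpu, other.cpu))


def utilization(usage: Resources, denominator: Resources) -> float:
    """Usage over the denominator; a zero denominator yields 0.0 (assumed)."""
    return usage.cpu / denominator.cpu if denominator.cpu else 0.0


user_limits = Resources(cpu=8)      # resource_limits set by the user
cluster_total = Resources(cpu=100)  # total cluster resources
usage = Resources(cpu=8)            # the dataset consumes all 8 CPUs

# Current behavior: the denominator is capped at the user's limit.
global_limits = user_limits.min(cluster_total)
current = utilization(usage, global_limits)   # 1.0

# Proposed behavior: the denominator is the whole cluster.
proposed = utilization(usage, cluster_total)  # 0.08

print(current > SCALE_UP_THRESHOLD)   # scale-up is requested
print(proposed > SCALE_UP_THRESHOLD)  # no scale-up is requested
```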

**Implementation Boundaries & Constraints**

- **Where to change:** The `RollingLogicalUtilizationGauge` class in [`resource_utilization_gauge.py`](https://github.com/ray-project/ray/blob/67bfeefa82/python/ray/data/_internal/cluster_autoscaler/resource_utilization_gauge.py#L22-L73).

- **Approach:** Add a `get_total_resources` callback parameter to `RollingLogicalUtilizationGauge.__init__()`. Use this callback to get the total cluster resources for the utilization denominator instead of `global_limits`. By default, the callback should return the result of the autoscaler's `get_total_resources`.

- **Preserve global_limits semantics:** The `global_limits` used by the ResourceManager for *throttling* operator execution should remain unchanged. This fix only affects the utilization calculation for *autoscaling decisions*.
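
One possible shape for the change is sketched below. It is not the actual implementation: `Resources`, `StubResourceManager`, `UtilizationGaugeSketch`, and `ratio` are hypothetical names, and the sketch reports an instantaneous value rather than the rolling average that the real gauge maintains.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for ExecutionResources."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0


def ratio(numerator: float, denominator: float) -> float:
    """Divide, treating a zero denominator as zero utilization (assumed)."""
    return numerator / denominator if denominator else 0.0


class StubResourceManager:
    """Hypothetical stub exposing only the two methods relevant here."""

    def __init__(self, usage: Resources, limits: Resources):
        self._usage = usage
        self._limits = limits

    def get_global_usage(self) -> Resources:
        return self._usage

    def get_global_limits(self) -> Resources:
        # Still capped by resource_limits and still used for throttling;
        # the gauge below no longer reads it.
        return self._limits


class UtilizationGaugeSketch:
    """Instantaneous gauge whose denominator comes from a callback."""

    def __init__(
        self,
        resource_manager: StubResourceManager,
        get_total_resources: Callable[[], Resources],
    ):
        self._resource_manager = resource_manager
        self._get_total_resources = get_total_resources

    def observe(self) -> Dict[str, float]:
        usage = self._resource_manager.get_global_usage()
        total = self._get_total_resources()  # whole cluster, not global_limits
        return {
            "cpu": ratio(usage.cpu, total.cpu),
            "gpu": ratio(usage.gpu, total.gpu),
            "object_store_memory": ratio(
                usage.object_store_memory, total.object_store_memory
            ),
        }


manager = StubResourceManager(usage=Resources(cpu=8), limits=Resources(cpu=8))
gauge = UtilizationGaugeSketch(manager, get_total_resources=lambda: Resources(cpu=100))
util = gauge.observe()
print(util["cpu"])
```

Passing the denominator as a callback keeps the gauge decoupled from where the total comes from, which also makes it straightforward to inject a fixed value in tests.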


[Data] Fix cluster autoscaler v2 utilization calculation when `resource_limits` is set #60085
