[Data] Fix cluster autoscaler v2 utilization calculation when resource_limits is set #60085

@bveeramani

Description

The cluster autoscaler v2 computes utilization as global_usage / global_limits. When a user sets execution_options.resource_limits, global_limits is capped to the user-specified value. However, global_usage is calculated from the resources that operators actually have in use, which can approach or exceed the user's limits. As a result, utilization sits at roughly 100% whenever the dataset saturates its limits, and the autoscaler keeps scaling up without bound even when the cluster has plenty of spare capacity.

The fix is to compute utilization relative to the total cluster resources (via a callback) rather than the global limits, ensuring autoscaling decisions reflect actual cluster utilization.

Background

The autoscaler v2 (DefaultClusterAutoscalerV2) triggers scale-up when utilization exceeds 75%. The utilization is calculated in RollingLogicalUtilizationGauge.observe():

def observe(self):
    global_usage = self._resource_manager.get_global_usage()
    global_limits = self._resource_manager.get_global_limits()

    cpu_util = save_div(global_usage.cpu, global_limits.cpu)
    # ... same for gpu, object_store_memory

The global_limits is computed in ResourceManager.get_global_limits():

def get_global_limits(self):
    default_limits = self._options.resource_limits   # User-specified limits
    total_resources = self._get_total_resources()    # From AutoscalingCoordinator
    # ...
    self._global_limits = default_limits.min(total_resources).subtract(exclude)

When a user sets resource_limits (e.g., resource_limits=ExecutionResources(cpu=8)), global_limits is capped to that value. If the dataset consumes all 8 CPUs, utilization = 8/8 = 100%, which exceeds the 75% threshold and keeps triggering scale-up indefinitely, even if the cluster has 100+ CPUs available.
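The arithmetic above can be reproduced in isolation. The following is a minimal sketch, not Ray code: `Resources` is a hypothetical CPU-only stand-in for ExecutionResources, `safe_ratio` is an illustrative helper, and the 100-CPU cluster size is an assumed example value. Only the 75% threshold and the 8-CPU limit come from this issue.

```python
from dataclasses import dataclass

SCALE_UP_THRESHOLD = 0.75  # scale-up threshold stated in this issue


@dataclass(frozen=True)
class Resources:
    """Hypothetical CPU-only stand-in for ExecutionResources."""

    cpu: float

    def min(self, other: "Resources") -> "Resources":
        # Element-wise minimum, mirroring how limits are capped.
        return Resources(cpu=min(self.cpu, other.cpu))


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


user_limits = Resources(cpu=8)      # execution_options.resource_limits
cluster_total = Resources(cpu=100)  # assumed cluster size for illustration
usage = Resources(cpu=8)            # dataset saturates its 8-CPU budget

# The user limit caps the denominator at 8 CPUs.
global_limits = user_limits.min(cluster_total)

util_vs_limits = safe_ratio(usage.cpu, global_limits.cpu)
util_vs_total = safe_ratio(usage.cpu, cluster_total.cpu)

# Current behavior: 8/8 is above the threshold, so scale-up is requested.
print(util_vs_limits, util_vs_limits > SCALE_UP_THRESHOLD)
# Proposed behavior: 8/100 is below the threshold, so no scale-up.
print(util_vs_total, util_vs_total > SCALE_UP_THRESHOLD)
```

With the capped denominator the gauge reports full utilization; measured against the whole cluster, the same usage is well below the threshold.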

Implementation Boundaries & Constraints

  • Where to change: The RollingLogicalUtilizationGauge class in resource_utilization_gauge.py.

  • Approach: Add a get_total_resources callback parameter to RollingLogicalUtilizationGauge.__init__(). Use this callback to get total cluster resources for the utilization denominator instead of global_limits. By default, the callback should delegate to the autoscaler's get_total_resources.

  • Preserve global_limits semantics: The global_limits used by the ResourceManager for throttling operator execution should remain unchanged. This fix only affects the utilization calculation for autoscaling decisions.
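One possible shape for the approach above, as a hedged sketch rather than the actual patch: `UtilizationGaugeSketch`, `FakeResourceManager`, `Resources`, and `safe_ratio` are hypothetical names introduced only for illustration, and the rolling-window averaging of the real gauge is omitted. The only elements taken from this issue are the get_total_resources callback and the get_global_usage / get_global_limits accessors.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for ExecutionResources."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


class FakeResourceManager:
    """Hypothetical resource manager exposing only the two accessors used here."""

    def __init__(self, usage: Resources, limits: Resources):
        self._usage = usage
        self._limits = limits

    def get_global_usage(self) -> Resources:
        return self._usage

    def get_global_limits(self) -> Resources:
        return self._limits


class UtilizationGaugeSketch:
    """Sketch of a gauge whose denominator comes from a callback."""

    def __init__(self, resource_manager, get_total_resources: Callable[[], Resources]):
        self._resource_manager = resource_manager
        self._get_total_resources = get_total_resources

    def observe(self) -> Resources:
        usage = self._resource_manager.get_global_usage()
        # Denominator is total cluster resources, not global_limits.
        total = self._get_total_resources()
        return Resources(
            cpu=safe_ratio(usage.cpu, total.cpu),
            gpu=safe_ratio(usage.gpu, total.gpu),
            object_store_memory=safe_ratio(
                usage.object_store_memory, total.object_store_memory
            ),
        )


# Usage: limits stay capped at 8 CPUs, but utilization is measured
# against an assumed 100-CPU cluster.
manager = FakeResourceManager(usage=Resources(cpu=8), limits=Resources(cpu=8))
gauge = UtilizationGaugeSketch(manager, get_total_resources=lambda: Resources(cpu=100))
util = gauge.observe()
print(util.cpu)
```

Passing the total as a callback keeps the gauge decoupled from where that number comes from, which also makes it easy to stub in tests. The resource manager's global_limits is never touched, so throttling behavior stays as it was.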

Labels: data (Ray Data-related issues)