Description
The cluster autoscaler v2 computes utilization as global_usage / global_limits. When a user sets execution_options.resource_limits, global_limits is capped at the user-specified value. However, global_usage is calculated from the resources actually in use by operators, which can approach or exceed the user's limits. As a result, utilization stays at or near 100%, causing unbounded autoscaling even when the cluster has plenty of capacity.
The fix is to compute utilization relative to the total cluster resources (via a callback) rather than the global limits, ensuring autoscaling decisions reflect actual cluster utilization.
Background
The autoscaler v2 (DefaultClusterAutoscalerV2) triggers scale-up when utilization exceeds 75%. The utilization is calculated in RollingLogicalUtilizationGauge.observe():
```python
def observe(self):
    global_usage = self._resource_manager.get_global_usage()
    global_limits = self._resource_manager.get_global_limits()
    cpu_util = save_div(global_usage.cpu, global_limits.cpu)
    # ... same for gpu, object_store_memory
```

The global_limits is computed in ResourceManager.get_global_limits():
```python
def get_global_limits(self):
    default_limits = self._options.resource_limits  # User-specified limits
    total_resources = self._get_total_resources()  # From AutoscalingCoordinator
    # ...
    self._global_limits = default_limits.min(total_resources).subtract(exclude)
```

When a user sets resource_limits (e.g., resource_limits=ExecutionResources(cpu=8)), the global_limits is capped to that value. If the dataset consumes all 8 CPUs, utilization = 8/8 = 100%, triggering autoscaling indefinitely, even if the cluster has 100+ CPUs available.
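To make the arithmetic concrete, here is a minimal sketch that compares the two denominators for the 8-CPU example. The Resources class, safe_ratio helper, and the 100-CPU cluster size are illustrative assumptions, not Ray's actual types or values; only the 75% threshold and the cpu=8 limit come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for ExecutionResources (CPU only, for brevity)."""
    cpu: float


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


SCALE_UP_THRESHOLD = 0.75  # the 75% scale-up trigger described above

usage = Resources(cpu=8)            # the dataset uses all 8 CPUs it is allowed
user_limits = Resources(cpu=8)      # resource_limits=ExecutionResources(cpu=8)
cluster_total = Resources(cpu=100)  # assumed cluster size for this example

# Current behavior: the denominator is capped by the user's limit.
capped_limit = min(user_limits.cpu, cluster_total.cpu)
util_vs_limits = safe_ratio(usage.cpu, capped_limit)

# Proposed behavior: the denominator is the total cluster resources.
util_vs_cluster = safe_ratio(usage.cpu, cluster_total.cpu)

print("vs limits: ", util_vs_limits, util_vs_limits > SCALE_UP_THRESHOLD)
print("vs cluster:", util_vs_cluster, util_vs_cluster > SCALE_UP_THRESHOLD)
```

With the capped denominator the ratio crosses the scale-up threshold, while the cluster-wide denominator leaves it well below.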
Implementation Boundaries & Constraints
- Where to change: The RollingLogicalUtilizationGauge class in resource_utilization_gauge.py.
- Approach: Add a get_total_resources callback parameter to RollingLogicalUtilizationGauge.__init__(). Use this callback to get total cluster resources for the utilization denominator instead of global_limits. By default, the callback should return the autoscaler's get_total_resources.
- Preserve global_limits semantics: The global_limits used by the ResourceManager for throttling operator execution should remain unchanged. This fix only affects the utilization calculation for autoscaling decisions.
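The approach above could be sketched roughly as follows. This is not the actual RollingLogicalUtilizationGauge: UtilizationGaugeSketch, FakeResourceManager, Resources, and safe_ratio are hypothetical stand-ins, and the rolling-window averaging is omitted. The sketch only shows the callback supplying the denominator while get_global_limits() is left untouched.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for ExecutionResources."""
    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0


class FakeResourceManager:
    """Hypothetical stand-in exposing only the methods named in this issue."""

    def __init__(self, usage: Resources, limits: Resources):
        self._usage = usage
        self._limits = limits

    def get_global_usage(self) -> Resources:
        return self._usage

    def get_global_limits(self) -> Resources:
        # Still available for throttling; the gauge below never reads it.
        return self._limits


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


class UtilizationGaugeSketch:
    """Computes utilization against total cluster resources (no rolling window)."""

    def __init__(
        self,
        resource_manager: FakeResourceManager,
        get_total_resources: Callable[[], Resources],
    ):
        self._resource_manager = resource_manager
        self._get_total_resources = get_total_resources

    def observe(self) -> Resources:
        usage = self._resource_manager.get_global_usage()
        total = self._get_total_resources()  # denominator comes from the callback
        return Resources(
            cpu=safe_ratio(usage.cpu, total.cpu),
            gpu=safe_ratio(usage.gpu, total.gpu),
            object_store_memory=safe_ratio(
                usage.object_store_memory, total.object_store_memory
            ),
        )


# Usage: 8 CPUs in use, user limit of 8, assumed cluster total of 100 CPUs.
manager = FakeResourceManager(usage=Resources(cpu=8), limits=Resources(cpu=8))
gauge = UtilizationGaugeSketch(manager, get_total_resources=lambda: Resources(cpu=100))
utilization = gauge.observe()
print(utilization.cpu)
```

Passing the denominator as a callable keeps the gauge decoupled from wherever cluster totals come from, and makes it straightforward to substitute a fixed value in tests.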