[Data] Refactor _AutoscalingCoordinatorActor to use direct threading instead of self-referential remote calls #60190
Description
Refactor the _AutoscalingCoordinatorActor class to perform periodic operations using multi-threading and direct method calls with proper locking, rather than the current pattern of making remote calls to itself via its own actor handle.
Background
The _AutoscalingCoordinatorActor (defined in python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py) is responsible for coordinating autoscaling resource requests from different Ray Data components. It performs periodic operations (merging requests, updating cluster resources, reallocating resources) via a background thread.
Currently, the actor uses a self-referential remote call pattern to achieve thread-safety:
    # Lines 261-273 in default_autoscaling_coordinator.py
    if ray.is_initialized():
        self._self_handle = ray.get_runtime_context().current_actor

        def tick_thread_run():
            while True:
                time.sleep(self.TICK_INTERVAL_S)
                ray.get(self._self_handle.tick.remote())  # Remote call to self

        self._tick_thread = threading.Thread(target=tick_thread_run, daemon=True)
        self._tick_thread.start()

This pattern routes the tick() call through Ray's actor task queue, which serializes calls and avoids the need for explicit locking.
The same pattern exists in AutoscalingRequester (python/ray/data/_internal/execution/autoscaling_requester.py, lines 38-49), though that class is outside the scope of this issue.
Motivation
The self-referential remote call pattern goes through Ray Core's RPC infrastructure, which can time out under load; Ray Core timeout errors have been observed for these internal calls. The extra indirection also makes the code harder to debug.
A simpler approach using a threading lock would:
- Make the code easier to debug
- Eliminate potential Ray Core timeout errors for internal operations
- Be more straightforward to test
Implementation Boundaries & Constraints
- Focus only on _AutoscalingCoordinatorActor in this issue; do not change AutoscalingRequester
- Ensure thread-safety: the background tick thread and the main actor methods (request_resources, cancel_request, get_allocated_resources) can be called concurrently
- Maintain the existing tick interval behavior (every 20 seconds)
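One possible shape for the lock-based approach is sketched below. This is not the actual Ray implementation: LockedCoordinatorSketch is a hypothetical stand-in, the method signatures are simplified (the real methods take richer arguments), the "grant every request in full" policy is a placeholder, and the stop event is an assumption added so the thread can be shut down cleanly in tests. The point it illustrates is that the tick thread calls tick() directly and every method that touches shared state takes the same lock.

```python
import threading
from typing import Dict, Optional


class LockedCoordinatorSketch:
    """Hypothetical stand-in for _AutoscalingCoordinatorActor (not Ray's code)."""

    TICK_INTERVAL_S = 20  # tick interval stated in the issue

    def __init__(self, start_thread: bool = True):
        # One lock guards all state shared by actor methods and the tick thread.
        self._lock = threading.Lock()
        self._requests: Dict[str, int] = {}   # requester_id -> requested CPUs (simplified)
        self._allocated: Dict[str, int] = {}  # requester_id -> granted CPUs (simplified)
        self._stop_event = threading.Event()  # assumption: added for clean shutdown
        self._tick_thread: Optional[threading.Thread] = None
        if start_thread:
            self._tick_thread = threading.Thread(
                target=self._tick_thread_run, daemon=True
            )
            self._tick_thread.start()

    def _tick_thread_run(self) -> None:
        # Event.wait returns False on timeout and True once stop() sets the event,
        # so this ticks every TICK_INTERVAL_S seconds until stopped.
        while not self._stop_event.wait(self.TICK_INTERVAL_S):
            self.tick()  # direct call, no remote call through the actor handle

    def tick(self) -> None:
        with self._lock:
            self._reallocate_locked()

    def request_resources(self, requester_id: str, num_cpus: int) -> None:
        with self._lock:
            self._requests[requester_id] = num_cpus
            self._reallocate_locked()

    def cancel_request(self, requester_id: str) -> None:
        with self._lock:
            self._requests.pop(requester_id, None)
            self._reallocate_locked()

    def get_allocated_resources(self, requester_id: str) -> int:
        with self._lock:
            return self._allocated.get(requester_id, 0)

    def _reallocate_locked(self) -> None:
        # Caller must hold self._lock. Placeholder policy: grant requests in full.
        self._allocated = dict(self._requests)

    def stop(self) -> None:
        self._stop_event.set()
        if self._tick_thread is not None:
            self._tick_thread.join()
```

Because tick() is an ordinary method here, a unit test can construct the object with start_thread=False and call tick() directly, without a running Ray cluster. Helpers with the _locked suffix assume the caller already holds the lock, which avoids re-acquiring a non-reentrant threading.Lock.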