[Data] Refactor _AutoscalingCoordinatorActor to use direct threading instead of self-referential remote calls #60190

@bveeramani

Description

Refactor the _AutoscalingCoordinatorActor class to perform periodic operations using multi-threading and direct method calls with proper locking, rather than the current pattern of making remote calls to itself via its own actor handle.

Background

The _AutoscalingCoordinatorActor (defined in python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py) is responsible for coordinating autoscaling resource requests from different Ray Data components. It performs periodic operations (merging requests, updating cluster resources, reallocating resources) via a background thread.

Currently, the actor uses a self-referential remote call pattern to achieve thread-safety:

# Lines 261-273 in default_autoscaling_coordinator.py
if ray.is_initialized():
    self._self_handle = ray.get_runtime_context().current_actor

    def tick_thread_run():
        while True:
            time.sleep(self.TICK_INTERVAL_S)
            ray.get(self._self_handle.tick.remote())  # Remote call to self

    self._tick_thread = threading.Thread(target=tick_thread_run, daemon=True)
    self._tick_thread.start()

This pattern routes the tick() call through Ray's actor task queue, which serializes calls and avoids the need for explicit locking.

The same pattern exists in AutoscalingRequester (python/ray/data/_internal/execution/autoscaling_requester.py, lines 38-49), though that class is outside the scope of this issue.

Motivation

The self-referential remote call pattern goes through Ray Core's RPC infrastructure, which can time out under load (Ray Core timeout errors have been observed). It can also make the code harder to debug.

A simpler approach using a threading lock would:

  • Make the code easier to debug
  • Eliminate potential Ray Core timeout errors for internal operations
  • Be more straightforward to test
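
A minimal sketch of what the lock-based approach could look like, using only the standard library. `CoordinatorSketch`, its `_requests` table, `get_total`, and `stop` are hypothetical names for illustration, and the real merge/reallocate logic is replaced by a simple sum. It also swaps `time.sleep` for `threading.Event.wait` so the loop can be stopped cleanly; that is an assumption of this sketch, not something the current code does.

```python
import threading


class CoordinatorSketch:
    """Hypothetical stand-in for the coordinator; not the real Ray class."""

    TICK_INTERVAL_S = 20

    def __init__(self, tick_interval_s=None):
        self._interval = (
            self.TICK_INTERVAL_S if tick_interval_s is None else tick_interval_s
        )
        # One lock guards all state shared between the tick thread and callers.
        self._lock = threading.Lock()
        self._requests = {}  # requester_id -> requested CPUs (illustrative only)
        self._total = 0
        self._stop = threading.Event()
        self._tick_thread = threading.Thread(
            target=self._tick_thread_run, daemon=True
        )
        self._tick_thread.start()

    def _tick_thread_run(self):
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop ticks every interval until it is asked to stop.
        while not self._stop.wait(self._interval):
            self.tick()  # direct method call, no remote call to self

    def tick(self):
        with self._lock:
            self._total = sum(self._requests.values())

    def request_resources(self, requester_id, cpus):
        with self._lock:
            self._requests[requester_id] = cpus

    def cancel_request(self, requester_id):
        with self._lock:
            self._requests.pop(requester_id, None)

    def get_total(self):
        with self._lock:
            return self._total

    def stop(self):
        self._stop.set()
        self._tick_thread.join()
```

Because `tick()` is an ordinary method here, a test can call it directly and never has to wait for the background thread or go through an actor handle.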

Implementation Boundaries & Constraints

  • Focus only on _AutoscalingCoordinatorActor in this issue; do not change AutoscalingRequester
  • Ensure thread-safety: the background tick thread and the main actor methods (request_resources, cancel_request, get_allocated_resources) can run concurrently, so all state they share must be protected
  • Maintain the existing tick interval behavior (every 20 seconds)
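
One way the thread-safety constraint could be smoke-tested is sketched below. `TinyCoordinator` and `run_stress` are hypothetical and model only a request table and a merged total, not the real actor. A stress test like this cannot prove the absence of races; it only checks that the final state is consistent after heavy interleaving.

```python
import threading


class TinyCoordinator:
    """Hypothetical stand-in; models only a request table and a merged total."""

    def __init__(self):
        self._lock = threading.Lock()
        self._requests = {}
        self._merged_total = 0

    def request_resources(self, requester_id, cpus):
        with self._lock:
            self._requests[requester_id] = cpus

    def cancel_request(self, requester_id):
        with self._lock:
            self._requests.pop(requester_id, None)

    def tick(self):
        with self._lock:
            self._merged_total = sum(self._requests.values())

    def get_allocated_resources(self):
        with self._lock:
            return self._merged_total


def run_stress(num_threads=4, requests_per_thread=200):
    coord = TinyCoordinator()
    stop = threading.Event()

    def ticker():
        # Tick in a tight loop to maximize interleaving with the callers.
        while not stop.is_set():
            coord.tick()

    def caller(worker):
        for i in range(requests_per_thread):
            coord.request_resources((worker, i), 1)
            if i % 2:
                # Cancel odd-numbered requests so only even ones remain.
                coord.cancel_request((worker, i))

    tick_thread = threading.Thread(target=ticker, daemon=True)
    tick_thread.start()
    workers = [
        threading.Thread(target=caller, args=(w,)) for w in range(num_threads)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    stop.set()
    tick_thread.join()
    coord.tick()  # final tick so the total reflects the settled request table
    return coord.get_allocated_resources()
```

With the defaults, each of the 4 workers leaves 100 one-CPU requests in place, so `run_stress()` should return 400.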

Metadata

Labels

  • P1: Issue that should be fixed within a few weeks
  • data: Ray Data-related issues
