[Data] Refactor `_AutoscalingCoordinatorActor` to use direct threading instead of self-referential remote calls

**Description**

Refactor the `_AutoscalingCoordinatorActor` class to perform periodic operations using multi-threading and direct method calls with proper locking, rather than the current pattern of making remote calls to itself via its own actor handle.

**Background**

The `_AutoscalingCoordinatorActor` (defined in `python/ray/data/_internal/cluster_autoscaler/default_autoscaling_coordinator.py`) is responsible for coordinating autoscaling resource requests from different Ray Data components. It performs periodic operations (merging requests, updating cluster resources, reallocating resources) via a background thread.

Currently, the actor uses a self-referential remote call pattern to achieve thread-safety:

```python
# Lines 261-273 in default_autoscaling_coordinator.py
if ray.is_initialized():
    self._self_handle = ray.get_runtime_context().current_actor

    def tick_thread_run():
        while True:
            time.sleep(self.TICK_INTERVAL_S)
            ray.get(self._self_handle.tick.remote())  # Remote call to self

    self._tick_thread = threading.Thread(target=tick_thread_run, daemon=True)
    self._tick_thread.start()
```

This pattern routes the `tick()` call through Ray's actor task queue, which serializes calls and avoids the need for explicit locking.

The same pattern exists in `AutoscalingRequester` (`python/ray/data/_internal/execution/autoscaling_requester.py`, lines 38-49), though that class is outside the scope of this issue.

**Motivation**

The self-referential remote call pattern goes through Ray Core's RPC infrastructure, which can time out under load (Ray Core timeout errors have been observed). It also makes the code harder to debug, because each tick is dispatched as an actor task rather than a plain method call.

A simpler approach using a threading lock would:
- Make the code easier to debug
- Eliminate potential Ray Core timeout errors for internal operations
- Be more straightforward to test
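
As a rough illustration of the lock-based approach, here is a minimal sketch that does not depend on Ray. `CoordinatorSketch`, its simplified method signatures, the `_requests` and `_tick_count` fields, and the `stop()` helper are all hypothetical and exist only for this example; they are not the real `_AutoscalingCoordinatorActor` API. The sketch also swaps `time.sleep` for `threading.Event.wait` so the loop can be stopped cleanly, which is a design choice of this example rather than something the current code does.

```python
import threading


class CoordinatorSketch:
    """Hypothetical stand-in for _AutoscalingCoordinatorActor (not the real class)."""

    TICK_INTERVAL_S = 20  # tick interval stated in this issue

    def __init__(self, tick_interval_s=None):
        # Allow tests to shorten the interval; default to the class constant.
        self._tick_interval_s = (
            self.TICK_INTERVAL_S if tick_interval_s is None else tick_interval_s
        )
        # One lock guards all state shared by the tick thread and actor methods.
        self._lock = threading.Lock()
        self._requests = {}   # requester_id -> resources (illustrative state)
        self._tick_count = 0  # illustrative counter so ticks are observable
        self._stop = threading.Event()
        self._tick_thread = threading.Thread(
            target=self._tick_thread_run, daemon=True
        )
        self._tick_thread.start()

    def _tick_thread_run(self):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._tick_interval_s):
            self.tick()  # direct method call, no remote call to self

    def tick(self):
        with self._lock:
            # The real periodic work (merge requests, update cluster
            # resources, reallocate) would go here.
            self._tick_count += 1

    def request_resources(self, requester_id, resources):
        with self._lock:
            self._requests[requester_id] = resources

    def cancel_request(self, requester_id):
        with self._lock:
            self._requests.pop(requester_id, None)

    def stop(self):
        # Hypothetical helper for tests: end the loop and wait for the thread.
        self._stop.set()
        self._tick_thread.join()
```

Because `tick()` is now an ordinary method, a test can call it directly or shorten the interval through the constructor. If `tick()` ever needs to call another method that takes the same lock, `threading.RLock` would be needed instead of `threading.Lock` to avoid self-deadlock.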

**Implementation Boundaries & Constraints**

- Focus only on `_AutoscalingCoordinatorActor` in this issue; do not change `AutoscalingRequester`
- Ensure thread-safety: the background tick thread can run concurrently with the main actor methods (`request_resources`, `cancel_request`, `get_allocated_resources`), so every access to shared state must be guarded by the lock
- Maintain the existing tick interval behavior (every 20 seconds)

