
[2/3] queue-based autoscaling - add default queue-based autoscaling policy#59548

Merged
abrarsheikh merged 15 commits into master from queue-based-autoscaling-part-2
Feb 5, 2026

Conversation

@harshit-anyscale
Contributor

@harshit-anyscale harshit-anyscale commented Dec 18, 2025

Summary

This PR adds queue-based autoscaling support for async inference workloads in Ray Serve. It enables deployments to scale based on combined workload from both the message broker queue and HTTP requests.

Related PRs:

  • PR 1 (Prerequisite): #59430 - Broker and QueueMonitor foundation
  • PR 3 (Follow-up): Integration with TaskConsumer

Changes

New Autoscaling Policy

| Component | Description |
|-----------|-------------|
| `async_inference_autoscaling_policy()` | Scales replicas based on combined workload: `queue_length + total_num_requests` |
| `default_async_inference_autoscaling_policy` | Export alias for the new policy |

QueueMonitor Enhancements

The QueueMonitorActor now pushes queue metrics to the controller for autoscaling:

  • Accepts deployment_id and controller_handle parameters
  • Uses MetricsPusher to periodically push queue length to the controller
  • start_metrics_pusher() - deferred initialization (event loop not available in __init__)
  • Lazy initialization in get_queue_length() handles actor restarts
  • Synchronous __ray_shutdown__ (Ray calls it without awaiting)
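The deferred-initialization and lazy-restart pattern described in the bullets above can be sketched roughly as follows. This is an illustrative sketch, not the actual Ray Serve internals: only `start_metrics_pusher()` and `get_queue_length()` are named in this PR; the constructor signature details, `_push_loop`, and the controller `record` call are assumptions.

```python
import asyncio


class QueueMonitorActor:
    """Illustrative sketch of the deferred metrics-pusher pattern."""

    def __init__(self, deployment_id, controller_handle, push_interval_s=2.0):
        # No event loop is guaranteed to exist in __init__, so the
        # metrics pusher is NOT started here -- startup is deferred.
        self.deployment_id = deployment_id
        self.controller_handle = controller_handle
        self.push_interval_s = push_interval_s
        self._pusher_task = None

    def start_metrics_pusher(self):
        # Deferred startup: must be called from within the actor's event
        # loop. Idempotent, so it is safe to call lazily on every read.
        if self._pusher_task is None:
            self._pusher_task = asyncio.get_running_loop().create_task(
                self._push_loop()
            )

    async def _push_loop(self):
        while True:
            length = await self.get_queue_length()
            # In the real actor this is a remote call to the controller.
            self.controller_handle.record(self.deployment_id, length)
            await asyncio.sleep(self.push_interval_s)

    async def get_queue_length(self):
        # Lazy re-initialization: if the actor was restarted and the
        # pusher task was lost, this restarts it transparently.
        self.start_metrics_pusher()
        return 0  # the real implementation queries the message broker
```

The idempotent `start_metrics_pusher()` is what makes the lazy call in `get_queue_length()` safe: after an actor restart, the first metrics read re-creates the push loop without double-starting it on subsequent reads.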

Controller Integration

  • New record_autoscaling_metrics_from_async_inference_task_queue() method
  • New gauge: serve_autoscaling_async_inference_task_queue_metrics_delay_ms

New Types

  • AsyncInferenceTaskQueueMetricReport - dataclass for queue metrics from QueueMonitor to controller
  • AutoscalingContext.async_inference_task_queue_length - new property for queue length
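As a rough sketch, the report type above might look like the following. Only the class name and the fact that it carries a queue length come from this PR; the exact field names and types are assumptions (a send timestamp is included because the controller exposes a metrics-delay gauge).

```python
import time
from dataclasses import dataclass, field


# Illustrative shape only: the class name is from the PR, but the exact
# fields here are assumptions for the sketch.
@dataclass
class AsyncInferenceTaskQueueMetricReport:
    deployment_id: str  # which deployment the queue feeds
    queue_length: int   # pending tasks in the broker queue
    send_timestamp_s: float = field(default_factory=time.time)  # for the delay gauge
```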

Scaling Formula

```python
total_workload = queue_length + total_num_requests
desired_replicas = total_workload / target_ongoing_requests
```

Example:

  • Queue: 100 pending tasks
  • HTTP: 50 ongoing requests
  • target_ongoing_requests: 10
  • Desired replicas = (100 + 50) / 10 = 15
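The formula and worked example can be sketched as a plain function. The ceiling rounding and min/max clamping are assumptions, typical of Serve's default autoscaling policy but not spelled out in this description:

```python
import math


def desired_replicas(queue_length, total_num_requests, target_ongoing_requests,
                     min_replicas=0, max_replicas=100):
    # Combined workload from the broker queue and in-flight HTTP requests.
    total_workload = queue_length + total_num_requests
    # Round up so partial leftover workload still gets a replica,
    # then clamp to the configured replica bounds.
    desired = math.ceil(total_workload / target_ongoing_requests)
    return max(min_replicas, min(desired, max_replicas))
```

With the example numbers above, `desired_replicas(100, 50, 10)` gives 15.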

🤖 Generated with Claude Code

@harshit-anyscale harshit-anyscale self-assigned this Dec 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new queue-based autoscaling policy, which is a great addition for TaskConsumer deployments. The implementation is well-structured, with a dedicated QueueMonitor actor and comprehensive unit tests. I've identified a critical bug in the Redis connection handling and a high-severity logic issue in the scaling-to-zero implementation. Addressing these will ensure the new feature is robust and behaves as expected.

@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Dec 19, 2025
@github-actions

github-actions bot commented Jan 2, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 2, 2026
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale force-pushed the queue-based-autoscaling-part-2 branch from 86223e0 to 26d4bd7 on January 9, 2026 07:05
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 9, 2026
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale force-pushed the queue-based-autoscaling-part-2 branch from 29f8b0b to 0c7cf30 on January 12, 2026 18:08
@harshit-anyscale harshit-anyscale marked this pull request as ready for review January 12, 2026 19:07
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner January 12, 2026 19:07
@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Jan 13, 2026
Contributor

@abrarsheikh abrarsheikh left a comment


i would base this PR on top of #58857

@harshit-anyscale
Contributor Author

i would base this PR on top of #58857

i am not sure of the timeline we are targeting for #58857, but since we want to get the queue-based autoscaling feature out asap, i thought of merging this PR as it is.

and then once #58857 is merged, i will create a new PR that uses the changes from #58857 to refactor the queue-aware autoscaling policy.

@abrarsheikh lmk your thoughts on it.


Signed-off-by: harshit <harshit@anyscale.com>

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Contributor

@abrarsheikh abrarsheikh left a comment


can you add end to end tests for autoscaling, or is that not possible in this PR?

@harshit-anyscale
Contributor Author

harshit-anyscale commented Feb 3, 2026

can you add end to end tests for autoscaling, or is that not possible in this PR?

that won't be possible in this PR, since the integration of this queue-based autoscaling policy with Serve deployments is still pending. i'll add end-to-end tests in the follow-up PR.

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: harshit <harshit@anyscale.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: harshit <harshit@anyscale.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

@abrarsheikh abrarsheikh merged commit 6a5e3de into master Feb 5, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the queue-based-autoscaling-part-2 branch February 5, 2026 21:28
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
MuhammadSaif700 pushed a commit to MuhammadSaif700/ray that referenced this pull request Feb 17, 2026
Kunchd pushed a commit to Kunchd/ray that referenced this pull request Feb 17, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

go (add ONLY when ready to merge, run all tests), serve (Ray Serve Related Issue)

2 participants