
[1/3] queue-based autoscaling - add queue monitor#59430

Merged
abrarsheikh merged 18 commits into master from queue-based-autoscaling-part-1 on Jan 8, 2026
Conversation

@harshit-anyscale (Contributor) commented Dec 15, 2025

Summary

This PR is part 1 of 3 for adding queue-based autoscaling support for Ray Serve TaskConsumer deployments.

Background

TaskConsumers are workloads that consume tasks from message queues (Redis, RabbitMQ), and their scaling needs are fundamentally different from HTTP-based deployments. Instead of scaling based on HTTP request load, TaskConsumers should scale based on the number of pending tasks in the message queue.

Overall Architecture (Full Feature)

  ┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
  │  Message Queue  │◄─────│  QueueMonitor    │      │ ServeController │
  │  (Redis/RMQ)    │      │  Actor           │◄─────│ Autoscaler      │
  └─────────────────┘      └──────────────────┘      └─────────────────┘
                                   │                         │
                                   │ get_queue_length()      │
                                   └─────────────────────────┘
                                             │
                                             ▼
                                ┌───────────────────────────┐
                                │ queue_based_autoscaling   │
                                │ _policy()                 │
                                │ desired = ceil(len/target)│
                                └───────────────────────────┘
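
The policy box in the diagram reduces to a one-line formula. A minimal sketch (hypothetical function name; the actual policy lands in PR 2):

```python
import math

def queue_based_desired_replicas(queue_length: int, target_ongoing_requests: int) -> int:
    # One replica per `target_ongoing_requests` pending tasks, rounded up
    # so that even a single pending task gets a replica.
    return math.ceil(queue_length / target_ongoing_requests)
```

For example, 101 pending tasks with a target of 10 ongoing requests per replica yields 11 desired replicas.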

The full implementation consists of three PRs:

| PR             | Description                                         | Status     |
|----------------|-----------------------------------------------------|------------|
| PR 1 (This PR) | QueueMonitor actor for querying broker queue length | 🔄 Current |
| PR 2           | Introduce default queue-based autoscaling policy    | Upcoming   |
| PR 3           | Integration with TaskConsumer deployments           | Upcoming   |

This PR: QueueMonitor Actor

This PR introduces the QueueMonitor Ray actor that queries message brokers to get queue length for autoscaling decisions.

Key Features

  • Multi-broker support: Redis and RabbitMQ
  • Lightweight Ray actor: runs with `num_cpus=0`, with `pika` and `redis` supplied via the runtime env
  • Fault tolerance: Caches last known queue length on query failures
  • Named actor pattern: QUEUE_MONITOR::<deployment_name> for easy lookup

Queue Length Calculation

For accurate autoscaling, QueueMonitor returns total workload (pending tasks):

  | Broker   | Pending tasks    |
  |----------|------------------|
  | Redis    | `LLEN <queue>`   |
  | RabbitMQ | `messages_ready` |

Components

  1. QueueMonitorConfig - Configuration dataclass with broker URL and queue name
  2. QueueMonitor - Core class that initializes broker connections and queries queue length
  3. QueueMonitorActor - Ray actor wrapper for remote access
  4. Helper functions:
    - create_queue_monitor_actor() - Create named actor
    - get_queue_monitor_actor() - Lookup existing actor
    - delete_queue_monitor_actor() - Cleanup on deployment deletion
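
The fault-tolerance behavior listed under Key Features (cache the last known queue length, fall back to it on query failures) can be sketched as follows. This is a simplified illustration, not the PR's implementation; the real QueueMonitor wraps `redis`/`pika` clients rather than a plain callable:

```python
class CachingQueueMonitor:
    """Sketch: cache the last successful queue-length reading and
    return it when the broker query raises (hypothetical class)."""

    def __init__(self, query_fn):
        self._query_fn = query_fn  # e.g. lambda: redis_client.llen(queue_name)
        self._cached_length = 0

    def get_queue_length(self) -> int:
        try:
            self._cached_length = self._query_fn()
        except Exception:
            pass  # broker unreachable: serve the cached value instead
        return self._cached_length
```

This keeps autoscaling decisions stable across transient broker outages instead of collapsing the reported workload to zero.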

Test Plan

  • Unit tests for QueueMonitorConfig (7 tests)
    • Broker type detection (Redis, RabbitMQ, SQS, unknown)
    • Config value storage
  • Unit tests for QueueMonitor (4 tests)
    • Redis queue length retrieval (pending)
    • RabbitMQ queue length retrieval
    • Error handling with cached value fallback

@harshit-anyscale harshit-anyscale self-assigned this Dec 15, 2025
@harshit-anyscale harshit-anyscale changed the title add queue monitor [1/n] queue-based autoscaling - add queue monitor Dec 15, 2025
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a QueueMonitor actor for monitoring queue lengths in Redis and RabbitMQ, which is a valuable addition for asynchronous task processing. The implementation is generally well-structured and includes unit tests. However, I've identified a significant performance concern with the RabbitMQ connection handling that should be addressed. Additionally, there's a minor inconsistency in broker type detection and an opportunity to improve test coverage for the new actor helper functions. My detailed feedback is in the comments below.

@harshit-anyscale harshit-anyscale changed the title [1/n] queue-based autoscaling - add queue monitor [1/3] queue-based autoscaling - add queue monitor Dec 15, 2025
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Dec 15, 2025
@harshit-anyscale harshit-anyscale force-pushed the queue-based-autoscaling-part-1 branch from 14a17b6 to a982d4b Compare December 15, 2025 11:47
@harshit-anyscale harshit-anyscale marked this pull request as ready for review December 15, 2025 11:47
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner December 15, 2025 12:19
@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Dec 15, 2025
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale harshit-anyscale force-pushed the queue-based-autoscaling-part-1 branch from ce6a041 to ad81408 Compare December 17, 2025 18:04
@aslonnie aslonnie left a comment


hmm.. does not need my review any more?

seems that import pika is still in there though?

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
@harshit-anyscale (Contributor, Author) commented Dec 19, 2025

> hmm.. does not need my review any more?
>
> seems that import pika is still in there though?

nope, the `import pika` is now resolved as well.

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
@abrarsheikh abrarsheikh merged commit f580a27 into master Jan 8, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the queue-based-autoscaling-part-1 branch January 8, 2026 20:07
elliot-barn pushed a commit that referenced this pull request Jan 11, 2026
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
abrarsheikh pushed a commit that referenced this pull request Feb 5, 2026
…olicy (#59548)

## Summary

This PR adds queue-based autoscaling support for async inference
workloads in Ray Serve. It enables deployments to scale based on
combined workload from both the message broker queue and HTTP requests.

**Related PRs:**
- PR 1 (Prerequisite):
[#59430](#59430) - Broker and
QueueMonitor foundation
- PR 3 (Follow-up): Integration with TaskConsumer

## Changes

### New Autoscaling Policy

| Component | Description |
|-----------|-------------|
| `async_inference_autoscaling_policy()` | Scales replicas based on combined workload: `queue_length + total_num_requests` |
| `default_async_inference_autoscaling_policy` | Export alias for the new policy |

### QueueMonitor Enhancements

The `QueueMonitorActor` now pushes queue metrics to the controller for
autoscaling:

- Accepts `deployment_id` and `controller_handle` parameters
- Uses `MetricsPusher` to periodically push queue length to the
controller
- `start_metrics_pusher()` - deferred initialization (event loop not
available in `__init__`)
- Lazy initialization in `get_queue_length()` handles actor restarts
- Synchronous `__ray_shutdown__` (Ray calls it without awaiting)

### Controller Integration

- New `record_autoscaling_metrics_from_async_inference_task_queue()`
method
- New gauge:
`serve_autoscaling_async_inference_task_queue_metrics_delay_ms`

### New Types

- `AsyncInferenceTaskQueueMetricReport` - dataclass for queue metrics
from QueueMonitor to controller
- `AutoscalingContext.async_inference_task_queue_length` - new property
for queue length

## Scaling Formula

```python
total_workload = queue_length + total_num_requests
desired_replicas = total_workload / target_ongoing_requests
```

Example:
- Queue: 100 pending tasks
- HTTP: 50 ongoing requests
- `target_ongoing_requests`: 10
- Desired replicas = (100 + 50) / 10 = 15
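
The formula and worked example above can be put together as an executable sketch (hypothetical function name, mirroring the commit's stated formula; the PR's policy also clamps to min/max replica bounds, which is omitted here):

```python
import math

def async_inference_desired_replicas(
    queue_length: int, total_num_requests: int, target_ongoing_requests: int
) -> int:
    # Combined workload: pending broker tasks plus in-flight HTTP requests.
    total_workload = queue_length + total_num_requests
    return math.ceil(total_workload / target_ongoing_requests)
```

With the example numbers: `async_inference_desired_replicas(100, 50, 10)` gives 15.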

---
🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Signed-off-by: harshit <harshit@anyscale.com>
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
tiennguyentony pushed a commit to tiennguyentony/ray that referenced this pull request Feb 7, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
MuhammadSaif700 pushed a commit to MuhammadSaif700/ray that referenced this pull request Feb 17, 2026
…olicy (ray-project#59548)

Kunchd pushed a commit to Kunchd/ray that referenced this pull request Feb 17, 2026
…olicy (ray-project#59548)

ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
…olicy (ray-project#59548)

Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
…olicy (ray-project#59548)

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
### Summary

This PR is part 1 of 3 for adding queue-based autoscaling support for
Ray Serve TaskConsumer deployments.

### Background

TaskConsumers are workloads that consume tasks from message queues
(Redis, RabbitMQ), and their scaling needs are fundamentally different
from HTTP-based deployments. Instead of scaling based on HTTP request
load, TaskConsumers should scale based on the number of pending tasks in
the message queue.

### Overall Architecture (Full Feature)
```
  ┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
  │  Message Queue  │◄─────│  QueueMonitor    │      │ ServeController │
  │  (Redis/RMQ)    │      │  Actor           │◄─────│ Autoscaler      │
  └─────────────────┘      └──────────────────┘      └─────────────────┘
                                   │                         │
                                   │ get_queue_length()      │
                                   └─────────────────────────┘
                                             │
                                             ▼
                                ┌───────────────────────────┐
                                │ queue_based_autoscaling   │
                                │ _policy()                 │
                                │ desired = ceil(len/target)│
                                └───────────────────────────┘
```
The full implementation consists of three PRs:

| PR | Description | Status |
|----------------|-----------------------------------------------------|------------|
| PR 1 (This PR) | QueueMonitor actor for querying broker queue length | 🔄 Current |
| PR 2 | Introduce default Queue-based autoscaling policy | Upcoming |
| PR 3 | Integration with TaskConsumer deployments | Upcoming |

### This PR: QueueMonitor Actor

This PR introduces the QueueMonitor Ray actor, which queries message
brokers for the queue length used in autoscaling decisions.

### Key Features

- Multi-broker support: Redis and RabbitMQ
- Lightweight Ray actor: runs with `num_cpus=0`, with `pika` and `redis`
provided via the runtime env
- Fault tolerance: caches the last known queue length and returns it when a
query fails
- Named actor pattern: `QUEUE_MONITOR::<deployment_name>` for easy lookup
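The fault-tolerance behavior above (falling back to the last known length when a broker query fails) can be sketched roughly as follows. The class and attribute names here are hypothetical stand-ins, not the actual QueueMonitor implementation:

```python
class CachedQueueLength:
    """Minimal sketch (hypothetical class, not the real QueueMonitor):
    on broker query failure, fall back to the last known length."""

    def __init__(self, query_fn):
        self._query_fn = query_fn   # e.g. wraps LLEN or messages_ready
        self._cached = 0            # last successfully observed length

    def get_queue_length(self) -> int:
        try:
            self._cached = self._query_fn()
        except Exception:
            # Broker unreachable: keep feeding the autoscaler the last
            # known (stale but sane) value instead of raising.
            pass
        return self._cached
```

The design choice is that a stale reading is a better autoscaling signal than an exception that would stall the control loop.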

### Queue Length Calculation

For accurate autoscaling, QueueMonitor returns the total workload (pending
tasks):

| Broker   | Pending Tasks    |
|----------|------------------|
| Redis    | `LLEN <queue>`   |
| RabbitMQ | `messages_ready` |

### Components

1. `QueueMonitorConfig` - configuration dataclass with broker URL and
queue name
2. `QueueMonitor` - core class that initializes broker connections and
queries queue length
3. `QueueMonitorActor` - Ray actor wrapper for remote access
4. Helper functions:
   - `create_queue_monitor_actor()` - create the named actor
   - `get_queue_monitor_actor()` - look up an existing actor
   - `delete_queue_monitor_actor()` - clean up on deployment deletion
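Broker-type detection from the connection URL (exercised by the config tests below) might look roughly like this. The function name and the exact scheme set are assumptions inferred from the test plan, not the actual `QueueMonitorConfig` code:

```python
from urllib.parse import urlparse

def detect_broker_type(broker_url: str) -> str:
    """Classify a broker connection URL by its scheme (illustrative)."""
    parsed = urlparse(broker_url)
    scheme = parsed.scheme.lower()
    if scheme in ("redis", "rediss"):
        return "redis"
    if scheme in ("amqp", "amqps"):
        return "rabbitmq"
    # SQS queue URLs look like https://sqs.<region>.amazonaws.com/...
    if scheme in ("http", "https") and parsed.netloc.startswith("sqs."):
        return "sqs"
    return "unknown"
```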

### Test Plan

- Unit tests for `QueueMonitorConfig` (7 tests)
  - Broker type detection (Redis, RabbitMQ, SQS, unknown)
  - Config value storage
- Unit tests for `QueueMonitor` (4 tests)
  - Redis queue length retrieval (pending)
  - RabbitMQ queue length retrieval
  - Error handling with cached value fallback

---------

Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…olicy (ray-project#59548)

peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…olicy (ray-project#59548)


Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue


3 participants