## Summary
Extend the `@serve.deployment` API to accept an external PlacementGroup with explicit replica-to-bundle mapping via a private `_placement_info` parameter. This enables GPU colocation between Serve deployments and other Ray components while preserving Serve's HTTP routing, health checks, and observability.
## Motivation
### The Problem
A subset of RL training workflows requires inference engines and trainers to be placed on the same GPUs for:
- Zero-copy weight sync via CUDA IPC
- Memory efficiency through time-sharing (sleep/wake)
Ray Serve currently creates its own placement groups internally in the Serve controller. There's no way to tell Serve: "Schedule replicas on this existing placement group that my trainers are using."
While the current Ray Serve APIs support non-colocated deployment options (async RL, etc.), they fall short in colocated settings, preventing full integration with post-training frameworks.
### Current Workarounds
- **Skip Serve entirely:** Use Ray actors with custom HTTP servers. Loses routing, health checks, metrics, and orchestration functionality for PD, wide-EP, etc.
- **Separate GPUs:** No colocation; wastes GPU resources.
### Why Serve?
Ray Serve (especially with ray.llm) provides significant value:
- HTTP/gRPC routing with prefix-aware distribution
- Health checks and automatic restart
- Metrics and observability
- PD and wide-EP serving patterns
- LoRA adapter multiplexing
- Sleep/wake endpoints (already implemented)
- Collective RPC for weight sync (already implemented)
The missing piece is placement control.
## Proposed API
```python
from dataclasses import dataclass
from typing import Dict, List

import ray
from ray import serve
from ray.util.placement_group import PlacementGroup, placement_group


@dataclass
class StaticPlacementConfig:
    placement_group: PlacementGroup
    replica_bundle_mapping: Dict[int, List[int]]  # replica_rank -> bundle_indices
    capture_child_tasks: bool = True


# Usage
pg = placement_group([{"GPU": 1, "CPU": 1}] * 4)
ray.get(pg.ready())


@serve.deployment(
    _placement_info=StaticPlacementConfig(
        placement_group=pg,
        replica_bundle_mapping={
            0: [0, 1],  # Replica 0 uses bundles 0, 1 (for TP=2)
            1: [2, 3],  # Replica 1 uses bundles 2, 3
        },
    ),
    max_ongoing_requests=100,
)
class MyLLMServer:
    ...
```

### Important Semantics of `replica_bundle_mapping`
The `replica_bundle_mapping` does not enforce any scheduling policy for child actors spawned by the replica actors. Child actors can still choose to use any bundle indices (e.g., bundle indices 2 and 3 on rank 0 if they want to).
The mapping only affects:
- **Replica actor placement:** The replica actor itself will be scheduled on the first bundle index in its assigned list (e.g., replica 0 → bundle index 0, replica 1 → bundle index 2).
- **Replica context:** `serve.get_replica_context().bundle_indices` will be set to the assigned bundle indices list, allowing child actors to query this information and make informed placement decisions (e.g., via `VLLM_RAY_BUNDLE_INDICES`).
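As a sketch of how a replica could consume this information: the `bundle_indices` field on the replica context is part of this proposal (not an existing Serve API), and the `export_bundle_indices` helper below is hypothetical, showing only how the assigned indices could be turned into the comma-separated `VLLM_RAY_BUNDLE_INDICES` value that vLLM reads.

```python
import os


def export_bundle_indices(bundle_indices):
    """Format a replica's assigned bundle indices as a comma-separated
    string and export it for child actors via VLLM_RAY_BUNDLE_INDICES."""
    value = ",".join(str(i) for i in bundle_indices)
    os.environ["VLLM_RAY_BUNDLE_INDICES"] = value
    return value


# Inside a replica, the indices would come from
# serve.get_replica_context().bundle_indices (proposed field).
export_bundle_indices([0, 1])  # -> "0,1"
```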
## Semantics
| Aspect | Behavior |
|---|---|
| Replica count | Fixed to `len(replica_bundle_mapping)` |
| Autoscaling | Not supported |
| Replica actor scheduling | Rank N → scheduled on first bundle index in `replica_bundle_mapping[N]` |
| Replica context | `serve.get_replica_context().bundle_indices` set to `replica_bundle_mapping[N]` |
| Child actor scheduling | Not enforced by mapping; child actors can use any bundle indices |
| Replica failure | Restart on same bundle indices |
| Bundle/node failure | Replica enters FAILED state |
| PG lifecycle | External controller's responsibility |
| Other Serve features | Work unchanged |
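The first three rows of the table can be illustrated with a small sketch (the `scheduling_plan` helper is hypothetical, not part of the proposed API): replica count is derived from the mapping, and each rank's actor lands on the first bundle index of its assigned list.

```python
def scheduling_plan(replica_bundle_mapping):
    """Derive Serve-side scheduling facts from the mapping:
    a fixed replica count, and the bundle index on which each
    rank's replica actor is scheduled (first index of its list)."""
    num_replicas = len(replica_bundle_mapping)
    actor_bundle = {rank: idxs[0] for rank, idxs in replica_bundle_mapping.items()}
    return num_replicas, actor_bundle


# Two replicas, each spanning two bundles (e.g., TP=2):
scheduling_plan({0: [0, 1], 1: [2, 3]})  # -> (2, {0: 0, 1: 2})
```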
## Implementation Sketch
1. **Config validation**

```python
from dataclasses import dataclass
from typing import Dict, List

from ray.util.placement_group import PlacementGroup


@dataclass
class StaticPlacementConfig:
    placement_group: PlacementGroup
    replica_bundle_mapping: Dict[int, List[int]]
    capture_child_tasks: bool = True

    def __post_init__(self):
        # Ranks must be contiguous: 0, 1, 2, ...
        if sorted(self.replica_bundle_mapping) != list(range(len(self.replica_bundle_mapping))):
            raise ValueError("Replica ranks must be contiguous starting from 0.")
        # Bundle lists must be non-empty.
        if any(not idxs for idxs in self.replica_bundle_mapping.values()):
            raise ValueError("Each replica must be assigned at least one bundle.")
        # Bundle indices must be unique across replicas.
        all_indices = [i for idxs in self.replica_bundle_mapping.values() for i in idxs]
        if len(all_indices) != len(set(all_indices)):
            raise ValueError("Bundle indices must be unique across replicas.")
```

2. **Scheduler change**
```python
# deployment_scheduler.py (other arguments omitted for brevity)
def _schedule_replica(self, scheduling_request):
    if scheduling_request._placement_info is not None:
        sp = scheduling_request._placement_info
        rank = self._get_replica_rank(replica_id)
        scheduling_strategy = PlacementGroupSchedulingStrategy(
            placement_group=sp.placement_group,
            placement_group_bundle_index=sp.replica_bundle_mapping[rank][0],
            placement_group_capture_child_tasks=sp.capture_child_tasks,
        )
        # Skip PG creation; use the external PG directly.
        # Set replica_context.bundle_indices = sp.replica_bundle_mapping[rank].
    else:
        # Existing behavior.
        ...
```

3. **Stable replica ranks**
When a replica restarts, it must keep the same rank so that it lands on the same bundles: the new replica created for a failed slot inherits that slot's rank.
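One way this could work is sketched below; the `ReplicaRankTracker` class is hypothetical (not Serve internals), showing only the invariant that a rank freed by a failed replica is reused by its replacement, so the replacement lands on the same bundle indices.

```python
class ReplicaRankTracker:
    """Hypothetical helper: each deployment slot keeps a stable rank
    across restarts, so a restarted replica is rescheduled onto the
    same bundle indices as its predecessor."""

    def __init__(self, num_replicas):
        self._free_ranks = list(range(num_replicas))
        self._assigned = {}  # replica_id -> rank

    def assign(self, replica_id):
        # New replicas take the lowest free rank.
        rank = self._free_ranks.pop(0)
        self._assigned[replica_id] = rank
        return rank

    def release(self, replica_id):
        # On replica failure, return its rank so the replacement reuses it.
        self._free_ranks.insert(0, self._assigned.pop(replica_id))


tracker = ReplicaRankTracker(2)
assert tracker.assign("replica-a") == 0
assert tracker.assign("replica-b") == 1
tracker.release("replica-a")                # replica-a dies
assert tracker.assign("replica-a2") == 0    # replacement reuses rank 0
```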
## Ray Serve LLM Integration
Pass through via `deployment_config`:
```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_id="meta-llama/Llama-3-8B",
    tensor_parallelism=2,
    deployment_config={
        "_placement_info": StaticPlacementConfig(
            placement_group=pg,
            replica_bundle_mapping={0: [0, 1], 1: [2, 3]},
        ),
    },
)
```

## Constraints
- **Python API only:** `PlacementGroup` handles are not serializable to YAML
- **No autoscaling:** Replica count is fixed by the bundle mapping
- **External PG ownership:** Serve does not create, monitor, or destroy the PG
## Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| New `@serve.static_deployment` API | Duplicates validation logic; prefer extending existing API |
| PG lookup by name | Adds indirection; names can duplicate |
| Actor pool without Serve | Loses Serve features; maintenance burden |
## Future options
- **Expose bundle info:** `serve.get_replica_context().bundle_indices` will be available to replica actors, allowing them to set `VLLM_RAY_BUNDLE_INDICES` for child actors. vLLM can then use these exact indices for placement decisions.
## Rollout
- **Alpha:** New API; may change based on feedback
- **Beta:** Stabilize after real-world usage (SkyRL, etc.)
- **Stable:** Consider merging into `@serve.deployment` if patterns emerge