## Summary
Extend the `@serve.deployment` API to accept an external PlacementGroup with explicit replica-to-bundle mapping via a private `_placement_info` parameter. This enables GPU colocation between Serve deployments and other Ray components while preserving Serve's HTTP routing, health checks, and observability.
## Motivation
### The Problem
A subset of RL training workflows requires inference engines and trainers to be placed on the same GPUs for:
- Zero-copy weight sync via CUDA IPC
- Memory efficiency through time-sharing (sleep/wake)
Ray Serve currently creates its own placement groups internally in the Serve controller. There's no way to tell Serve: "Schedule replicas on this existing placement group that my trainers are using."
While the current Ray Serve APIs support non-colocated deployment options (async RL, etc.), they fall short in colocated settings, preventing full integration with post-training frameworks.
### Current Workarounds
- **Skip Serve entirely:** Use Ray actors with custom HTTP servers. Loses routing, health checks, metrics, and orchestration functionality for PD, wide-EP, etc.
- **Separate GPUs:** No colocation; wastes GPU resources.
### Why Serve?
Ray Serve (especially with ray.llm) provides significant value:
- HTTP/gRPC routing with prefix-aware distribution
- Health checks and automatic restart
- Metrics and observability
- PD and wide-EP serving patterns
- LoRA adapter multiplexing
- Sleep/wake endpoints (already implemented)
- Collective RPC for weight sync (already implemented)
The missing piece is placement control.
## Proposed API
```python
from dataclasses import dataclass
from typing import Dict, List

import ray
from ray import serve
from ray.util.placement_group import PlacementGroup, placement_group


@dataclass
class StaticPlacementConfig:
    placement_group: PlacementGroup
    replica_bundle_mapping: Dict[int, List[int]]  # replica_rank -> bundle_indices
    capture_child_tasks: bool = True


# Usage
pg = placement_group([{"GPU": 1, "CPU": 1}] * 4)
ray.get(pg.ready())


@serve.deployment(
    _placement_info=StaticPlacementConfig(
        placement_group=pg,
        replica_bundle_mapping={
            0: [0, 1],  # Replica 0 uses bundles 0, 1 (for TP=2)
            1: [2, 3],  # Replica 1 uses bundles 2, 3
        },
    ),
    max_ongoing_requests=100,
)
class MyLLMServer:
    ...
```

### Important Semantics of `replica_bundle_mapping`
The `replica_bundle_mapping` does not enforce any scheduling policy for child actors spawned by the replica actors. Child actors can still choose to use any bundle indices (e.g., bundle indices 2 and 3 on rank 0 if they want to).
The mapping only affects:
- **Replica actor placement:** The replica actor itself will be scheduled on the first bundle index in its assigned list (e.g., replica 0 → bundle index 0, replica 1 → bundle index 2).
- **Replica context:** `serve.get_replica_context().bundle_indices` will be set to the assigned bundle indices list, allowing child actors to query this information and make informed placement decisions (e.g., via `VLLM_RAY_BUNDLE_INDICES`).
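As a sketch of how a replica could consume this information: the `bundle_indices` field on the replica context is part of this proposal (not an existing Serve API), and the `export_bundle_indices` helper below is hypothetical, showing only how the assigned indices could be turned into the comma-separated `VLLM_RAY_BUNDLE_INDICES` value that vLLM reads.

```python
import os


def export_bundle_indices(bundle_indices):
    """Format a replica's assigned bundle indices as a comma-separated
    string and export it for child actors via VLLM_RAY_BUNDLE_INDICES."""
    value = ",".join(str(i) for i in bundle_indices)
    os.environ["VLLM_RAY_BUNDLE_INDICES"] = value
    return value


# Inside a replica, the indices would come from
# serve.get_replica_context().bundle_indices (proposed field).
export_bundle_indices([0, 1])  # -> "0,1"
```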
## Semantics
| Aspect | Behavior |
|---|---|
| Replica count | Fixed to `len(replica_bundle_mapping)` |
| Autoscaling | Not supported |
| Replica actor scheduling | Rank N → scheduled on first bundle index in `replica_bundle_mapping[N]` |
| Replica context | `serve.get_replica_context().bundle_indices` set to `replica_bundle_mapping[N]` |
| Child actor scheduling | Not enforced by mapping; child actors can use any bundle indices |
| Replica failure | Restart on same bundle indices |
| Bundle/node failure | Replica enters FAILED state |
| PG lifecycle | External controller's responsibility |
| Other Serve features | Work unchanged |
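The first three rows of the table can be illustrated with a small sketch (the `scheduling_plan` helper is hypothetical, not part of the proposed API): replica count is derived from the mapping, and each rank's actor lands on the first bundle index of its assigned list.

```python
def scheduling_plan(replica_bundle_mapping):
    """Derive Serve-side scheduling facts from the mapping:
    a fixed replica count, and the bundle index on which each
    rank's replica actor is scheduled (first index of its list)."""
    num_replicas = len(replica_bundle_mapping)
    actor_bundle = {rank: idxs[0] for rank, idxs in replica_bundle_mapping.items()}
    return num_replicas, actor_bundle


# Two replicas, each spanning two bundles (e.g., TP=2):
scheduling_plan({0: [0, 1], 1: [2, 3]})  # -> (2, {0: 0, 1: 2})
```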
## Implementation Sketch
1. **Config validation**

```python
from dataclasses import dataclass
from typing import Dict, List

from ray.util.placement_group import PlacementGroup


@dataclass
class StaticPlacementConfig:
    placement_group: PlacementGroup
    replica_bundle_mapping: Dict[int, List[int]]
    capture_child_tasks: bool = True

    def __post_init__(self):
        # Ranks must be contiguous: 0, 1, 2, ...
        if sorted(self.replica_bundle_mapping) != list(range(len(self.replica_bundle_mapping))):
            raise ValueError("Replica ranks must be contiguous starting from 0.")
        # Bundle lists must be non-empty.
        if any(not idxs for idxs in self.replica_bundle_mapping.values()):
            raise ValueError("Each replica must be assigned at least one bundle.")
        # Bundle indices must be unique across replicas.
        all_indices = [i for idxs in self.replica_bundle_mapping.values() for i in idxs]
        if len(all_indices) != len(set(all_indices)):
            raise ValueError("Bundle indices must be unique across replicas.")
```

2. **Scheduler change**
```python
# deployment_scheduler.py (other arguments omitted for brevity)
def _schedule_replica(self, scheduling_request):
    if scheduling_request._placement_info is not None:
        sp = scheduling_request._placement_info
        rank = self._get_replica_rank(replica_id)
        scheduling_strategy = PlacementGroupSchedulingStrategy(
            placement_group=sp.placement_group,
            placement_group_bundle_index=sp.replica_bundle_mapping[rank][0],
            placement_group_capture_child_tasks=sp.capture_child_tasks,
        )
        # Skip PG creation; use the external PG directly.
        # Set replica_context.bundle_indices = sp.replica_bundle_mapping[rank].
    else:
        # Existing behavior.
        ...
```

3. **Stable replica ranks**
When a replica restarts, it must keep the same rank so that it lands on the same bundles: the new replica created for a failed slot inherits that slot's rank.
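One way this could work is sketched below; the `ReplicaRankTracker` class is hypothetical (not Serve internals), showing only the invariant that a rank freed by a failed replica is reused by its replacement, so the replacement lands on the same bundle indices.

```python
class ReplicaRankTracker:
    """Hypothetical helper: each deployment slot keeps a stable rank
    across restarts, so a restarted replica is rescheduled onto the
    same bundle indices as its predecessor."""

    def __init__(self, num_replicas):
        self._free_ranks = list(range(num_replicas))
        self._assigned = {}  # replica_id -> rank

    def assign(self, replica_id):
        # New replicas take the lowest free rank.
        rank = self._free_ranks.pop(0)
        self._assigned[replica_id] = rank
        return rank

    def release(self, replica_id):
        # On replica failure, return its rank so the replacement reuses it.
        self._free_ranks.insert(0, self._assigned.pop(replica_id))


tracker = ReplicaRankTracker(2)
assert tracker.assign("replica-a") == 0
assert tracker.assign("replica-b") == 1
tracker.release("replica-a")                # replica-a dies
assert tracker.assign("replica-a2") == 0    # replacement reuses rank 0
```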
## Ray Serve LLM Integration
Pass through via `deployment_config`:
```python
from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_id="meta-llama/Llama-3-8B",
    tensor_parallelism=2,
    deployment_config={
        "_placement_info": StaticPlacementConfig(
            placement_group=pg,
            replica_bundle_mapping={0: [0, 1], 1: [2, 3]},
        ),
    },
)
```

## Constraints
- **Python API only:** `PlacementGroup` handles are not serializable to YAML
- **No autoscaling:** Replica count is fixed by the bundle mapping
- **External PG ownership:** Serve does not create, monitor, or destroy the PG
## Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| New `@serve.static_deployment` API | Duplicates validation logic; prefer extending existing API |
| PG lookup by name | Adds indirection; names can duplicate |
| Actor pool without Serve | Loses Serve features; maintenance burden |
## Future options
- **Expose bundle info:** `serve.get_replica_context().bundle_indices` will be available to replica actors, allowing them to set `VLLM_RAY_BUNDLE_INDICES` for child actors. vLLM can then use these exact indices for placement decisions.
## Rollout
- **Alpha:** New API; may change based on feedback
- **Beta:** Stabilize after real-world usage (SkyRL, etc.)
- **Stable:** Consider merging into `@serve.deployment` if patterns emerge