
[RFC][Serve] Static Deployment with External Placement Group Support #59857

@kouroshHakha

Description

Summary

Extend the @serve.deployment API to accept an external PlacementGroup with an explicit replica-to-bundle mapping via a private _placement_info parameter. This enables GPU colocation between Serve deployments and other Ray components while preserving Serve's HTTP routing, health checks, and observability.

Motivation

The Problem

A subset of RL training workflows require inference engines and trainers to be placed on the same GPUs for:

  • Zero-copy weight sync via CUDA IPC
  • Memory efficiency through time-sharing (sleep/wake)

Ray Serve currently creates its own placement groups internally in the Serve controller. There's no way to tell Serve: "Schedule replicas on this existing placement group that my trainers are using."

While the current Ray Serve APIs support non-colocated deployment patterns (async RL, etc.), they fall short in colocated settings, preventing full integration with post-training frameworks.

Current Workarounds

  1. Skip Serve entirely — Use Ray actors with custom HTTP servers. This loses routing, health checks, metrics, and the orchestration functionality needed for PD, wide-EP, etc.
  2. Separate GPUs — No colocation; GPU resources are wasted.

Why Serve?

Ray Serve (especially with ray.llm) provides significant value:

  • HTTP/gRPC routing with prefix-aware distribution
  • Health checks and automatic restart
  • Metrics and observability
  • PD and wide-EP serving patterns
  • LoRA adapter multiplexing
  • Sleep/wake endpoints (already implemented)
  • Collective RPC for weight sync (already implemented)

The missing piece is placement control.

Proposed API

from dataclasses import dataclass
from typing import Dict, List

import ray
from ray import serve
from ray.util.placement_group import PlacementGroup, placement_group

@dataclass
class StaticPlacementConfig:
    placement_group: PlacementGroup
    replica_bundle_mapping: Dict[int, List[int]]  # replica_rank -> bundle_indices
    capture_child_tasks: bool = True

# Usage
pg = placement_group([{"GPU": 1, "CPU": 1}] * 4)
ray.get(pg.ready())

@serve.deployment(
    _placement_info=StaticPlacementConfig(
        placement_group=pg,
        replica_bundle_mapping={
            0: [0, 1],  # Replica 0 uses bundles 0, 1 (for TP=2)
            1: [2, 3],  # Replica 1 uses bundles 2, 3
        },
    ),
    max_ongoing_requests=100,
)
class MyLLMServer:
    ...

Important Semantics of replica_bundle_mapping

The replica_bundle_mapping does not enforce any scheduling policy for child actors spawned by the replica actors. Child actors can still choose to use any bundle indices (e.g., bundle indices 2 and 3 on rank 0 if they want to).

The mapping only affects:

  1. Replica actor placement: The replica actor itself will be scheduled on the first bundle index in its assigned list (e.g., replica 0 → bundle index 0, replica 1 → bundle index 2).
  2. Replica context: serve.get_replica_context().bundle_indices will be set to the assigned bundle indices list, allowing child actors to query this information and make informed placement decisions (e.g., via VLLM_RAY_BUNDLE_INDICES).
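As a concrete sketch of point 2: a replica could forward its assigned bundle indices to vLLM child actors through the VLLM_RAY_BUNDLE_INDICES environment variable. The helper below is illustrative, not part of the proposed API; only the bundle_indices field on the replica context is what this RFC proposes.

```python
import os
from typing import List


def export_bundle_indices(bundle_indices: List[int]) -> str:
    """Hypothetical helper: format the replica's assigned bundle indices as
    the comma-separated string vLLM reads from VLLM_RAY_BUNDLE_INDICES
    (e.g. [0, 1] -> "0,1" for a TP=2 engine on bundles 0 and 1)."""
    value = ",".join(str(i) for i in bundle_indices)
    os.environ["VLLM_RAY_BUNDLE_INDICES"] = value
    return value


# Inside the replica, bundle_indices would come from
# serve.get_replica_context().bundle_indices (proposed in this RFC).
```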

Semantics

| Aspect | Behavior |
| --- | --- |
| Replica count | Fixed to `len(replica_bundle_mapping)` |
| Autoscaling | Not supported |
| Replica actor scheduling | Rank N → scheduled on first bundle index in `replica_bundle_mapping[N]` |
| Replica context | `serve.get_replica_context().bundle_indices` set to `replica_bundle_mapping[N]` |
| Child actor scheduling | Not enforced by mapping; child actors can use any bundle indices |
| Replica failure | Restart on same bundle indices |
| Bundle/node failure | Replica enters FAILED state |
| PG lifecycle | External controller's responsibility |
| Other Serve features | Work unchanged |

Implementation Sketch

1. Config validation

@dataclass
class StaticPlacementConfig:
    placement_group: PlacementGroup
    replica_bundle_mapping: Dict[int, List[int]]
    capture_child_tasks: bool = True

    def __post_init__(self):
        mapping = self.replica_bundle_mapping
        # Ranks must be contiguous: 0, 1, 2, ...
        if sorted(mapping) != list(range(len(mapping))):
            raise ValueError("Replica ranks must be contiguous starting at 0.")
        # Bundle lists must be non-empty
        if any(not indices for indices in mapping.values()):
            raise ValueError("Each replica needs at least one bundle index.")
        # Bundle indices must be unique across replicas
        flat = [i for indices in mapping.values() for i in indices]
        if len(flat) != len(set(flat)):
            raise ValueError("Bundle indices must not overlap across replicas.")

2. Scheduler change

# deployment_scheduler.py

def _schedule_replica(self, scheduling_request, ...):
    if scheduling_request._placement_info is not None:
        sp = scheduling_request._placement_info
        rank = self._get_replica_rank(scheduling_request.replica_id)

        scheduling_strategy = PlacementGroupSchedulingStrategy(
            placement_group=sp.placement_group,
            placement_group_bundle_index=sp.replica_bundle_mapping[rank][0],
            placement_group_capture_child_tasks=sp.capture_child_tasks,
        )
        # Skip PG creation; use the external PG directly.
        # Set replica_context.bundle_indices = sp.replica_bundle_mapping[rank].
    else:
        # Existing behavior
        ...
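The bundle-selection rule above can be sketched as a small pure-Python helper (strategy_kwargs_for_rank is hypothetical; the actual scheduler would feed these values into PlacementGroupSchedulingStrategy):

```python
from typing import Any, Dict, List


def strategy_kwargs_for_rank(
    replica_bundle_mapping: Dict[int, List[int]],
    rank: int,
    capture_child_tasks: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: the replica actor itself is pinned to the FIRST
    bundle index in its assigned list; the full list is only exposed via the
    replica context for child actors to consult."""
    bundle_indices = replica_bundle_mapping[rank]
    return {
        "placement_group_bundle_index": bundle_indices[0],
        "placement_group_capture_child_tasks": capture_child_tasks,
    }
```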

3. Stable replica ranks

When a replica restarts, it must keep the same rank so that it lands on the same bundles: the replacement replica created for that slot is assigned the failed replica's rank.
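One possible sketch of such rank bookkeeping (ReplicaRankTracker is hypothetical, not existing Serve code): freed ranks are reused for replacement replicas, so a restarted replica is scheduled onto the same bundle indices.

```python
import heapq
from typing import Dict


class ReplicaRankTracker:
    """Hypothetical sketch of stable rank assignment across restarts."""

    def __init__(self, num_replicas: int) -> None:
        self._free = list(range(num_replicas))  # min-heap of free ranks
        self._ranks: Dict[str, int] = {}  # replica_id -> rank

    def assign(self, replica_id: str) -> int:
        # Hand out the lowest free rank (the slot just vacated, if any).
        rank = heapq.heappop(self._free)
        self._ranks[replica_id] = rank
        return rank

    def release(self, replica_id: str) -> None:
        # Free the dead replica's rank so its replacement reuses it.
        heapq.heappush(self._free, self._ranks.pop(replica_id))
```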

Ray Serve LLM Integration

Pass through via deployment_config:

from ray.serve.llm import LLMConfig

config = LLMConfig(
    model_id="meta-llama/Llama-3-8B",
    tensor_parallelism=2,
    deployment_config={
        "_placement_info": StaticPlacementConfig(
            placement_group=pg,
            replica_bundle_mapping={0: [0, 1], 1: [2, 3]},
        ),
    },
)

Constraints

  • Python API only — PlacementGroup handles are not serializable to YAML
  • No autoscaling — Replica count fixed by bundle mapping
  • External PG ownership — Serve does not create, monitor, or destroy the PG

Alternatives Considered

| Alternative | Why Rejected |
| --- | --- |
| New `@serve.static_deployment` API | Duplicates validation logic; prefer extending existing API |
| PG lookup by name | Adds indirection; names can duplicate |
| Actor pool without Serve | Loses Serve features; maintenance burden |

Future options

  • Expose bundle info — serve.get_replica_context().bundle_indices will be available to replica actors, allowing them to set VLLM_RAY_BUNDLE_INDICES for child actors. vLLM can then use these exact indices for placement decisions.

Rollout

  1. Alpha — New API, may change based on feedback
  2. Beta — Stabilize after real-world usage (SkyRL, etc.)
  3. Stable — Consider merging into @serve.deployment if patterns emerge
