[Serve] Add controller-managed deployment-scoped actors #60359

@eicherseiji

Problem

The PrefixCacheAffinityRouter creates a PrefixTreeActor to track prefix cache state across replicas. This actor is currently created with lifetime="detached", so it survives serve.shutdown() and leaves stale state behind (reported by @kw-anyscale).

Removing lifetime="detached" fixes the stale-state issue but introduces a fault-tolerance problem: the actor is then owned by whichever replica or proxy creates it first. If that owner dies, the actor dies with it, and the other replicas get ActorDiedError.

Proposed Solution

Add a controller API for creating deployment-scoped actors:

controller.get_or_create_deployment_actor(
    deployment_id: DeploymentID,
    actor_name: str,
    actor_class: type,
    *args, **kwargs
) -> ActorHandle

The actor would:

  • Be created with lifetime="detached" (survives controller crashes)
  • Use deterministic naming for recovery (e.g., SERVE_DEPLOYMENT_ACTOR::app::deployment::name)
  • Be checkpointed to KV store (controller can recover references on restart)
  • Be automatically cleaned up when deployment is deleted or serve.shutdown() is called

This follows the same pattern as replica actors, which also use detached lifetime + checkpointing + explicit cleanup.
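To make the deterministic naming concrete, here is a minimal sketch of how such a name could be derived. The DeploymentID dataclass below is a local stand-in for Serve's type, and the helper name deployment_actor_name is hypothetical; only the SERVE_DEPLOYMENT_ACTOR::app::deployment::name layout comes from the proposal above.

```python
from dataclasses import dataclass

# Prefix taken from the example name in the proposal.
DEPLOYMENT_ACTOR_PREFIX = "SERVE_DEPLOYMENT_ACTOR"


@dataclass(frozen=True)
class DeploymentID:
    """Local stand-in for Serve's DeploymentID (assumed fields: name, app_name)."""

    name: str
    app_name: str = "default"


def deployment_actor_name(deployment_id: DeploymentID, actor_name: str) -> str:
    """Build a name that depends only on (app, deployment, actor_name).

    Hypothetical helper: because the name is a pure function of its inputs,
    a restarted controller can recompute it and look the actor up again.
    """
    return "::".join(
        [
            DEPLOYMENT_ACTOR_PREFIX,
            deployment_id.app_name,
            deployment_id.name,
            actor_name,
        ]
    )
```

With a name like this, recovery needs no extra state beyond the (deployment, actor_name) pairs stored in the checkpoint.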

Implementation Sketch

  1. Controller: Add _deployment_actors: Dict[DeploymentID, Dict[str, ActorHandle]] tracking
  2. Controller: Add get_or_create_deployment_actor() method
  3. Controller: Checkpoint actor names to KV store
  4. Controller: Recovery logic in __init__ to look up actors by name
  5. DeploymentStateManager: Kill associated actors in deletion path (step 9 of update loop)
  6. Controller.shutdown(): Kill all deployment actors before controller exits
  7. Router: Call controller API instead of creating actor directly
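The controller-side steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: it runs without Ray by taking create, lookup, and kill callables, and the class name DeploymentActorRegistry, the InMemoryKV store, the checkpoint key, and the use of a plain "app::deployment" string as the deployment key are all hypothetical. In a real controller the callables would presumably wrap a detached named-actor creation, ray.get_actor, and ray.kill.

```python
import json
from typing import Any, Callable, Dict, Optional


class InMemoryKV:
    """Stand-in for the controller's KV store (hypothetical interface)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class DeploymentActorRegistry:
    CHECKPOINT_KEY = "deployment-actors"  # hypothetical checkpoint key

    def __init__(
        self,
        kv: InMemoryKV,
        create: Callable[[str], Any],            # full name -> new handle
        lookup: Callable[[str], Optional[Any]],  # full name -> handle or None
        kill: Callable[[Any], None],             # handle -> None
    ) -> None:
        self._kv, self._create, self._lookup, self._kill = kv, create, lookup, kill
        # Step 1: deployment key -> {actor_name: handle}
        self._deployment_actors: Dict[str, Dict[str, Any]] = {}
        self._recover()  # Step 4: recovery on construction

    @staticmethod
    def _full_name(deployment: str, actor_name: str) -> str:
        return f"SERVE_DEPLOYMENT_ACTOR::{deployment}::{actor_name}"

    def _checkpoint(self) -> None:
        # Step 3: persist only names; handles are re-resolved on recovery.
        names = {d: sorted(a) for d, a in self._deployment_actors.items()}
        self._kv.put(self.CHECKPOINT_KEY, json.dumps(names).encode())

    def _recover(self) -> None:
        raw = self._kv.get(self.CHECKPOINT_KEY)
        if raw is None:
            return
        for deployment, actor_names in json.loads(raw).items():
            for actor_name in actor_names:
                handle = self._lookup(self._full_name(deployment, actor_name))
                if handle is not None:  # skip actors that no longer exist
                    self._deployment_actors.setdefault(deployment, {})[actor_name] = handle

    def get_or_create(self, deployment: str, actor_name: str) -> Any:
        # Step 2: return the tracked handle, creating the actor on first use.
        actors = self._deployment_actors.setdefault(deployment, {})
        if actor_name not in actors:
            actors[actor_name] = self._create(self._full_name(deployment, actor_name))
            self._checkpoint()
        return actors[actor_name]

    def delete_deployment(self, deployment: str) -> None:
        # Step 5: kill every actor associated with a deleted deployment.
        for handle in self._deployment_actors.pop(deployment, {}).values():
            self._kill(handle)
        self._checkpoint()

    def shutdown(self) -> None:
        # Step 6: kill all deployment actors before the controller exits.
        for deployment in list(self._deployment_actors):
            self.delete_deployment(deployment)
```

The sketch checkpoints names rather than handles, which is why deterministic naming matters: a second registry built on the same KV store (simulating a controller restart) recovers the same actors by name instead of creating duplicates.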

Context

Discussion: #60067
Related: #58835 (introduced deployment-specific namespacing for prefix trees)
