Description
Problem
The PrefixCacheAffinityRouter creates a PrefixTreeActor to track prefix cache state across replicas. Currently this actor is created with lifetime="detached", which causes stale state to persist after serve.shutdown() (reported by @kw-anyscale).
Removing lifetime="detached" fixes the stale state issue, but introduces a fault tolerance problem: the actor is owned by whichever replica/proxy creates it first. If that owner dies, the actor dies, and other replicas get ActorDiedError.
Proposed Solution
Add a controller API for creating deployment-scoped actors:
controller.get_or_create_deployment_actor(
    deployment_id: DeploymentID,
    actor_name: str,
    actor_class: type,
    *args, **kwargs
) -> ActorHandle

The actor would:
- Be created with lifetime="detached" (survives controller crashes)
- Use deterministic naming for recovery (e.g., SERVE_DEPLOYMENT_ACTOR::app::deployment::name)
- Be checkpointed to KV store (controller can recover references on restart)
- Be automatically cleaned up when deployment is deleted or serve.shutdown() is called
This follows the same pattern as replica actors, which also use detached lifetime + checkpointing + explicit cleanup.
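As a rough illustration of the deterministic naming idea, the name could be derived purely from the deployment ID and the actor name, so a restarted controller can rebuild it without any extra state. This is only a sketch: the DeploymentID class below is a simplified stand-in, and the helper deployment_actor_name is hypothetical, not an existing Serve function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentID:
    # Simplified stand-in for Serve's DeploymentID (assumed fields).
    name: str
    app_name: str = "default"


# Prefix taken from the example name in the proposal.
DEPLOYMENT_ACTOR_PREFIX = "SERVE_DEPLOYMENT_ACTOR"


def deployment_actor_name(deployment_id: DeploymentID, actor_name: str) -> str:
    """Build a name that depends only on its inputs (hypothetical helper)."""
    return "::".join(
        [DEPLOYMENT_ACTOR_PREFIX, deployment_id.app_name, deployment_id.name, actor_name]
    )


print(deployment_actor_name(DeploymentID("llm", "app"), "prefix_tree"))
```

Because the name is a pure function of its inputs, any process that knows the deployment and the actor name can look the actor up again after a crash.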
Implementation Sketch
- Controller: Add _deployment_actors: Dict[DeploymentID, Dict[str, ActorHandle]] tracking
- Controller: Add get_or_create_deployment_actor() method
- Controller: Checkpoint actor names to KV store
- Controller: Recovery logic in __init__ to look up actors by name
- DeploymentStateManager: Kill associated actors in deletion path (step 9 of update loop)
- Controller.shutdown(): Kill all deployment actors before controller exits
- Router: Call controller API instead of creating actor directly
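The steps above could fit together roughly as in the toy model below. It runs without Ray: FakeCluster stands in for the cluster's named detached actors, a plain dict stands in for the KV store, and deployments are keyed by an "app::deployment" string instead of a DeploymentID. ToyController, its methods, and the checkpoint key are all hypothetical names used for illustration, not the real controller implementation.

```python
import json
from typing import Any, Dict


class FakeCluster:
    """Stand-in for the cluster's named-actor registry (not a Ray API)."""

    def __init__(self) -> None:
        self.actors: Dict[str, Any] = {}


class ToyController:
    """Hypothetical controller that tracks deployment-scoped actors."""

    CHECKPOINT_KEY = "deployment_actors"  # hypothetical KV key

    def __init__(self, cluster: FakeCluster, kv: Dict[str, str]) -> None:
        self._cluster = cluster
        self._kv = kv
        # deployment key ("app::deployment") -> actor name -> handle
        self._deployment_actors: Dict[str, Dict[str, Any]] = {}
        self._recover()

    @staticmethod
    def _full_name(deployment_key: str, actor_name: str) -> str:
        return f"SERVE_DEPLOYMENT_ACTOR::{deployment_key}::{actor_name}"

    def _checkpoint(self) -> None:
        # Only names are persisted; handles are looked up again on recovery.
        names = {dep: sorted(actors) for dep, actors in self._deployment_actors.items()}
        self._kv[self.CHECKPOINT_KEY] = json.dumps(names)

    def _recover(self) -> None:
        raw = self._kv.get(self.CHECKPOINT_KEY)
        if raw is None:
            return
        for dep, actor_names in json.loads(raw).items():
            for actor_name in actor_names:
                handle = self._cluster.actors.get(self._full_name(dep, actor_name))
                if handle is not None:
                    self._deployment_actors.setdefault(dep, {})[actor_name] = handle

    def get_or_create_deployment_actor(
        self, deployment_key: str, actor_name: str, actor_class: type, *args, **kwargs
    ) -> Any:
        actors = self._deployment_actors.setdefault(deployment_key, {})
        if actor_name not in actors:
            handle = actor_class(*args, **kwargs)
            self._cluster.actors[self._full_name(deployment_key, actor_name)] = handle
            actors[actor_name] = handle
            self._checkpoint()
        return actors[actor_name]

    def delete_deployment(self, deployment_key: str) -> None:
        # Deletion path: kill every actor that belongs to the deployment.
        for actor_name in self._deployment_actors.pop(deployment_key, {}):
            self._cluster.actors.pop(self._full_name(deployment_key, actor_name), None)
        self._checkpoint()

    def shutdown(self) -> None:
        # Kill all deployment actors before the controller exits.
        for dep in list(self._deployment_actors):
            self.delete_deployment(dep)
```

In this model, a second ToyController built over the same cluster and KV store plays the role of a restarted controller: it finds the existing actor by name instead of creating a new one, and shutdown() leaves no actors behind.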
Context
Discussion: #60067
Related: #58835 (introduced deployment-specific namespacing for prefix trees)