[Serve] Add generic actor registration API for shutdown cleanup#60067
[Serve] Add generic actor registration API for shutdown cleanup#60067abrarsheikh merged 27 commits intoray-project:masterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses an issue where detached actors in Serve sub-namespaces were not cleaned up during shutdown. The fix introduces a new function, _kill_actors_in_serve_subnamespaces, which is called during the shutdown process to find and kill these actors. The implementation is sound and is accompanied by a good regression test. My review includes a suggestion to improve the efficiency and robustness of the new cleanup function by leveraging more advanced filtering capabilities of the list_actors API and ensuring safe access to actor data.
7aaf127 to
5db0cff
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a necessary cleanup mechanism for detached actors in serve sub-namespaces during shutdown, preventing resource leaks and stale state across restarts. The implementation is clean and includes a new method, _kill_actors_in_serve_subnamespaces, which is appropriately called during the shutdown sequence. The accompanying regression test is thorough and effectively validates the new functionality. I have one suggestion to make the actor cleanup logic more precise.
5db0cff to
2f9d703
Compare
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
2f9d703 to
504bdb8
Compare
|
Addressed bot review comments:
|
Previously, actors in namespaces like 'serve::app::deployment' (e.g., PrefixTreeActor) would survive serve.shutdown() because the cleanup only targeted the main 'serve' namespace. This adds _kill_actors_in_serve_subnamespaces() to the controller's shutdown method to clean up any actors in namespaces starting with 'serve::' before the controller exits. The test verifies that: 1. Actors in sub-namespaces are killed on shutdown 2. After restart, a fresh tree is created with no stale state Fixes stale replica accumulation in PrefixTreeActor after serve restart. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
504bdb8 to
d23a9bd
Compare
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ix-tree-cleanup
abrarsheikh
left a comment
There was a problem hiding this comment.
what is special about PrefixTreeActor compared to other deployments that serve manages? Why are those actors getting killed and not PrefixTreeActor
|
|
||
| def _kill_actors_in_serve_subnamespaces(self) -> None: | ||
| """Kill all actors in namespaces that start with 'serve::'.""" | ||
| from ray.util.state import list_actors |
There was a problem hiding this comment.
move imports to top of file
There was a problem hiding this comment.
Needed to avoid a API docs CI failure: https://buildkite.com/ray-project/premerge/builds/57532#019bb8eb-400f-4eea-9c63-5ea269d4a85c
I will add a comment
|
@abrarsheikh I implemented it this way assuming that all namespaces under |
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Instead of creating PrefixTreeActor in a subnamespace (serve::app::deployment) and needing special cleanup logic, use a unique actor name (LlmPrefixTreeActor_app_deployment) in the main serve namespace. This allows the actor to be cleaned up through normal Serve cleanup mechanisms, removing the need for _kill_actors_in_serve_subnamespaces(). The test verifies that actors in the serve namespace with unique names are properly cleaned up on shutdown, preventing stale tenant state after restart. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Move LLM_PREFIX_TREE_ACTOR_NAME_PREFIX import inside _kill_prefix_tree_actors() to avoid triggering ray.llm observability setup at module load time, which breaks Ray Client mode. Also switch from list_actors (State API) to list_named_actors which has proper Ray Client support. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
kouroshHakha
left a comment
There was a problem hiding this comment.
cc @abrarsheikh to chime in.
| """ | ||
| return self.kv_store.get(CONFIG_CHECKPOINT_KEY) is None | ||
|
|
||
| def _kill_prefix_tree_actors(self) -> None: |
There was a problem hiding this comment.
I am very iffy on this hack. because this is making serve a dependency between serve llm and serve in the opposite way. Not sure how @abrarsheikh feels about it :)
Is there any other proposal that is less hacky that can implemented quickly?
Replace the hacky LLM-specific _kill_prefix_tree_actors() cleanup with a generic register_shutdown_cleanup_actor() API that allows any deployment to register auxiliary actors for cleanup on serve.shutdown(). Changes: - Add _registered_cleanup_actors list and register_shutdown_cleanup_actor() to ServeController - Add _kill_registered_cleanup_actors() called during shutdown - Update PrefixCacheAffinityRouter to register its tree actor with controller - Remove LLM_PREFIX_TREE_ACTOR_NAME_PREFIX constant (no longer needed externally) - Update tests to use injected tree actors where serve controller isn't running This eliminates the Serve → LLM import dependency while providing a general-purpose mechanism for deployment-scoped actor cleanup. Limitations (acceptable for stopgap solution): - Actors only cleaned on serve.shutdown(), not deployment deletion - Registration not persisted across controller restarts - For full lifecycle management: see ray-project#60359 Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Resolve conflicts: - controller.py: Keep both _registered_cleanup_actors and _health_metrics_tracker - test_standalone.py: Keep new test and apply upstream's ray_shutdown param to test_deployment Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Change _registered_cleanup_actors from List to Dict keyed by actor ID. This prevents the same actor from being registered multiple times when multiple router instances share a tree actor via get_if_exists=True. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
kouroshHakha
left a comment
There was a problem hiding this comment.
the implementation and requirement make sense. Leaving the nits / full review to @abrarsheikh . Thanks @eicherseiji
|
|
||
| This allows deployments to register auxiliary actors (like caches, | ||
| coordinators, etc.) that should be cleaned up when Serve shuts down. | ||
| The actors must use lifetime="detached" to survive replica restarts, |
There was a problem hiding this comment.
why must the actor be "detached"? will it cause any implications if it's not detached?
There was a problem hiding this comment.
so I understand how we want it to survive the replica restarts, but is there a layer in-between replica and serve shutdown? like deployment shutdown? I don't know if my question makes sense.
There was a problem hiding this comment.
If not detached, it will be killed when the replica that created it dies. I.e. autoscaling will break the router. And agree, deployment-scoped would be better. I think we can add that with the full fix in #60359
| self._tree_actor = PrefixTreeActor.options( | ||
| name=actor_name, | ||
| namespace=actor_namespace, | ||
| get_if_exists=True, | ||
| lifetime="detached", | ||
| ).remote() |
There was a problem hiding this comment.
does core guarantee only one actor is created if this call happens concurrently ?
There was a problem hiding this comment.
Yes, the potential race is handled explicitly in the Core code path:
Line 1536 in edc14c5
|
|
||
| Note: Registered actors are NOT persisted across controller restarts. | ||
| For full persistence, use controller-managed deployment-scoped actors | ||
| (see https://github.com/ray-project/ray/issues/60359). |
There was a problem hiding this comment.
+1, I think we should really do what's being proposed in the issue. Make it available through both declarative and imperative serve APIs. We already have a few usecases for this
- Queue Monitor Actor for async inferance - deployment scoped
- Global singleton priority request queue - deployment scoped
| """ | ||
| return self.kv_store.get(CONFIG_CHECKPOINT_KEY) is None | ||
|
|
||
| def register_shutdown_cleanup_actor(self, actor_handle: ActorHandle) -> None: |
There was a problem hiding this comment.
i suggest keeping this private. only available to LLM folks, since we are friends. We can do a robust implementation as proposed in #60359 and then open it up to public.
| def register_shutdown_cleanup_actor(self, actor_handle: ActorHandle) -> None: | |
| def _register_shutdown_cleanup_actor(self, actor_handle: ActorHandle) -> None: |
Signed-off-by: Seiji Eicher <seiji@anyscale.com> # Conflicts: # python/ray/serve/tests/test_standalone.py
Rename to _register_shutdown_cleanup_actor since this is an internal API for LLM folks only. A more robust public API will be implemented later per issue ray-project#60359. Addresses review feedback from Abrar on ray-project#60067. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
|
Passed on retry, port cleanup issue unrelated to this change |
…project#60067) ## Summary Adds a `register_shutdown_cleanup_actor()` method to ServeController that allows deployments to register auxiliary actors for cleanup on `serve.shutdown()`. The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency. ## Changes - Add `_registered_cleanup_actors` list and `register_shutdown_cleanup_actor()` to controller - Add `_kill_registered_cleanup_actors()` called during shutdown - Update PrefixCacheAffinityRouter to register tree actor with controller - Replace old test with integration test for the new API ## Limitations (stopgap solution) - Actors only cleaned on `serve.shutdown()`, not deployment deletion - Registration not persisted across controller restarts - Full lifecycle management: see ray-project#60359 ## Checks - [x] I've signed off every commit - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. - [x] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [x] Integration test --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…project#60067) ## Summary Adds a `register_shutdown_cleanup_actor()` method to ServeController that allows deployments to register auxiliary actors for cleanup on `serve.shutdown()`. The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency. ## Changes - Add `_registered_cleanup_actors` list and `register_shutdown_cleanup_actor()` to controller - Add `_kill_registered_cleanup_actors()` called during shutdown - Update PrefixCacheAffinityRouter to register tree actor with controller - Replace old test with integration test for the new API ## Limitations (stopgap solution) - Actors only cleaned on `serve.shutdown()`, not deployment deletion - Registration not persisted across controller restarts - Full lifecycle management: see ray-project#60359 ## Checks - [x] I've signed off every commit - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. - [x] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [x] Integration test --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
## Summary Adds a `register_shutdown_cleanup_actor()` method to ServeController that allows deployments to register auxiliary actors for cleanup on `serve.shutdown()`. The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency. ## Changes - Add `_registered_cleanup_actors` list and `register_shutdown_cleanup_actor()` to controller - Add `_kill_registered_cleanup_actors()` called during shutdown - Update PrefixCacheAffinityRouter to register tree actor with controller - Replace old test with integration test for the new API ## Limitations (stopgap solution) - Actors only cleaned on `serve.shutdown()`, not deployment deletion - Registration not persisted across controller restarts - Full lifecycle management: see #60359 ## Checks - [x] I've signed off every commit - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. - [x] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [x] Integration test --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Summary Adds a `register_shutdown_cleanup_actor()` method to ServeController that allows deployments to register auxiliary actors for cleanup on `serve.shutdown()`. The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency. ## Changes - Add `_registered_cleanup_actors` list and `register_shutdown_cleanup_actor()` to controller - Add `_kill_registered_cleanup_actors()` called during shutdown - Update PrefixCacheAffinityRouter to register tree actor with controller - Replace old test with integration test for the new API ## Limitations (stopgap solution) - Actors only cleaned on `serve.shutdown()`, not deployment deletion - Registration not persisted across controller restarts - Full lifecycle management: see #60359 ## Checks - [x] I've signed off every commit - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. - [x] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [x] Integration test --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
…project#60067) ## Summary Adds a `register_shutdown_cleanup_actor()` method to ServeController that allows deployments to register auxiliary actors for cleanup on `serve.shutdown()`. The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency. ## Changes - Add `_registered_cleanup_actors` list and `register_shutdown_cleanup_actor()` to controller - Add `_kill_registered_cleanup_actors()` called during shutdown - Update PrefixCacheAffinityRouter to register tree actor with controller - Replace old test with integration test for the new API ## Limitations (stopgap solution) - Actors only cleaned on `serve.shutdown()`, not deployment deletion - Registration not persisted across controller restarts - Full lifecycle management: see ray-project#60359 ## Checks - [x] I've signed off every commit - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. - [x] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [x] Integration test --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…project#60067) ## Summary Adds a `register_shutdown_cleanup_actor()` method to ServeController that allows deployments to register auxiliary actors for cleanup on `serve.shutdown()`. The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency. ## Changes - Add `_registered_cleanup_actors` list and `register_shutdown_cleanup_actor()` to controller - Add `_kill_registered_cleanup_actors()` called during shutdown - Update PrefixCacheAffinityRouter to register tree actor with controller - Replace old test with integration test for the new API ## Limitations (stopgap solution) - Actors only cleaned on `serve.shutdown()`, not deployment deletion - Registration not persisted across controller restarts - Full lifecycle management: see ray-project#60359 ## Checks - [x] I've signed off every commit - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. - [x] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [x] Integration test --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…project#60067) ## Summary Adds a `register_shutdown_cleanup_actor()` method to ServeController that allows deployments to register auxiliary actors for cleanup on `serve.shutdown()`. The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency. ## Changes - Add `_registered_cleanup_actors` list and `register_shutdown_cleanup_actor()` to controller - Add `_kill_registered_cleanup_actors()` called during shutdown - Update PrefixCacheAffinityRouter to register tree actor with controller - Replace old test with integration test for the new API ## Limitations (stopgap solution) - Actors only cleaned on `serve.shutdown()`, not deployment deletion - Registration not persisted across controller restarts - Full lifecycle management: see ray-project#60359 ## Checks - [x] I've signed off every commit - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. - [x] I've made sure the tests are passing. - Testing Strategy - [x] Unit tests - [x] Integration test --------- Signed-off-by: Seiji Eicher <seiji@anyscale.com> Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Summary
Adds a
register_shutdown_cleanup_actor()method to ServeController that allows deployments to register auxiliary actors for cleanup onserve.shutdown().The LLM prefix-aware router uses this API to register its PrefixTreeActor, eliminating the Serve → LLM import dependency.
Changes
_registered_cleanup_actorslist andregister_shutdown_cleanup_actor()to controller_kill_registered_cleanup_actors()called during shutdownLimitations (stopgap solution)
serve.shutdown(), not deployment deletionChecks
scripts/format.shto lint the changes in this PR.