feat: add preStop hook for llamacpp and tgi in the BackendRuntime #381
Conversation
kerthcet left a comment:
Thanks @cr7258, I didn't know this before. I was wondering why inference engines don't support this, which they should.
Only one comment.
/lgtm Thanks!
/triage accepted

/lgtm
What this PR does / why we need it
Add a preStop hook for llamacpp and tgi in the BackendRuntime to ensure graceful termination. I didn't add a preStop hook for Ollama and SGLang, because:
In this PR, I also increase the `terminationGracePeriodSeconds` from the default `30s` to `130s`. Generally, the termination grace period needs to last longer than the slowest request we expect to serve, plus any extra time spent waiting for load balancers to take the model server out of rotation. For a detailed explanation, please see here.
Which issue(s) this PR fixes
Fixes #320
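For illustration, a preStop hook combined with an extended grace period looks roughly like the following in a Pod template. This is a minimal sketch using standard Kubernetes fields, not the actual BackendRuntime manifest from this PR; the image name and the hook command are assumptions.

```yaml
# Sketch only: hypothetical image and command, not this PR's exact config.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  # Must exceed the slowest expected request plus any
  # load-balancer drain time (raised from 30 to 130 here).
  terminationGracePeriodSeconds: 130
  containers:
  - name: server
    image: example/llamacpp:latest   # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Hypothetical delay so the endpoint is removed from
          # service rotation before the container receives SIGTERM.
          command: ["sh", "-c", "sleep 5"]
```

The kubelet runs the preStop hook before sending SIGTERM, and the whole sequence must finish within `terminationGracePeriodSeconds`, so the hook's duration counts against that budget.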
Special notes for your reviewer
llamacpp
related doc
output logs:
tgi:
related doc
output logs:
```
Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:12.853024Z INFO completions{total_time="4.198707127s" validation_time="168.004µs" queue_time="64.762µs" inference_time="4.198474621s" time_per_token="4.198474ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
2025-04-27T05:28:14.820936Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 1008
Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:18.999206Z INFO completions{total_time="4.178477298s" validation_time="177.684µs" queue_time="74.152µs" inference_time="4.178225622s" time_per_token="4.178225ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
Terminating: No active or waiting requests, safe to terminate
```
Does this PR introduce a user-facing change?