@cr7258 (Contributor) commented Apr 27, 2025

What this PR does / why we need it

Add a preStop hook for llamacpp and tgi in the BackendRuntime to ensure graceful termination. I didn't add a preStop hook for Ollama and SGLang. SGLang natively supports graceful termination; see the logs below:

[2025-04-26 10:13:01] SIGTERM received. signum=None frame=None. Draining requests and shutting down...
[2025-04-26 10:13:04] Gracefully exiting... remaining number of requests 3
[2025-04-26 10:13:09] Gracefully exiting... remaining number of requests 2
[2025-04-26 10:13:14] Gracefully exiting... remaining number of requests 2
2025-04-26 10:13:18,881 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-04-26 10:13:18 TP0] Decode batch. #running-req: 1, #token: 11, token usage: 0.00, gen throughput (token/s): 0.85, #queue-req: 0, 
[2025-04-26 10:13:19] Gracefully exiting... remaining number of requests 2
[2025-04-26 10:13:19 TP0] Decode batch. #running-req: 1, #token: 51, token usage: 0.00, gen throughput (token/s): 342.30, #queue-req: 0, 
[2025-04-26 10:13:19 TP0] Decode batch. #running-req: 1, #token: 91, token usage: 0.00, gen throughput (token/s): 381.66, #queue-req: 0, 
[2025-04-26 10:13:19] INFO:     127.0.0.1:50206 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-04-26 10:13:21 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-26 10:13:24] Gracefully exiting... remaining number of requests 0

In this PR, I also increase terminationGracePeriodSeconds from the default 30s to 130s. Generally, the termination grace period needs to last longer than the slowest request we expect to serve, plus any extra time spent waiting for load balancers to take the model server out of rotation. For a detailed explanation, please see here.
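The sizing rule above can be written as a small check. This is only an illustrative sketch: the function name is hypothetical, and the 120s slowest-request and 5s load-balancer figures are assumed for the example, not measurements from this PR.

```python
def is_grace_period_sufficient(grace_period_s: int,
                               slowest_request_s: int,
                               lb_drain_s: int) -> bool:
    """Return True if the grace period outlasts the slowest request plus LB drain time."""
    # The grace period must be strictly longer than the work that may still
    # be in flight once the pod starts terminating.
    return grace_period_s > slowest_request_s + lb_drain_s


# Hypothetical numbers: a 120s worst-case request and 5s for the load
# balancer to stop routing traffic to the pod.
print(is_grace_period_sufficient(130, 120, 5))  # value used in this PR
print(is_grace_period_sufficient(30, 120, 5))   # Kubernetes default
```

With these assumed figures, the 130s value passes the check and the 30s default does not.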

Which issue(s) this PR fixes

Fixes #320

Special notes for your reviewer

llamacpp:
related doc
output logs:

Terminating: Running: 1, Waiting: 0
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
Terminating: Running: 1, Waiting: 0
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
Terminating: Running: 1, Waiting: 0
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  cancel_tasks: cancel task, id_task = 0
slot      release: id  0 | task 0 | stop processing: n_past = 2810, truncated = 0
srv  update_slots: all slots are idle
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /metrics 127.0.0.1 200
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /metrics 127.0.0.1 200
Terminating: No active or waiting requests, safe to terminate
srv    operator(): operator(): cleaning up before exit...

tgi:

related doc
output logs:

Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:12.853024Z  INFO completions{total_time="4.198707127s" validation_time="168.004µs" queue_time="64.762µs" inference_time="4.198474621s" time_per_token="4.198474ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
2025-04-27T05:28:14.820936Z  INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 1008
Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:18.999206Z  INFO completions{total_time="4.178477298s" validation_time="177.684µs" queue_time="74.152µs" inference_time="4.178225622s" time_per_token="4.178225ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
Terminating: No active or waiting requests, safe to terminate
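The llamacpp and tgi logs above suggest a hook that polls the server's local metrics endpoint until no requests are running or waiting. Below is a rough Python sketch of that polling loop, not the hook added in this PR. The function names, the metric names passed in, the URL, and the 5s poll interval are all assumptions made for illustration.

```python
import re
import time
import urllib.request
from typing import Callable


def fetch_local_metrics(url: str) -> str:
    """Fetch Prometheus-format metrics text from the model server (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def read_gauge(metrics_text: str, name: str) -> float:
    """Return the first sample value for `name` in Prometheus text format, or 0 if absent."""
    pattern = re.compile(r"^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)")
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = pattern.match(line)
        if match:
            return float(match.group(1))
    return 0.0


def wait_until_idle(fetch: Callable[[], str],
                    running_metric: str,
                    waiting_metric: str,
                    poll_interval_s: float = 5.0,
                    timeout_s: float = 120.0,
                    sleep: Callable[[float], None] = time.sleep,
                    log: Callable[[str], None] = print) -> bool:
    """Poll metrics until both gauges are zero; return False if the timeout is reached."""
    elapsed = 0.0
    while elapsed <= timeout_s:
        text = fetch()
        running = int(read_gauge(text, running_metric))
        waiting = int(read_gauge(text, waiting_metric))
        if running == 0 and waiting == 0:
            log("Terminating: No active or waiting requests, safe to terminate")
            return True
        log(f"Terminating: Running: {running}, Waiting: {waiting}")
        sleep(poll_interval_s)
        elapsed += poll_interval_s
    return False
```

A caller would pass something like `lambda: fetch_local_metrics("http://127.0.0.1:8080/metrics")` together with the engine's gauge names. Both the port and the gauge names need to be checked against the actual server's /metrics output. The timeout should stay below terminationGracePeriodSeconds so the hook returns before the pod is killed.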

Does this PR introduce a user-facing change?

add preStop hook for llamacpp and tgi in the BackendRuntime

@InftyAI-Agent added the needs-triage, needs-priority, and do-not-merge/needs-kind labels Apr 27, 2025
@InftyAI-Agent requested a review from kerthcet April 27, 2025 06:07
@kerthcet (Member)

Additionally, SGLang natively supports graceful termination, see logs below.

Thanks @cr7258, I didn't know this before. I was wondering why inference engines don't support this, which they should.

@kerthcet (Member) left a comment

Only one comment.

@cr7258 requested a review from kerthcet April 27, 2025 13:32
@kerthcet (Member)

/lgtm
/approve
/kind feature

Thanks!

@InftyAI-Agent added the lgtm, approved, and feature labels and removed the do-not-merge/needs-kind label Apr 27, 2025

@kerthcet (Member)

/triage accepted

@InftyAI-Agent added the triage/accepted label and removed the needs-triage label Apr 27, 2025
@InftyAI-Agent removed the lgtm label Apr 27, 2025
@kerthcet (Member)

/lgtm

@InftyAI-Agent added the lgtm label Apr 27, 2025
@InftyAI-Agent merged commit fb95a7d into InftyAI:main Apr 27, 2025
19 checks passed

Labels

approved, feature, lgtm, needs-priority, triage/accepted

Development

Successfully merging this pull request may close these issues.

Support preStop lifecycle for backendRuntimes

4 participants