
[Feature] add LoRADrainer to address high P99 TTFT #17913

Merged
Fridge003 merged 13 commits into sgl-project:main from glenliu21:lora_high_ttft
May 2, 2026

Conversation

@glenliu21
Contributor

Motivation

Currently, our LoRA implementation suffers from extremely high P99 TTFT. For instance, running the commands below on an A100-SXM4-80GB:

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --max-loaded-loras 6 \
    --max-loras-per-batch 3 \
    --lora-paths \
        adapter0=faridlazuarda/valadapt-llama-3.1-8B-it-chinese \
        adapter1=LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching \
        adapter2=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
        adapter3=pbevan11/llama-3.1-8b-ocr-correction \
        adapter4=reissbaker/llama-3.1-8b-abliterated-lora \
        adapter5=Roblox/Llama-3.1-8B-Instruct-RobloxGuard-1.0
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30000 \
  --dataset-name random \
  --num-prompts 200 \
  --request-rate 4 \
  --random-input-len 512 \
  --random-output-len 512 \
  --lora-name \
    adapter0 \
    adapter1 \
    adapter2 \
    adapter3 \
    adapter4 \
    adapter5

gives us the following results:

----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   12060.28
Median E2E Latency (ms):                 7051.01
P90 E2E Latency (ms):                    32927.73
P99 E2E Latency (ms):                    45870.02
---------------Time to First Token----------------
Mean TTFT (ms):                          7910.15
Median TTFT (ms):                        83.30
P99 TTFT (ms):                           39550.49

That means 1% of requests wait almost 40 seconds before their first token is scheduled, compared to a median of 83 ms.
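
To make the failure mode concrete, here is a toy simulation (illustrative only, not sglang code) of a greedy slot policy. With six adapters competing for three slots, a scheduler that always keeps currently loaded adapters while they still have pending work never frees a slot, so the other three adapters wait indefinitely:

```python
from collections import deque

def greedy_schedule(queues, loaded, slots=3):
    """Keep loaded adapters that have pending work; fill spare slots FIFO."""
    keep = [a for a in loaded if queues[a]]
    spare = slots - len(keep)
    waiting = [a for a in queues if a not in keep and queues[a]]
    return keep + waiting[:spare]

queues = {f"adapter{i}": deque() for i in range(6)}
loaded = ["adapter0", "adapter1", "adapter2"]
first_served = {}
for step in range(100):
    for a in queues:           # every adapter receives one request per tick
        queues[a].append(step)
    loaded = greedy_schedule(queues, loaded)
    for a in loaded:
        queues[a].popleft()    # serve one request per loaded adapter
        first_served.setdefault(a, step)

# adapters 0-2 are served from tick 0; adapters 3-5 never get a slot
```

Under a steady request stream the hot adapters always have queued work, so the cold ones starve; this is the pathology the P99 numbers above reflect.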

Modifications

  • Introduce a LoRADrainer class that forces hot adapters to start draining so that starved cold adapters can be scheduled
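
The idea above can be sketched as follows. This is a hypothetical reconstruction: the class name matches the PR, but the method names, threshold, and drain-selection policy are illustrative assumptions, not the actual implementation.

```python
class LoRADrainer:
    """Sketch of a starvation guard for LoRA adapter slots (illustrative)."""

    def __init__(self, max_loras_per_batch: int, starvation_threshold_s: float = 5.0):
        self.max_loras_per_batch = max_loras_per_batch
        self.starvation_threshold_s = starvation_threshold_s
        self.draining = set()     # hot adapters that must stop admitting requests
        self.first_wait = {}      # adapter -> time its oldest request began waiting

    def observe_waiting(self, adapter, now):
        # Record when a request for an unloaded adapter started waiting.
        self.first_wait.setdefault(adapter, now)

    def observe_scheduled(self, adapter):
        # Adapter got a slot: it is no longer starved or draining.
        self.first_wait.pop(adapter, None)
        self.draining.discard(adapter)

    def step(self, active_adapters, now):
        """Return the adapters that must drain (finish in-flight work, admit nothing new)."""
        starved = {a for a, t0 in self.first_wait.items()
                   if a not in active_adapters
                   and now - t0 > self.starvation_threshold_s}
        if not starved:
            self.draining.clear()  # starvation resolved; stop draining
            return set()
        # Drain just enough hot adapters to free slots for the starved ones.
        need = min(len(starved), self.max_loras_per_batch)
        for hot in sorted(active_adapters):
            if len(self.draining) >= need:
                break
            self.draining.add(hot)
        return set(self.draining)
```

A scheduler would call `step()` each iteration and skip admitting new requests for any adapter in the returned set, so its slot frees up once in-flight requests complete.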

Accuracy Tests

  • Add unit tests in test_lora_drainer.py

Benchmarking and Profiling

| Metric | main | PR | % Change |
|---|---|---|---|
| Mean E2E Latency (ms) | 12060.28 | 8502.00 | -29.5% |
| Median E2E Latency (ms) | 7051.01 | 8069.63 | |
| P90 E2E Latency (ms) | 32927.73 | 15974.35 | -51.5% |
| P99 E2E Latency (ms) | 45870.02 | 19969.72 | -56.5% |
| Mean TTFT (ms) | 7910.15 | 4279.28 | -45.9% |
| Median TTFT (ms) | 83.30 | 3728.36 | |
| P99 TTFT (ms) | 39550.49 | 12081.23 | -69.5% |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the lora label Jan 29, 2026
@glenliu21
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a LoRADrainer to address high P99 TTFT for LoRA requests by preventing adapter starvation. The implementation is well-structured, introducing a new LoRADrainer class with a draining mechanism that seems sound. The integration into the Scheduler is clean, and the addition of unit tests for the new functionality is a great practice. I have one minor suggestion to improve code robustness in the scheduler.

Comment thread on python/sglang/srt/managers/scheduler.py (outdated)
@Fridge003
Collaborator

From the benchmark, it seems that this draining strategy will harm median latency/TTFT.
Can we control this feature with a server argument, so we can turn it off when a better median metric is needed?

@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Feb 1, 2026
@glenliu21
Contributor Author

From the benchmark, it seems that this draining strategy will harm median latency/TTFT. Can we control this feature with a server argument, so we can turn it off when a better median metric is needed?

Added; it is turned off by default.

@yushengsu-thu yushengsu-thu self-assigned this Apr 16, 2026
@glenliu21
Contributor Author

/tag-run-ci-label

@glenliu21
Contributor Author

/rerun-failed-ci

@glenliu21
Contributor Author

/rerun-failed-ci

@glenliu21
Contributor Author

glenliu21 commented Apr 20, 2026

/rerun-failed-ci again

@glenliu21 glenliu21 requested a review from wisclmy0611 as a code owner April 21, 2026 12:10
@glenliu21
Contributor Author

glenliu21 commented Apr 29, 2026

/rerun-failed-ci again

@yushengsu-thu yushengsu-thu enabled auto-merge (squash) April 29, 2026 07:04
@glenliu21
Contributor Author

glenliu21 commented Apr 30, 2026

/rerun-failed-ci again

auto-merge was automatically disabled May 1, 2026 23:39

Head branch was pushed to by a user without write access

@glenliu21 glenliu21 requested a review from JustinTong0323 as a code owner May 1, 2026 23:39
@yushengsu-thu yushengsu-thu enabled auto-merge (squash) May 2, 2026 00:39
@glenliu21
Contributor Author

glenliu21 commented May 2, 2026

/rerun-failed-ci again

1 similar comment
@glenliu21
Contributor Author

glenliu21 commented May 2, 2026

/rerun-failed-ci again

@yushengsu-thu
Collaborator

@Fridge003 I think it's good to merge now

@Fridge003 Fridge003 disabled auto-merge May 2, 2026 23:13
@Fridge003 Fridge003 merged commit 76b9c8d into sgl-project:main May 2, 2026
628 of 692 checks passed
@glenliu21 glenliu21 deleted the lora_high_ttft branch May 2, 2026 23:22

Labels

documentation Improvements or additions to documentation lora run-ci


3 participants