[Feature] Multi-LoRA optimization: resolve scheduler blocking issue and preserve non-LoRA inference performance #14795
Merged: Fridge003 merged 7 commits into sgl-project:main on Dec 13, 2025
Conversation
Fridge003 approved these changes on Dec 10, 2025.
Collaborator: /tag-and-rerun-ci
Collaborator: /rerun-failed-ci
Collaborator: All LoRA-related tests passed.
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request on Dec 13, 2025: a merge of 121 commits from sgl-project:main, including this PR ("Feature/Fix multi lora scheduler blocking issue and evict LoRA None lastly", sgl-project#14795), with conflicts resolved in python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py.
This was referenced Dec 15, 2025
alisonshao added a commit that referenced this pull request on Dec 16, 2025:

Followup to #14795, which changed eviction behavior to protect the base model (None) from being evicted first, ensuring non-LoRA requests always have a slot available. Updated test expectations:
- test_lru_base_model_evicted_last: expect lora1 (the LRU adapter) instead of None
- test_fifo_base_model_evicted_last: expect lora1 (the first inserted) instead of None

Renamed the tests to reflect the intended behavior: the base model is evicted last, not first.
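The protected-eviction behavior those tests check can be sketched with a toy LRU pool (a minimal illustration only; the `LoRAPool` class, its methods, and the adapter names are hypothetical, not the actual sglang test code):

```python
from collections import OrderedDict

class LoRAPool:
    """Toy LRU pool where the base model (None) is evicted last."""

    def __init__(self, max_slots):
        self.max_slots = max_slots
        self.slots = OrderedDict()  # key -> True, ordered by recency

    def use(self, uid):
        """Load uid into a slot; return the evicted key, if any."""
        if uid in self.slots:
            self.slots.move_to_end(uid)  # refresh recency
            return None
        evicted = None
        if len(self.slots) >= self.max_slots:
            # Prefer evicting the least recently used *adapter*;
            # fall back to None only if it is the sole occupant.
            candidates = [k for k in self.slots if k is not None] or [None]
            evicted = candidates[0]
            del self.slots[evicted]
        self.slots[uid] = True
        return evicted

pool = LoRAPool(max_slots=2)
pool.use(None)              # base model loads
pool.use("lora1")           # adapter loads; pool is now full
victim = pool.use("lora2")  # must evict lora1, not the base model
print(victim)  # -> lora1
```

This matches the updated expectation: with the base model protected, the LRU adapter (lora1) is the eviction victim, not None.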
Prozac614
pushed a commit
to Prozac614/sglang
that referenced
this pull request
Dec 17, 2025
YChange01
pushed a commit
to YChange01/sglang
that referenced
this pull request
Jan 13, 2026
Motivation
When LoRA and non-LoRA requests are mixed in the same batch, non-LoRA traffic is severely blocked whenever a LoRA adapter cache miss occurs. The cause is the scheduling logic here:

sglang/python/sglang/srt/managers/scheduler.py, lines 1792 to 1798 in 7c6fb3a
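The referenced scheduler lines are not reproduced here, but the blocking pattern can be sketched as a simplified scheduling loop (names and structure are hypothetical, not the actual sglang code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Req:
    rid: str
    lora_id: Optional[str]  # None means a base-model (non-LoRA) request

def schedule_old(waiting_queue, loaded, max_loras):
    """Pre-PR sketch: when a request would exceed the LoRA slot limit,
    the loop `break`s, so every later request in the queue -- even
    non-LoRA ones that need no new adapter -- is blocked this step."""
    scheduled = []
    for req in waiting_queue:
        needed = loaded | {req.lora_id}
        if len(needed) > max_loras:
            break  # head-of-line blocking: nothing after this runs
        loaded = needed
        scheduled.append(req.rid)
    return scheduled

queue = [Req("a", "lora1"), Req("b", "lora2"), Req("c", None)]
# With a single adapter slot, "b" misses, and the `break` also
# blocks the non-LoRA request "c" behind it.
print(schedule_old(queue, set(), max_loras=1))  # -> ['a']
```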
Modifications
- Exclude None (the base model) from eviction candidates. This ensures non-LoRA requests always have a slot available (all non-LoRA requests share this single slot).
- When a LoRA request would exceed the slot limit, use `continue` instead of `break`, allowing non-LoRA requests to keep being scheduled instead of blocking the entire queue.
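The two modifications can be illustrated with a hypothetical simplified scheduling loop (names and structure are illustrative, not the actual sglang code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Req:
    rid: str
    lora_id: Optional[str]  # None means a base-model (non-LoRA) request

def schedule_new(waiting_queue, loaded, max_loras):
    """Post-PR sketch: `continue` skips only the request that cannot
    fit, and the base model (None) is excluded from the slot budget,
    so non-LoRA requests always have a slot available."""
    scheduled = []
    for req in waiting_queue:
        # Count only real adapters against the budget; None is exempt.
        adapters = {a for a in (loaded | {req.lora_id}) if a is not None}
        if len(adapters) > max_loras:
            continue  # skip this request; keep scheduling the rest
        loaded = loaded | {req.lora_id}
        scheduled.append(req.rid)
    return scheduled

queue = [Req("a", "lora1"), Req("b", "lora2"), Req("c", None)]
# "b" still cannot fit in the single adapter slot, but the non-LoRA
# request "c" is no longer blocked behind it.
print(schedule_new(queue, set(), max_loras=1))  # -> ['a', 'c']
```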
Accuracy Tests
Benchmarking and Profiling
Test Setup
Baseline Group (No LoRA):
Benchmark 1 (B1) and Benchmark 2 (B2) command
Run B1 and B2 at the same time
Mixed Traffic Group (16 adapters):
Benchmark 3 (B3) command
Benchmark 4 (B4) command
Run B3 and B4 at the same time. The only difference between B3 and B4 is that B4 uses 16 LoRA adapters to build its requests.
Before this PR

Non-LoRA Traffic Benchmark

LoRA Traffic Benchmark (16 Adapters):
- Mean TTFT: 8,666 ms
- Mean E2E: 15,157 ms
- Max ITL: 953 ms

After this PR

Non-LoRA Traffic Benchmark

LoRA Traffic Benchmark (16 Adapters):
- Mean TTFT: 3,802 ms (median: 306 ms), 2.3x faster
- Mean E2E: 10,897 ms, 1.4x faster
- Max ITL: 327 ms, 2.9x faster
Checklist