
[Feature] Multi lora optimization - resolve scheduler blocking issue and save Non-Lora inference performance#14795

Merged
Fridge003 merged 7 commits into sgl-project:main from ConnorLi96:feature/fix-multi-lora-blocking
Dec 13, 2025

Conversation

@ConnorLi96 (Contributor) commented Dec 10, 2025

Motivation
When LoRA and non-LoRA requests are mixed in the same batch, non-LoRA traffic gets severely blocked when LoRA adapter cache misses occur. This is because:

  1. The base model (uid=None) was always evicted first from the LoRA memory pool, forcing non-LoRA requests to wait for adapter reloading
  2. The scheduler stopped processing all subsequent requests when it encountered a LoRA request that exceeded the slot limit, even if non-LoRA requests could still be scheduled

Issue code:

```python
if self.enable_lora and not self.tp_worker.can_run_lora_batch(
    lora_set
    | set([req.lora_id for req in adder.can_run_list])
    | set([req.lora_id])
):
    self.running_batch.batch_is_full = True
    break
```

Modifications

  1. Pin the base-model slot in the LoRA memory pool (`mem_pool.py`)
     - Exclude `None` (base model) from eviction candidates
     - Ensures non-LoRA requests always have a slot available (all non-LoRA requests share this single slot)
  2. Skip problematic LoRA requests in the scheduler (`scheduler.py`)
     - When a LoRA request would exceed the slot limit, use `continue` instead of `break`
     - Allows non-LoRA requests to continue scheduling instead of blocking the entire queue
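The two changes above can be sketched with simplified stand-ins (illustrative names and data structures, not the actual sglang `mem_pool.py`/`scheduler.py` code):

```python
# Illustrative sketch of the two fixes; the names and data structures are
# simplified stand-ins, not the real sglang scheduler/memory-pool code.

def pick_eviction_victim(resident_uids):
    """Fix 1 (mem_pool.py): choose an adapter to evict, but never the
    base model (uid=None), which stays pinned in its slot."""
    for uid in resident_uids:
        if uid is not None:
            return uid
    return None  # only the pinned base slot remains; nothing evictable


def schedule(waiting_uids, max_slots):
    """Fix 2 (scheduler.py): a LoRA request that would exceed the slot
    limit is skipped with `continue` instead of aborting the loop with
    `break`, so requests behind it (non-LoRA included) still run."""
    active = {None}  # base-model slot is always resident
    scheduled, skipped = [], []
    for uid in waiting_uids:
        if len(active | {uid}) > max_slots:
            skipped.append(uid)
            continue  # was `break` before this PR: stalled the whole queue
        active.add(uid)
        scheduled.append(uid)
    return scheduled, skipped


# With 3 slots, the over-limit adapter "d" is skipped, but the non-LoRA
# request (uid=None) queued behind it is still scheduled.
print(schedule([None, "a", "b", "d", None], max_slots=3))
```

With the old `break` behavior, the trailing non-LoRA request in this example would never have been scheduled once adapter "d" hit the slot limit.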

Accuracy Tests

Benchmarking and Profiling

Test Setup

Baseline Group (No LoRA):
Benchmark 1 (B1) and Benchmark 2 (B2) use the same command:

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30001 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 2 \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --disable-ignore-eos \
  --disable-tqdm
```

Run B1 and B2 at the same time.

Mixed Traffic Group (16 adapters):
Benchmark 3 (B3) command

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30001 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 2 \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --disable-ignore-eos \
  --disable-tqdm
```

Benchmark 4 (B4) command

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30001 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 2 \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --disable-ignore-eos \
  --disable-tqdm \
  --lora-name \
    adapter1 \
    adapter2 \
    adapter3 \
    adapter4 \
    adapter5 \
    adapter6 \
    adapter7 \
    adapter8 \
    adapter9 \
    adapter10 \
    adapter11 \
    adapter12 \
    adapter13 \
    adapter14 \
    adapter15 \
    adapter16
```

Run B3 and B4 at the same time. The only difference between B3 and B4 is that B4 attaches one of 16 LoRA adapters to each request.

Before this PR

Non-LoRA Traffic benchmark

| Metric | Baseline (B1) | With mixed traffic (B3) | Impact |
| --- | --- | --- | --- |
| Non-LoRA TTFT | 29 ms | 12,717 ms | 438x slower 🔴 |
| Non-LoRA E2E | 4,829 ms | 19,469 ms | 4x slower 🔴 |

LoRA Traffic Benchmark (16 Adapters)

- Mean TTFT: 8,666 ms
- Mean E2E: 15,157 ms
- Max ITL: 953 ms

After this PR

Non-LoRA Traffic benchmark

| Metric | Baseline (B1) | With mixed traffic (B3) | Impact |
| --- | --- | --- | --- |
| Non-LoRA TTFT | 29 ms | 73 ms | 2.5x slower |
| Non-LoRA E2E | 4,829 ms | 7,472 ms | 1.5x slower |

LoRA Traffic Benchmark (16 Adapters)

- Mean TTFT: 3,802 ms (median: 306 ms) - 2.3x faster
- Mean E2E: 10,897 ms - 1.4x faster
- Max ITL: 327 ms - 2.9x faster
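The impact factors quoted above follow directly from the raw latencies; a quick arithmetic check (values copied from the tables):

```python
# Arithmetic check of the impact factors quoted in the benchmark tables.
baseline_ttft_ms, before_ttft_ms, after_ttft_ms = 29, 12_717, 73
baseline_e2e_ms, before_e2e_ms, after_e2e_ms = 4_829, 19_469, 7_472

print(f"TTFT slowdown before PR: {before_ttft_ms / baseline_ttft_ms:.1f}x")
print(f"TTFT slowdown after PR:  {after_ttft_ms / baseline_ttft_ms:.1f}x")
print(f"E2E slowdown before PR:  {before_e2e_ms / baseline_e2e_ms:.1f}x")
print(f"E2E slowdown after PR:   {after_e2e_ms / baseline_e2e_ms:.1f}x")
```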

Checklist


@ConnorLi96 ConnorLi96 changed the title Feature/fix multi lora blocking Feature/fix multi lora scheduler blocking issue Dec 10, 2025
@ConnorLi96 ConnorLi96 changed the title Feature/fix multi lora scheduler blocking issue Feature/fix multi lora scheduler blocking issue and PIN None LoRA traffic Dec 10, 2025
@ConnorLi96 ConnorLi96 changed the title Feature/fix multi lora scheduler blocking issue and PIN None LoRA traffic Feature/Fix multi lora scheduler blocking issue and PIN None LoRA traffic Dec 10, 2025
@Fridge003 (Collaborator)

/tag-and-rerun-ci

@Fridge003 (Collaborator)

/rerun-failed-ci

@ConnorLi96 ConnorLi96 changed the title Feature/Fix multi lora scheduler blocking issue and PIN None LoRA traffic Feature/Fix multi lora scheduler blocking issue and evict LoRA None lastly Dec 12, 2025
@Fridge003 (Collaborator)

All lora-related tests passed

@Fridge003 Fridge003 merged commit 9b9d213 into sgl-project:main Dec 13, 2025
120 of 132 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
alisonshao added a commit that referenced this pull request Dec 16, 2025
Followup to #14795 which changed eviction behavior to protect the base
model (None) from being evicted first. This ensures non-LoRA requests
always have a slot available.

Update test expectations:
- test_lru_base_model_evicted_last: expect lora1 (LRU adapter) instead of None
- test_fifo_base_model_evicted_last: expect lora1 (first inserted) instead of None

Rename tests to reflect the intended behavior: base model is evicted
last, not first.
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
@ConnorLi96 ConnorLi96 changed the title Feature/Fix multi lora scheduler blocking issue and evict LoRA None lastly [Feature] Multi lora optimization - resolve scheduler blocking issue and save Non-Lora inference performance Jan 8, 2026
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026