
[Feature] Multi lora optimization - resolve scheduler blocking issue and save Non-Lora inference performance#14795

Merged
Fridge003 merged 7 commits into sgl-project:main from ConnorLi96:feature/fix-multi-lora-blocking
Dec 13, 2025

Conversation

@ConnorLi96 (Contributor) commented Dec 10, 2025

Motivation
When LoRA and non-LoRA requests are mixed in the same batch, non-LoRA traffic gets severely blocked when LoRA adapter cache misses occur. This is because:

  1. The base model (uid=None) was always evicted first from the LoRA memory pool, forcing non-LoRA requests to wait for adapter reloading
  2. The scheduler stopped processing all subsequent requests when it encountered a LoRA request that exceeded the slot limit, even if non-LoRA requests could still be scheduled

Issue code:

```python
if self.enable_lora and not self.tp_worker.can_run_lora_batch(
    lora_set
    | set([req.lora_id for req in adder.can_run_list])
    | set([req.lora_id])
):
    self.running_batch.batch_is_full = True
    break
```

Modifications

  1. Pin the base-model slot in the LoRA memory pool (`mem_pool.py`)
     - Exclude `None` (base model) from eviction candidates
     - Ensures non-LoRA requests always have a slot available (all non-LoRA requests share this single slot)
  2. Skip problematic LoRA requests in the scheduler (`scheduler.py`)
     - When a LoRA request would exceed the slot limit, use `continue` instead of `break`
     - Allows non-LoRA requests to continue scheduling instead of blocking the entire queue
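The two changes above can be sketched with simplified stand-ins (illustrative names and data structures, not the actual sglang `mem_pool.py`/`scheduler.py` code):

```python
# Illustrative sketch of the two fixes; the names and data structures are
# simplified stand-ins, not the real sglang scheduler/memory-pool code.

def pick_eviction_victim(resident_uids):
    """Fix 1 (mem_pool.py): choose an adapter to evict, but never the
    base model (uid=None), which stays pinned in its slot."""
    for uid in resident_uids:
        if uid is not None:
            return uid
    return None  # only the pinned base slot remains; nothing evictable


def schedule(waiting_uids, max_slots):
    """Fix 2 (scheduler.py): a LoRA request that would exceed the slot
    limit is skipped with `continue` instead of aborting the loop with
    `break`, so requests behind it (non-LoRA included) still run."""
    active = {None}  # base-model slot is always resident
    scheduled, skipped = [], []
    for uid in waiting_uids:
        if len(active | {uid}) > max_slots:
            skipped.append(uid)
            continue  # was `break` before this PR: stalled the whole queue
        active.add(uid)
        scheduled.append(uid)
    return scheduled, skipped


# With 3 slots, the over-limit adapter "d" is skipped, but the non-LoRA
# request (uid=None) queued behind it is still scheduled.
print(schedule([None, "a", "b", "d", None], max_slots=3))
```

With the old `break` behavior, the trailing non-LoRA request in this example would never have been scheduled once adapter "d" hit the slot limit.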

Accuracy Tests

Benchmarking and Profiling

Test Setup

Baseline Group (No LoRA):
Benchmark 1 (B1) and Benchmark 2 (B2) use the same command:

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30001 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 2 \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --disable-ignore-eos \
  --disable-tqdm
```

Run B1 and B2 at the same time.

Mixed Traffic Group (16 adapters):
Benchmark 3 (B3) command

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30001 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 2 \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --disable-ignore-eos \
  --disable-tqdm
```

Benchmark 4 (B4) command

```shell
python3 -m sglang.bench_serving \
  --backend sglang \
  --base-url http://localhost:30001 \
  --dataset-name random \
  --num-prompts 100 \
  --request-rate 2 \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --disable-ignore-eos \
  --disable-tqdm \
  --lora-name \
    adapter1 \
    adapter2 \
    adapter3 \
    adapter4 \
    adapter5 \
    adapter6 \
    adapter7 \
    adapter8 \
    adapter9 \
    adapter10 \
    adapter11 \
    adapter12 \
    adapter13 \
    adapter14 \
    adapter15 \
    adapter16
```

Run B3 and B4 at the same time. The only difference between B3 and B4 is that B4 attaches one of 16 LoRA adapters to each request.

Before this PR

Non-LoRA Traffic benchmark

| Metric | Baseline (B1) | With mixed traffic (B3) | Impact |
| --- | --- | --- | --- |
| Non-LoRA TTFT | 29 ms | 12,717 ms | 438x slower 🔴 |
| Non-LoRA E2E | 4,829 ms | 19,469 ms | 4x slower 🔴 |

LoRA Traffic Benchmark (16 Adapters)

- Mean TTFT: 8,666 ms
- Mean E2E: 15,157 ms
- Max ITL: 953 ms

After this PR

Non-LoRA Traffic benchmark

| Metric | Baseline (B1) | With mixed traffic (B3) | Impact |
| --- | --- | --- | --- |
| Non-LoRA TTFT | 29 ms | 73 ms | 2.5x slower |
| Non-LoRA E2E | 4,829 ms | 7,472 ms | 1.5x slower |

LoRA Traffic Benchmark (16 Adapters)

- Mean TTFT: 3,802 ms (median: 306 ms) - 2.3x faster
- Mean E2E: 10,897 ms - 1.4x faster
- Max ITL: 327 ms - 2.9x faster
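The impact factors quoted above follow directly from the raw latencies; a quick arithmetic check (values copied from the tables):

```python
# Arithmetic check of the impact factors quoted in the benchmark tables.
baseline_ttft_ms, before_ttft_ms, after_ttft_ms = 29, 12_717, 73
baseline_e2e_ms, before_e2e_ms, after_e2e_ms = 4_829, 19_469, 7_472

print(f"TTFT slowdown before PR: {before_ttft_ms / baseline_ttft_ms:.1f}x")
print(f"TTFT slowdown after PR:  {after_ttft_ms / baseline_ttft_ms:.1f}x")
print(f"E2E slowdown before PR:  {before_e2e_ms / baseline_e2e_ms:.1f}x")
print(f"E2E slowdown after PR:   {after_e2e_ms / baseline_e2e_ms:.1f}x")
```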

Checklist


@ConnorLi96 ConnorLi96 changed the title Feature/fix multi lora blocking Feature/fix multi lora scheduler blocking issue Dec 10, 2025
@ConnorLi96 ConnorLi96 changed the title Feature/fix multi lora scheduler blocking issue Feature/fix multi lora scheduler blocking issue and PIN None LoRA traffic Dec 10, 2025
@ConnorLi96 ConnorLi96 changed the title Feature/fix multi lora scheduler blocking issue and PIN None LoRA traffic Feature/Fix multi lora scheduler blocking issue and PIN None LoRA traffic Dec 10, 2025
@Fridge003 (Collaborator)

/tag-and-rerun-ci

@Fridge003 (Collaborator)

/rerun-failed-ci

@ConnorLi96 ConnorLi96 changed the title Feature/Fix multi lora scheduler blocking issue and PIN None LoRA traffic Feature/Fix multi lora scheduler blocking issue and evict LoRA None lastly Dec 12, 2025
@Fridge003 (Collaborator)

All lora-related tests passed

@Fridge003 Fridge003 merged commit 9b9d213 into sgl-project:main Dec 13, 2025
120 of 132 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
alisonshao added a commit that referenced this pull request Dec 16, 2025
Followup to #14795 which changed eviction behavior to protect the base
model (None) from being evicted first. This ensures non-LoRA requests
always have a slot available.

Update test expectations:
- test_lru_base_model_evicted_last: expect lora1 (LRU adapter) instead of None
- test_fifo_base_model_evicted_last: expect lora1 (first inserted) instead of None

Rename tests to reflect the intended behavior: base model is evicted
last, not first.
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
@ConnorLi96 ConnorLi96 changed the title Feature/Fix multi lora scheduler blocking issue and evict LoRA None lastly [Feature] Multi lora optimization - resolve scheduler blocking issue and save Non-Lora inference performance Jan 8, 2026
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026