
Dual MoE CUDA graph capture for lora/nolora batches #22809

Merged
Fridge003 merged 7 commits into sgl-project:main from sshleifer:sam/dual-moe-cuda-graphs
Apr 22, 2026

Conversation

@sshleifer (Contributor) commented Apr 14, 2026

Summary

When LoRA is enabled with a triton MoE backend, capture two CUDA graphs per batch size: one with LoRA kernels recorded and one without. At replay time, batches without active adapters use the faster nolora graph, avoiding LoRA kernel overhead entirely. Controlled by --record-nolora-graph (default True), auto-disabled for non-triton MoE backends.

  • server_args.py: Add --record-nolora-graph / --no-record-nolora-graph flag
  • moe/utils.py: Add RECORD_NOLORA_GRAPH config with triton backend validation and should_record_nolora_graph() accessor
  • cuda_graph_runner.py: Dual capture loop, variant-aware graph keys (_make_graph_key, _resolve_lora_variant), replay routing
  • lora_moe_runners.py: Skip LoRA kernels during nolora graph capture via get_capture_lora_variant()
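
The shape of the change, as a minimal sketch (hypothetical helper names such as `runner.run_once`, `runner.graphs`, and `batch.has_active_adapters`; the real logic in cuda_graph_runner.py is more involved):

```python
import torch

# Hypothetical module-level capture state; the PR exposes it via
# get_capture_lora_variant() for lora_moe_runners.py to consult.
_capture_lora_variant = None

def get_capture_lora_variant():
    return _capture_lora_variant

def _make_graph_key(bs, lora_variant):
    # Variant-aware key: each batch size maps to two graphs.
    return (bs, lora_variant)

def capture_graphs(runner, batch_sizes, record_nolora_graph=True):
    global _capture_lora_variant
    variants = ("lora", "nolora") if record_nolora_graph else ("lora",)
    for bs in batch_sizes:
        for variant in variants:
            _capture_lora_variant = variant
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                runner.run_once(bs)  # MoE runner skips LoRA kernels when variant == "nolora"
            runner.graphs[_make_graph_key(bs, variant)] = g
            _capture_lora_variant = None

def replay(runner, batch):
    # Batches with no active adapters take the faster nolora graph.
    variant = "lora" if batch.has_active_adapters else "nolora"
    runner.graphs[_make_graph_key(batch.size, variant)].replay()
```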

Bench results (Qwen3-30B-A3B, tp=4, bs=512 decode)

Comparison against main (single CUDA graph) running the same bench harness with lora and nolora requests:

| | Main LoRA | Main NoLoRA | Branch NoLoRA (dual graph) |
|---|---|---|---|
| decode bs=512 throughput | 17,549 tok/s | 31,445 tok/s | 40,740 tok/s |
| single-request TPOT | 6.7 ms | 6.0 ms | 4.3 ms |
| single-request latency | 2.04 s | 1.99 s | 1.41 s |
| prefill 4K TTFT | 0.63 s | 0.61 s | 0.58 s |

Main's nolora path replays the same single graph as lora (LoRA kernels are recorded, but adapter_enabled=0 makes them early-exit). The branch's nolora path replays a separate graph with no LoRA kernels recorded at all.
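
Concretely, the difference at replay time looks roughly like this (placeholder names, not the actual runner code):

```python
def replay_on_main(graphs, adapter_enabled_buf, batch):
    # main: one graph per batch size; the captured adapter_enabled buffer is
    # updated in place, so the recorded LoRA kernels either apply deltas or
    # early-exit, but their launches always replay.
    adapter_enabled_buf.copy_(batch.adapter_mask)
    graphs[batch.size].replay()

def replay_on_branch(graphs, adapter_enabled_buf, batch):
    # this PR: adapter-free batches replay a graph that contains no LoRA
    # kernels at all, so even the early-exit launches disappear.
    if batch.adapter_mask.any():
        adapter_enabled_buf.copy_(batch.adapter_mask)
        graphs[(batch.size, "lora")].replay()
    else:
        graphs[(batch.size, "nolora")].replay()
```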

Decode throughput: 40,740 tok/s vs 31,445 tok/s, i.e. 1.3x over main nolora and 2.3x over main lora (17,549 tok/s).

Correctness validation

  • MMLU accuracy: 82.1% (nolora) vs 81.9% (lora) — within noise
  • Prefill logprobs: mean_diff=0.18, max_diff=2.06
  • Decode logprobs (CUDA graphed): mean_diff=0.65, max_diff=2.14
  • Mixed batch: 8 concurrent lora + 8 nolora requests produce correct distinct results
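
For reference, the logprob deltas above can be computed with a comparison along these lines (a sketch, not the harness used here):

```python
import torch

def logprob_diff(ref: torch.Tensor, test: torch.Tensor) -> tuple[float, float]:
    # Per-token absolute difference between two runs' logprobs.
    d = (ref - test).abs()
    return d.mean().item(), d.max().item()
```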


@github-actions bot added the lora label Apr 14, 2026

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1099f03814


Comment on lines 446 to +449:

```diff
     if get_is_capture_mode():
-        # During CUDA graph capture, always enter the LoRA path so that
-        # the LoRA kernels are recorded in the graph. adapter_enabled is
-        # all-zeros during capture, so the Triton kernel early-exits per
-        # program (zero overhead). During replay the tensor is updated
-        # in-place with the real adapter mask before graph.replay().
-        has_active_lora = True
+        from sglang.srt.model_executor.cuda_graph_runner import get_capture_lora_variant
+
+        # Record LoRA kernels for the lora graph; skip them for the nolora graph.
+        has_active_lora = get_capture_lora_variant() != "nolora"
```

P2: Bypass LoRA alignment when capturing the nolora variant

The new nolora capture guard only short-circuits _add_lora_*_delta, but the LoRA hook setup still runs beforehand and computes routing alignment (build_lora_hooks() -> _compute_lora_alignment()), which launches LoRA-specific kernels during graph capture/replay. In adapter-free batches this means the nolora graph still contains LoRA routing work, so the intended speedup is only partial. Consider gating hook construction (or alignment computation) on the capture variant, not just the delta injection functions.
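
A sketch of the suggested gating (hypothetical wiring; build_lora_hooks and _compute_lora_alignment are the functions named above):

```python
def maybe_build_lora_hooks(layer):
    # Suggested fix: gate hook construction on the capture variant so the
    # nolora graph records no LoRA routing/alignment work either.
    from sglang.srt.model_executor.cuda_graph_runner import get_capture_lora_variant

    if get_capture_lora_variant() == "nolora":
        return None  # no hooks -> no _compute_lora_alignment() in the graph
    return build_lora_hooks(layer)  # existing path; computes routing alignment
```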


@sshleifer force-pushed the sam/dual-moe-cuda-graphs branch from 1099f03 to ff32965 on April 14, 2026 17:33
@sshleifer (Contributor, Author)

/tag-and-rerun-ci

@sshleifer force-pushed the sam/dual-moe-cuda-graphs branch from ff32965 to e9f319e on April 14, 2026 17:59
@yushengsu-thu self-assigned this Apr 14, 2026
When enable_lora=True and record_nolora_graph is set, capture each batch
size twice: once with LoRA hooks and once without. This avoids performance
penalties from LoRA hooks on non-LoRA requests.

Also extracts _default_make_graph_key as a module-level function so
CudaGraphRunner.capture() works when called cross-class from
EAGLEDraftCudaGraphRunner (which doesn't inherit from CudaGraphRunner).
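
The cross-class point amounts to keeping the key helper at module level (a sketch using the class names from the commit message; the attribute wiring is assumed):

```python
def _default_make_graph_key(bs: int, lora_variant: str):
    # Module-level so any runner class can use it, whether or not it
    # inherits from CudaGraphRunner.
    return (bs, lora_variant)

class CudaGraphRunner:
    make_graph_key = staticmethod(_default_make_graph_key)

class EAGLEDraftCudaGraphRunner:
    # Separate hierarchy (does not inherit from CudaGraphRunner), but shares
    # the same graph-key convention via the module-level helper.
    make_graph_key = staticmethod(_default_make_graph_key)
```
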
@sshleifer force-pushed the sam/dual-moe-cuda-graphs branch from e9f319e to f91d9d6 on April 14, 2026 20:43
@yushengsu-thu (Collaborator)

/tag-and-rerun-ci

3 similar comments

@yushengsu-thu (Collaborator)

/rerun-failed-ci

3 similar comments

@yushengsu-thu enabled auto-merge (squash) April 20, 2026 06:31
@yushengsu-thu (Collaborator)

/rerun-failed-ci

8 similar comments

@Fridge003 disabled auto-merge April 22, 2026 21:10
@Fridge003 merged commit b9e33d6 into sgl-project:main Apr 22, 2026
480 of 575 checks passed
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026