
Dual MoE CUDA graph capture for lora/nolora batches #22809

Merged
Fridge003 merged 7 commits into sgl-project:main from sshleifer:sam/dual-moe-cuda-graphs
Apr 22, 2026

Conversation

@sshleifer (Contributor) commented Apr 14, 2026

Summary

When LoRA is enabled with a triton MoE backend, capture two CUDA graphs per batch size: one with LoRA kernels recorded and one without. At replay time, batches without active adapters use the faster nolora graph, avoiding LoRA kernel overhead entirely. Controlled by --record-nolora-graph (default True), auto-disabled for non-triton MoE backends.

  • server_args.py: Add --record-nolora-graph / --no-record-nolora-graph flag
  • moe/utils.py: Add RECORD_NOLORA_GRAPH config with triton backend validation and should_record_nolora_graph() accessor
  • cuda_graph_runner.py: Dual capture loop, variant-aware graph keys (_make_graph_key, _resolve_lora_variant), replay routing
  • lora_moe_runners.py: Skip LoRA kernels during nolora graph capture via get_capture_lora_variant()
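
The shape of the change, as a minimal sketch (hypothetical helper names such as `runner.run_once`, `runner.graphs`, and `batch.has_active_adapters`; the real logic in cuda_graph_runner.py is more involved):

```python
import torch

# Hypothetical module-level capture state; the PR exposes it via
# get_capture_lora_variant() for lora_moe_runners.py to consult.
_capture_lora_variant = None

def get_capture_lora_variant():
    return _capture_lora_variant

def _make_graph_key(bs, lora_variant):
    # Variant-aware key: each batch size maps to two graphs.
    return (bs, lora_variant)

def capture_graphs(runner, batch_sizes, record_nolora_graph=True):
    global _capture_lora_variant
    variants = ("lora", "nolora") if record_nolora_graph else ("lora",)
    for bs in batch_sizes:
        for variant in variants:
            _capture_lora_variant = variant
            g = torch.cuda.CUDAGraph()
            with torch.cuda.graph(g):
                runner.run_once(bs)  # MoE runner skips LoRA kernels when variant == "nolora"
            runner.graphs[_make_graph_key(bs, variant)] = g
            _capture_lora_variant = None

def replay(runner, batch):
    # Batches with no active adapters take the faster nolora graph.
    variant = "lora" if batch.has_active_adapters else "nolora"
    runner.graphs[_make_graph_key(batch.size, variant)].replay()
```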

Bench results (Qwen3-30B-A3B, tp=4, bs=512 decode)

Comparison against main (single CUDA graph) running the same bench harness with lora and nolora requests:

| | Main LoRA | Main NoLoRA | Branch NoLoRA (dual graph) |
|---|---|---|---|
| decode bs=512 throughput | 17,549 tok/s | 31,445 tok/s | 40,740 tok/s |
| single-request TPOT | 6.7 ms | 6.0 ms | 4.3 ms |
| single-request latency | 2.04 s | 1.99 s | 1.41 s |
| prefill 4K TTFT | 0.63 s | 0.61 s | 0.58 s |

Main's nolora path replays the same single graph as lora (LoRA kernels are recorded, but adapter_enabled=0 makes them early-exit). The branch's nolora path replays a separate graph with no LoRA kernels recorded at all.
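
Concretely, the difference at replay time looks roughly like this (placeholder names, not the actual runner code):

```python
def replay_on_main(graphs, adapter_enabled_buf, batch):
    # main: one graph per batch size; the captured adapter_enabled buffer is
    # updated in place, so the recorded LoRA kernels either apply deltas or
    # early-exit, but their launches always replay.
    adapter_enabled_buf.copy_(batch.adapter_mask)
    graphs[batch.size].replay()

def replay_on_branch(graphs, adapter_enabled_buf, batch):
    # this PR: adapter-free batches replay a graph that contains no LoRA
    # kernels at all, so even the early-exit launches disappear.
    if batch.adapter_mask.any():
        adapter_enabled_buf.copy_(batch.adapter_mask)
        graphs[(batch.size, "lora")].replay()
    else:
        graphs[(batch.size, "nolora")].replay()
```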

Decode throughput: 40,740 tok/s vs 31,445 tok/s, i.e. 1.3x over main nolora and 2.3x over main lora (17,549 tok/s).

Correctness validation

  • MMLU accuracy: 82.1% (nolora) vs 81.9% (lora) — within noise
  • Prefill logprobs: mean_diff=0.18, max_diff=2.06
  • Decode logprobs (CUDA graphed): mean_diff=0.65, max_diff=2.14
  • Mixed batch: 8 concurrent lora + 8 nolora requests produce correct distinct results
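
For reference, the logprob deltas above can be computed with a comparison along these lines (a sketch, not the harness used here):

```python
import torch

def logprob_diff(ref: torch.Tensor, test: torch.Tensor) -> tuple[float, float]:
    # Per-token absolute difference between two runs' logprobs.
    d = (ref - test).abs()
    return d.mean().item(), d.max().item()
```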


@github-actions bot added the lora label Apr 14, 2026

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1099f03814


Comment on lines 446 to +449:

```diff
     if get_is_capture_mode():
-        # During CUDA graph capture, always enter the LoRA path so that
-        # the LoRA kernels are recorded in the graph. adapter_enabled is
-        # all-zeros during capture, so the Triton kernel early-exits per
-        # program (zero overhead). During replay the tensor is updated
-        # in-place with the real adapter mask before graph.replay().
-        has_active_lora = True
+        from sglang.srt.model_executor.cuda_graph_runner import get_capture_lora_variant
+
+        # Record LoRA kernels for the lora graph; skip them for the nolora graph.
+        has_active_lora = get_capture_lora_variant() != "nolora"
```

P2: Bypass LoRA alignment when capturing the nolora variant

The new nolora capture guard only short-circuits _add_lora_*_delta, but the LoRA hook setup still runs beforehand and computes routing alignment (build_lora_hooks() -> _compute_lora_alignment()), which launches LoRA-specific kernels during graph capture/replay. In adapter-free batches this means the nolora graph still contains LoRA routing work, so the intended speedup is only partial. Consider gating hook construction (or alignment computation) on the capture variant, not just the delta injection functions.
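
A sketch of the suggested gating (hypothetical wiring; build_lora_hooks and _compute_lora_alignment are the functions named above):

```python
def maybe_build_lora_hooks(layer):
    # Suggested fix: gate hook construction on the capture variant so the
    # nolora graph records no LoRA routing/alignment work either.
    from sglang.srt.model_executor.cuda_graph_runner import get_capture_lora_variant

    if get_capture_lora_variant() == "nolora":
        return None  # no hooks -> no _compute_lora_alignment() in the graph
    return build_lora_hooks(layer)  # existing path; computes routing alignment
```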


@sshleifer force-pushed the sam/dual-moe-cuda-graphs branch from 1099f03 to ff32965 on April 14, 2026 17:33
@sshleifer (Contributor, Author)

/tag-and-rerun-ci

@sshleifer force-pushed the sam/dual-moe-cuda-graphs branch from ff32965 to e9f319e on April 14, 2026 17:59
@yushengsu-thu self-assigned this Apr 14, 2026
When enable_lora=True and record_nolora_graph is set, capture each batch
size twice: once with LoRA hooks and once without. This avoids performance
penalties from LoRA hooks on non-LoRA requests.

Also extracts _default_make_graph_key as a module-level function so
CudaGraphRunner.capture() works when called cross-class from
EAGLEDraftCudaGraphRunner (which doesn't inherit from CudaGraphRunner).
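
The cross-class point amounts to keeping the key helper at module level (a sketch using the class names from the commit message; the attribute wiring is assumed):

```python
def _default_make_graph_key(bs: int, lora_variant: str):
    # Module-level so any runner class can use it, whether or not it
    # inherits from CudaGraphRunner.
    return (bs, lora_variant)

class CudaGraphRunner:
    make_graph_key = staticmethod(_default_make_graph_key)

class EAGLEDraftCudaGraphRunner:
    # Separate hierarchy (does not inherit from CudaGraphRunner), but shares
    # the same graph-key convention via the module-level helper.
    make_graph_key = staticmethod(_default_make_graph_key)
```
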
@sshleifer force-pushed the sam/dual-moe-cuda-graphs branch from e9f319e to f91d9d6 on April 14, 2026 20:43
@yushengsu-thu (Collaborator)

/tag-and-rerun-ci

3 similar comments

@yushengsu-thu (Collaborator)

/rerun-failed-ci

3 similar comments

@yushengsu-thu enabled auto-merge (squash) April 20, 2026 06:31
@yushengsu-thu (Collaborator)

/rerun-failed-ci

8 similar comments

@Fridge003 disabled auto-merge April 22, 2026 21:10
@Fridge003 merged commit b9e33d6 into sgl-project:main Apr 22, 2026
480 of 575 checks passed
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026