Dual MoE CUDA graph capture for lora/nolora batches #22809
Fridge003 merged 7 commits into sgl-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1099f03814
```diff
 if get_is_capture_mode():
-    # During CUDA graph capture, always enter the LoRA path so that
-    # the LoRA kernels are recorded in the graph. adapter_enabled is
-    # all-zeros during capture, so the Triton kernel early-exits per
-    # program (zero overhead). During replay the tensor is updated
-    # in-place with the real adapter mask before graph.replay().
-    has_active_lora = True
+    from sglang.srt.model_executor.cuda_graph_runner import get_capture_lora_variant
+    # Record LoRA kernels for lora graph; skip for nolora graph.
+    has_active_lora = get_capture_lora_variant() != "nolora"
```
Bypass LoRA alignment when capturing the nolora variant
The new nolora capture guard only short-circuits _add_lora_*_delta, but the LoRA hook setup still runs beforehand and computes routing alignment (build_lora_hooks() -> _compute_lora_alignment()), which launches LoRA-specific kernels during graph capture/replay. In adapter-free batches this means the nolora graph still contains LoRA routing work, so the intended speedup is only partial. Consider gating hook construction (or alignment computation) on the capture variant, not just the delta injection functions.
Force-pushed from 1099f03 to ff32965
/tag-and-rerun-ci
Force-pushed from ff32965 to e9f319e
When enable_lora=True and record_nolora_graph is set, capture each batch size twice: once with LoRA hooks and once without. This avoids performance penalties from LoRA hooks on non-LoRA requests. Also extracts _default_make_graph_key as a module-level function so CudaGraphRunner.capture() works when called cross-class from EAGLEDraftCudaGraphRunner (which doesn't inherit from CudaGraphRunner).
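The dual-capture scheme described above can be sketched as keying each captured graph by (batch size, variant). Only the names `_default_make_graph_key`, `CudaGraphRunner`, and `EAGLEDraftCudaGraphRunner` come from the PR; the loop bodies and placeholder graph objects are illustrative assumptions.

```python
# Rough sketch: capture each batch size twice when nolora recording
# is enabled, storing graphs under (batch_size, lora_variant) keys.

def _default_make_graph_key(batch_size, lora_variant):
    # Module-level (not a method) so it can be reused cross-class,
    # e.g. by a runner that does not inherit from CudaGraphRunner.
    return (batch_size, lora_variant)


def capture_all(batch_sizes, record_nolora_graph):
    graphs = {}
    variants = ("lora", "nolora") if record_nolora_graph else ("lora",)
    for bs in batch_sizes:
        for variant in variants:
            # Placeholder for the real capture; in practice this would
            # record a CUDA graph with/without LoRA kernels.
            graphs[_default_make_graph_key(bs, variant)] = f"graph[{bs},{variant}]"
    return graphs
```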
Force-pushed from e9f319e to f91d9d6
/tag-and-rerun-ci

/rerun-failed-ci
Summary

When LoRA is enabled with a triton MoE backend, capture two sets of CUDA graphs per batch size: one with LoRA kernels recorded and one without. At replay time, batches without active adapters use the faster nolora graph, avoiding LoRA kernel overhead entirely. Controlled by `--record-nolora-graph` (default True), auto-disabled for non-triton MoE backends.

- `--record-nolora-graph` / `--no-record-nolora-graph` flag
- `RECORD_NOLORA_GRAPH` config with triton backend validation and `should_record_nolora_graph()` accessor
- Graph key helpers (`_make_graph_key`, `_resolve_lora_variant`), replay routing
- `get_capture_lora_variant()`

Bench results (Qwen3-30B-A3B, tp=4, bs=512 decode)
Comparison against main (single CUDA graph) running the same bench harness with lora and nolora requests:
Main's nolora uses the same single graph as lora (LoRA kernels recorded but adapter_enabled=0 so they early-exit). Branch nolora uses a separate graph with no LoRA kernels recorded at all.
Decode throughput: 40,740 vs. 31,445, i.e. 1.3x over main's nolora path and 2.3x over main's lora path.
Correctness validation