[Lora] LoRA quant info refactor and support DeepSeek-V3 MLA LoRA #22323
Fridge003 merged 35 commits into sgl-project:main
Conversation
When adapter_config.json uses PEFT shorthands like "all-linear" or "all", SGLang previously required users to explicitly specify --lora-target-modules on the CLI. This change adds a model-scanning approach that inspects the loaded base model to discover all LoRA-compatible linear modules automatically (see the sketch below).

Changes:
- utils.py: add auto_detect_lora_target_modules() that walks the model graph, collects LinearBase/FusedMoE/ParallelLMHead module suffixes, normalizes them, and filters to the set supported by get_hidden_dim and init_buffers.
- lora_manager.py: in init_lora_shapes(), resolve "all-linear"/"all" via model scanning instead of raising ValueError when CLI target modules are not provided. In init_lora_modules(), guard against modules outside decoder layers (layer_id is None) to prevent TypeError on non-layer modules.

Made-with: Cursor
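A minimal sketch of the scanning idea. The helper below uses plain nn.Linear and an illustrative allow-list; the real implementation checks SGLang's LinearBase/FusedMoE/ParallelLMHead types and filters to the set accepted by get_hidden_dim/init_buffers:

```python
import torch.nn as nn

# Illustrative allow-list standing in for the suffixes that
# get_hidden_dim/init_buffers actually support.
SUPPORTED_SUFFIXES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
}

def auto_detect_lora_target_modules(model: nn.Module) -> set[str]:
    targets = set()
    for name, module in model.named_modules():
        # The real code checks LinearBase/FusedMoE/ParallelLMHead here.
        if isinstance(module, nn.Linear):
            # "layers.0.self_attn.q_proj" -> "q_proj"
            suffix = name.rsplit(".", 1)[-1]
            if suffix in SUPPORTED_SUFFIXES:
                targets.add(suffix)
    return targets
```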
…fallbacks
1. layers.py: fix RowParallelLinearWithLoRA bias handling to pass bias into quant_method.apply(), matching base RowParallelLinear behavior; add interleaved gate/up layout support in FusedMoEWithLoRA for models using gemm1_alpha (e.g. gpt-oss-20b).
2. mem_pool.py: zero-initialize all LoRA buffers (torch.empty -> torch.zeros) to prevent garbage values in unused slots.
3. utils.py: fall back to config.intermediate_size when moe_intermediate_size is not available in get_hidden_dim (supports GptOss, Mixtral, OLMoE, PhiMoE, GraniteMoE, Grok, etc.; see the fallback sketch below); accept PEFT shorthand "all-linear" in get_normalized_target_modules; fix isinstance order in auto_detect_lora_target_modules so ParallelLMHead is checked before VocabParallelEmbedding.
4. gpt_oss.py: add should_apply_lora() to GptOssForCausalLM for explicit LoRA module filtering, consistent with Qwen3VLMoe.

Made-with: Cursor
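The intermediate-size fallback in item 3 amounts to a short getattr chain; a rough sketch, with config attribute names following the HF conventions mentioned above:

```python
def moe_intermediate_size(config) -> int:
    # Prefer the MoE-specific field; configs like GptOss or Mixtral only
    # define the dense intermediate_size, so fall back to that.
    size = getattr(config, "moe_intermediate_size", None)
    return size if size is not None else config.intermediate_size
```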
Regression test comparing SGLang LoRA logprobs against reference training logprobs (KL threshold 1e-2). Uses 8-GPU H200 suite with triton MoE runner and shared outer LoRA mode. Adapter checkpoint: yushengsu/lora-diff-gpt-oss-20b Made-with: Cursor
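A hedged sketch of what a KL-thresholded logprob comparison can look like, assuming both sides expose per-token logprobs for the same token ids; the actual test harness and estimator may differ:

```python
import numpy as np

KL_THRESHOLD = 1e-2  # threshold quoted in the commit message above

def approx_kl(ref_logprobs: np.ndarray, test_logprobs: np.ndarray) -> float:
    # Monte-Carlo estimate of KL(ref || test): treating the observed tokens
    # as samples from the reference, E_ref[log p_ref - log p_test] is
    # estimated by the mean logprob gap.
    return float(np.mean(ref_logprobs - test_logprobs))

# Example: two nearly identical logprob traces pass the check.
ref = np.array([-1.20, -0.35, -2.10])
test = np.array([-1.21, -0.36, -2.08])
assert approx_kl(ref, test) < KL_THRESHOLD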
Pre-allocate MoE intermediate buffers before memory profiling so KV cache sizing accounts for them. Reuse fixed buffers during capture/replay instead of dynamic torch.empty() allocations.
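A rough sketch of the fixed-buffer pattern, assuming illustrative names and shapes (not SGLang's actual API): allocate once before memory profiling so KV-cache sizing sees the buffers, then hand out views during capture/replay.

```python
import torch

class MoeLoraBuffers:
    """Scratch buffers allocated once, before memory profiling."""

    def __init__(self, max_tokens: int, topk: int, hidden: int, device: str = "cuda"):
        self.gateup_out = torch.zeros(max_tokens * topk, hidden, device=device)
        self.down_out = torch.zeros(max_tokens * topk, hidden, device=device)

    def slice(self, num_tokens: int, topk: int):
        # CUDA Graph replay requires stable storage: views into the fixed
        # buffers are fine, fresh torch.empty() allocations are not.
        n = num_tokens * topk
        return self.gateup_out[:n], self.down_out[:n]
```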
Extract get_triton_quant_info() into FusedMoEMethodBase and each quant method (Fp8, W8A8Fp8, W8A8Int8, BlockInt8, MoeWNA16, Unquantized) so FusedMoEWithLoRA uses the polymorphic method instead of hardcoding TritonMoeQuantInfo. Enables LoRA on quantized MoE models. Made-with: Cursor
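A minimal sketch of the polymorphic pattern this describes; the field set and class names are simplified stand-ins for the real classes under python/sglang/srt/layers/quantization/:

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class TritonMoeQuantInfo:
    w13_weight: torch.Tensor
    w2_weight: torch.Tensor
    w13_scale: Optional[torch.Tensor] = None
    w2_scale: Optional[torch.Tensor] = None
    use_fp8: bool = False

class FusedMoEMethodBase:
    def get_triton_quant_info(self, layer) -> TritonMoeQuantInfo:
        # Default: unquantized weights, no scales.
        return TritonMoeQuantInfo(layer.w13_weight, layer.w2_weight)

class Fp8MoEMethod(FusedMoEMethodBase):
    def get_triton_quant_info(self, layer) -> TritonMoeQuantInfo:
        # FP8 override: attach the scales that apply() already uses, so the
        # LoRA MoE path sees the same descriptor as the base path.
        return TritonMoeQuantInfo(
            layer.w13_weight, layer.w2_weight,
            w13_scale=layer.w13_weight_scale,
            w2_scale=layer.w2_weight_scale,
            use_fp8=True,
        )
```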
- Add ReplicatedLinearWithLoRA for fused_qkv_a_proj_with_mqa, applying LoRA B via two separate sgemm calls for unequal output partitions (q_a_proj=1536 vs kv_a_proj_with_mqa=576); see the sketch after this list. B slices are precomputed in set_lora_info to avoid per-forward allocation.
- Add normalize_fused_qkv_a_proj to fuse q_a_proj + kv_a_proj_with_mqa adapter weights into a single stacked entry.
- Add a stack_num parameter to run_lora_a_sgemm across all 3 backends.
- Fix o_proj hidden dim to use v_head_dim for MLA models.
- Fix gate_up_proj/down_proj hidden dim to use the per-layer shared expert intermediate size on MoE layers.
- Exclude ReplicatedLinear from TP sharding in memory pool allocation.

Made-with: Cursor
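A rough sketch of the split LoRA B application for unequal output partitions. Tensor names and the plain matmuls are illustrative; the real path goes through the backend's sgemm kernels, with the B slices precomputed in set_lora_info:

```python
import torch

def apply_lora_b_split(
    lora_a_out: torch.Tensor,  # (num_tokens, rank): activations after LoRA A
    b_q: torch.Tensor,         # (q_out_dim, rank), e.g. q_a_proj -> 1536
    b_kv: torch.Tensor,        # (kv_out_dim, rank), e.g. kv_a_proj_with_mqa -> 576
    base_out: torch.Tensor,    # (num_tokens, q_out_dim + kv_out_dim)
    scaling: float,
) -> torch.Tensor:
    # Two separate matmuls because the fused output is partitioned into
    # sub-projections of different widths.
    q_dim = b_q.shape[0]
    base_out[:, :q_dim] += scaling * (lora_a_out @ b_q.T)
    base_out[:, q_dim:] += scaling * (lora_a_out @ b_kv.T)
    return base_out
```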
Pull request overview
This PR refactors LoRA quant-info handling and adds DeepSeek V3 MLA LoRA support, while also improving LoRA behavior under CUDA Graph by introducing early MoE buffer preallocation and capture-mode execution changes.
Changes:
- Add DeepSeek MLA fused projection LoRA support (new normalized target module, hidden-dim logic, and adapter weight normalization).
- Rework LoRA + CUDA Graph integration (two-phase init, capture-mode gating, and shared MoE buffer preallocation).
- Improve CLI/test defaults around LoRA and CUDA Graph (e.g., --[no-]experts-shared-outer-loras, tests no longer forcing disable_cuda_graph=True); see the flag sketch below.
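One way to get the paired `--flag`/`--no-flag` behavior is argparse's built-in BooleanOptionalAction (Python 3.9+); a minimal sketch, noting that sglang's server_args may implement the flag differently:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--experts-shared-outer-loras",
    # Automatically generates the matching --no-experts-shared-outer-loras flag.
    action=argparse.BooleanOptionalAction,
    default=True,
)
args = parser.parse_args(["--no-experts-shared-outer-loras"])
assert args.experts_shared_outer_loras is False
```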
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/sglang/srt/lora/lora.py | Adds fused-weight normalization for DeepSeek MLA-style adapters. |
| python/sglang/srt/lora/utils.py | Extends target-module normalization and dimension inference for new fused modules and embeddings. |
| python/sglang/srt/lora/mem_pool.py | Adjusts TP sharding rules for replicated LoRA modules and tightens validation for shared-expert outer-LoRA loads. |
| python/sglang/srt/lora/layers.py | Integrates MoE LoRA runner changes, CUDA-graph adapter masks/buffers, and caches Triton quant info for MoE LoRA. |
| python/sglang/srt/lora/lora_moe_runners.py | Updates MoE runner control flow for capture mode and optional CUDA-graph buffer reuse. |
| python/sglang/srt/lora/backend/base_backend.py | Introduces backend-agnostic MoE CUDA-graph buffer preallocation used by the Triton LoRA MoE path. |
| python/sglang/srt/model_executor/model_runner.py | Adds “Phase 1” LoRA CUDA-graph init to pre-allocate MoE buffers before memory profiling/pool init. |
| python/sglang/srt/model_executor/cuda_graph_runner.py | Documents “Phase 2” LoRA CUDA-graph init for dense batch metadata. |
| python/sglang/srt/server_args.py | Switches --experts-shared-outer-loras to a boolean optional flag with --no-... support. |
| python/sglang/srt/lora/triton_ops/fused_moe_lora_kernel.py | Tweaks kernel math to align operand dtypes for tl.dot. |
| test/registered/lora/test_lora_qwen3_vl_30b_a3b_instruct_logprob_diff.py | Stops forcing disable_cuda_graph=True in the LoRA logprob regression. |
| test/registered/lora/test_lora_qwen3_8b_logprob_diff.py | Same as above. |
| test/registered/lora/test_lora_qwen3_30b_a3b_instruct_2507_logprob_diff.py | Same as above. |
| test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py | Same as above. |
Comments suppressed due to low confidence (1)
python/sglang/srt/lora/layers.py:855
- In CUDA-graph mode, adapter_enabled is always populated via index_fill_ from batch_info.weight_indices, even when batch_info.has_active_lora is False (base-only). But TritonRunnerCoreWithLoRA forces the LoRA path during capture, so this ends up enabling at least the base slot and prevents the intended “all-zeros adapter_enabled” early-exit behavior during capture. Consider skipping index_fill_ (leave adapter_enabled all zeros) when has_active_lora is False (and/or in capture mode) so capture records the kernels without doing full work for the base adapter.
```python
    adapter_enabled = cg_buffers["adapter_enabled"]
    adapter_enabled.zero_()
    idx_buf = cg_buffers["weight_indices_long"]
    idx_buf[: batch_info.bs] = batch_info.weight_indices[: batch_info.bs]
    adapter_enabled.index_fill_(0, idx_buf[: batch_info.bs], 1)
else:
    adapter_enabled = torch.zeros(
        len(lora_ranks), dtype=torch.int32, device=lora_ranks.device
    )
```
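A minimal sketch of the suggested gating, reusing the names from the snippet above (not self-contained; it replaces the unconditional index_fill_ in the CUDA-graph branch):

```python
adapter_enabled = cg_buffers["adapter_enabled"]
adapter_enabled.zero_()
if batch_info.has_active_lora:
    # Base-only batches (has_active_lora == False) keep adapter_enabled
    # all zeros, preserving the kernel's early-exit behavior during capture.
    idx_buf = cg_buffers["weight_indices_long"]
    idx_buf[: batch_info.bs] = batch_info.weight_indices[: batch_info.bs]
    adapter_enabled.index_fill_(0, idx_buf[: batch_info.bs], 1)
```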
```python
kv_a_weight = (
    weights[kv_a_name]
    if kv_a_name in weights
    else torch.zeros_like(weights[q_a_name])
)

weights[fused_name] = torch.cat((weights[q_a_name], kv_a_weight), dim=0)
weights.pop(q_a_name)
if kv_a_name in weights:
    weights.pop(kv_a_name)
```
normalize_fused_qkv_a_proj falls back to torch.zeros_like(q_a) when kv_a_proj_with_mqa weights are missing. For LoRA B this is very likely the wrong shape because q_a_proj and kv_a_proj_with_mqa have different output dims (q_lora_rank vs kv_lora_rank+qk_rope_head_dim), which can lead to silently incorrect fused weight shapes or downstream shape/assert failures. Consider requiring both weights to be present (raise a clear error) or constructing a correctly-shaped zero tensor for the missing side based on base_hf_config dims and the existing weight’s rank/input dims.
Current:

```python
kv_a_weight = (
    weights[kv_a_name]
    if kv_a_name in weights
    else torch.zeros_like(weights[q_a_name])
)
weights[fused_name] = torch.cat((weights[q_a_name], kv_a_weight), dim=0)
weights.pop(q_a_name)
if kv_a_name in weights:
    weights.pop(kv_a_name)
```

Suggested change:

```python
if q_a_name not in weights or kv_a_name not in weights:
    missing_weights = []
    if q_a_name not in weights:
        missing_weights.append(q_a_name)
    if kv_a_name not in weights:
        missing_weights.append(kv_a_name)
    raise ValueError(
        "Cannot fuse LoRA qkv_a_proj weights: expected both "
        f"'{q_a_name}' and '{kv_a_name}' to be present when "
        f"building '{fused_name}', but missing {missing_weights}."
    )
weights[fused_name] = torch.cat((weights[q_a_name], weights[kv_a_name]), dim=0)
weights.pop(q_a_name)
weights.pop(kv_a_name)
```
```diff
         # Pre-compute quant info for efficiency (weights don't change during inference)
-        self._quant_info = TritonMoeQuantInfo(
-            w13_weight=base_layer.w13_weight,
-            w2_weight=base_layer.w2_weight,
-            b13=getattr(base_layer, "w13_weight_bias", None),
-            b2=getattr(base_layer, "w2_weight_bias", None),
-        )
+        self._quant_info = base_layer.quant_method.get_triton_quant_info(base_layer)

     def set_lora_info(
```
FusedMoEWithLoRA caches self._quant_info = base_layer.quant_method.get_triton_quant_info(base_layer), but some MoE quant methods (e.g. ModelOptFp8MoEMethod in layers/quantization/modelopt_quant.py) construct a non-default TritonMoeQuantInfo inside apply() and do not override get_triton_quant_info(). In that case LoRA MoE will run Triton kernels with the default unquantized descriptor, which can produce incorrect outputs or crashes when weights/scales are actually FP8/NVFP4/etc. Please ensure every MoE quant method that can be used with the Triton MoE runner overrides get_triton_quant_info() to return the same descriptor it uses in apply(), or add a guard here to fail fast if get_triton_quant_info() is not implemented for the active quant method.
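A hedged sketch of the suggested fail-fast guard, meant to sit where the cached descriptor is built. Names follow the snippet above; the exact base-class contract may differ in the real code:

```python
quant_method = base_layer.quant_method
quant_cls = type(quant_method)
# Detect a non-overridden method by identity against the base implementation.
if quant_cls.get_triton_quant_info is FusedMoEMethodBase.get_triton_quant_info:
    # Refuse to run LoRA MoE with the default (unquantized) descriptor on a
    # layer whose apply() actually uses quantized weights/scales.
    raise NotImplementedError(
        f"{quant_cls.__name__} must override get_triton_quant_info() "
        "to be used with the Triton LoRA MoE path."
    )
self._quant_info = quant_method.get_triton_quant_info(base_layer)
```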
/tag-and-rerun-ci |
/rerun-failed-ci |

Motivation
Modifications
MoE quant-info refactor
Extracts get_triton_quant_info() into each quantization method (FP8, INT8, WNA16, etc.) so FusedMoEWithLoRA correctly receives quantization scales/flags, enabling LoRA on quantized MoE models.
DeepSeek-V3 MLA LoRA support
Adds ReplicatedLinearWithLoRA to handle the fused q_a_proj + kv_a_proj_with_mqa projection unique to MLA. Since the two sub-projections have unequal output dimensions, LoRA B is split and applied via two separate sgemm calls. Includes weight fusion, hidden-dim fixes (v_head_dim for o_proj, shared-expert sizes for MoE layers), and TP exclusion for replicated layers.
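A rough sketch of the v_head_dim fix for o_proj, with attribute names borrowed from common MLA configs; the real logic lives in get_hidden_dim in python/sglang/srt/lora/utils.py:

```python
def o_proj_input_dim(config) -> int:
    # MLA attention: o_proj consumes num_heads * v_head_dim, which generally
    # differs from a hidden_size // num_heads-style head dim.
    v_head_dim = getattr(config, "v_head_dim", None)
    if v_head_dim is not None:
        return config.num_attention_heads * v_head_dim
    # Non-MLA models: o_proj's input width matches the hidden size.
    return config.hidden_size
```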
CI: Adds a 5-GPU GB200 logprob accuracy test against DeepSeek-V3.1-Base.
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci