
[Lora] Lora quat info re-factor and support deepseekv3 mla lora #22323

Merged

Fridge003 merged 35 commits into sgl-project:main from yushengsu-thu:lora-quat-and-DeepSeek-V3-MLA on Apr 9, 2026

Conversation


@yushengsu-thu (Collaborator) commented on Apr 8, 2026

Motivation

Modifications

  • MoE quant-info refactor
    Extracts get_triton_quant_info() into each quantization method (FP8, INT8, WNA16, etc.) so FusedMoEWithLoRA correctly receives quantization scales/flags, enabling LoRA on quantized MoE models. A minimal sketch of this dispatch pattern follows this list.

  • DeepSeek-V3 MLA LoRA support
    Adds ReplicatedLinearWithLoRA to handle the fused q_a_proj + kv_a_proj_with_mqa projection unique to MLA. Since the two sub-projections have unequal output dimensions, LoRA B is split and applied via two separate sgemm calls. Includes weight fusion, hidden-dim fixes (v_head_dim for o_proj, shared-expert sizes for MoE layers), and TP exclusion for replicated layers.

  • CI: Adds a 5-GPU GB200 logprob accuracy test against DeepSeek-V3.1-Base.
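
A minimal sketch of the dispatch pattern from the first bullet, under assumed names (TritonQuantInfo, MoEMethodBase, the scale fields); the real SGLang classes and fields differ in detail.

from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class TritonQuantInfo:
    w13_weight: torch.Tensor
    w2_weight: torch.Tensor
    w13_scale: Optional[torch.Tensor] = None
    w2_scale: Optional[torch.Tensor] = None
    use_fp8: bool = False

class MoEMethodBase:
    def get_triton_quant_info(self, layer) -> "TritonQuantInfo":
        # Default: unquantized weights, no scales.
        return TritonQuantInfo(w13_weight=layer.w13_weight, w2_weight=layer.w2_weight)

class Fp8MoEMethod(MoEMethodBase):
    def get_triton_quant_info(self, layer) -> "TritonQuantInfo":
        # The FP8 method also hands its scales and flag to the LoRA MoE runner.
        return TritonQuantInfo(
            w13_weight=layer.w13_weight,
            w2_weight=layer.w2_weight,
            w13_scale=layer.w13_weight_scale,
            w2_scale=layer.w2_weight_scale,
            use_fp8=True,
        )

def lora_moe_quant_info(layer) -> "TritonQuantInfo":
    # FusedMoEWithLoRA-style call site: ask the active quant method instead of
    # hardcoding one descriptor, so FP8/INT8/WNA16 layers all report correctly.
    return layer.quant_method.get_triton_quant_info(layer)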

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

yushengsu-thu and others added 25 commits March 26, 2026 00:19
When adapter_config.json uses PEFT shorthands like "all-linear" or "all",
SGLang previously required users to explicitly specify --lora-target-modules
on the CLI. This change adds a model-scanning approach that inspects the
loaded base model to discover all LoRA-compatible linear modules automatically.

Changes:
- utils.py: add auto_detect_lora_target_modules() that walks the model graph,
  collects LinearBase/FusedMoE/ParallelLMHead module suffixes, normalizes
  them, and filters to the set supported by get_hidden_dim and init_buffers.
- lora_manager.py: in init_lora_shapes(), resolve "all-linear"/"all" via
  model scanning instead of raising ValueError when CLI target modules are
  not provided. In init_lora_modules(), guard against modules outside decoder
  layers (layer_id is None) to prevent TypeError on non-layer modules.

Made-with: Cursor
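
As a rough illustration of the scanning approach in the commit above, the sketch below walks the module tree and collects supported module-name suffixes; the suffix whitelist and the nn.Linear check are simplifications standing in for the real LinearBase/FusedMoE/ParallelLMHead handling.

import torch.nn as nn

# Suffixes assumed supported by get_hidden_dim/init_buffers; illustrative only.
SUPPORTED_SUFFIXES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "qkv_proj", "gate_proj", "up_proj", "gate_up_proj", "down_proj",
}

def auto_detect_lora_target_modules(model: nn.Module) -> set[str]:
    targets = set()
    for name, module in model.named_modules():
        # Real code checks LinearBase / FusedMoE / ParallelLMHead; nn.Linear is
        # a stand-in so this sketch runs against any torch model.
        if isinstance(module, nn.Linear):
            suffix = name.rsplit(".", 1)[-1]
            if suffix in SUPPORTED_SUFFIXES:
                targets.add(suffix)
    return targets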
…fallbacks

1. layers.py: fix RowParallelLinearWithLoRA bias handling to pass bias
   into quant_method.apply(), matching base RowParallelLinear behavior;
   add interleaved gate/up layout support in FusedMoEWithLoRA for models
   using gemm1_alpha (e.g. gpt-oss-20b)

2. mem_pool.py: zero-initialize all LoRA buffers (torch.empty ->
   torch.zeros) to prevent garbage values in unused slots

3. utils.py: fall back to config.intermediate_size when
   moe_intermediate_size is not available in get_hidden_dim (supports
   GptOss, Mixtral, OLMoE, PhiMoE, GraniteMoE, Grok, etc.); accept
   PEFT shorthand "all-linear" in get_normalized_target_modules; fix
   isinstance order in auto_detect_lora_target_modules so ParallelLMHead
   is checked before VocabParallelEmbedding

4. gpt_oss.py: add should_apply_lora() to GptOssForCausalLM for
   explicit LoRA module filtering, consistent with Qwen3VLMoe

Made-with: Cursor
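
Item 3 of the commit message above boils down to a small fallback when a config exposes only a dense intermediate size. A minimal sketch under that assumption (the helper name is illustrative):

def resolve_moe_intermediate_size(config) -> int:
    # Prefer the MoE-specific size; fall back to the dense intermediate size
    # for configs that lack it (e.g. GptOss, Mixtral, OLMoE style configs).
    size = getattr(config, "moe_intermediate_size", None)
    return size if size is not None else config.intermediate_size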
Regression test comparing SGLang LoRA logprobs against reference
training logprobs (KL threshold 1e-2). Uses 8-GPU H200 suite with
triton MoE runner and shared outer LoRA mode.

Adapter checkpoint: yushengsu/lora-diff-gpt-oss-20b

Made-with: Cursor
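
A sketch of the kind of check such a logprob regression performs: estimate the KL divergence between reference training logprobs and SGLang LoRA logprobs for the same sampled tokens and assert it stays under the 1e-2 threshold. The estimator below is an assumption about how the comparison could be aggregated, not necessarily what the test file does.

import torch

KL_THRESHOLD = 1e-2

def approx_kl(ref_logprobs: torch.Tensor, test_logprobs: torch.Tensor) -> float:
    # Low-variance estimate of KL(ref || test) from per-token logprobs of the
    # tokens sampled under the reference distribution.
    log_ratio = test_logprobs - ref_logprobs
    return (log_ratio.exp() - 1.0 - log_ratio).mean().item()

def assert_logprobs_match(ref_logprobs, test_logprobs):
    kl = approx_kl(ref_logprobs, test_logprobs)
    assert kl < KL_THRESHOLD, f"KL {kl:.4g} exceeds threshold {KL_THRESHOLD}"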
Pre-allocate MoE intermediate buffers before memory profiling so
KV cache sizing accounts for them. Reuse fixed buffers during
capture/replay instead of dynamic torch.empty() allocations.
Extract get_triton_quant_info() into FusedMoEMethodBase and each quant
method (Fp8, W8A8Fp8, W8A8Int8, BlockInt8, MoeWNA16, Unquantized) so
FusedMoEWithLoRA uses the polymorphic method instead of hardcoding
TritonMoeQuantInfo. Enables LoRA on quantized MoE models.

Made-with: Cursor
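
The buffer strategy in the first half of this commit can be sketched as follows; sizes and attribute names are assumptions, and the real code threads these buffers through the Triton LoRA MoE runner rather than a standalone class.

import torch

class MoELoRABuffers:
    """Fixed MoE intermediate buffers, allocated once before memory profiling."""

    def __init__(self, max_tokens: int, hidden: int, intermediate: int, device="cuda"):
        # Worst-case allocation so KV-cache sizing accounts for this memory.
        self.gate_up = torch.empty(max_tokens, 2 * intermediate, device=device)
        self.down_in = torch.empty(max_tokens, intermediate, device=device)
        self.out = torch.empty(max_tokens, hidden, device=device)

    def views(self, num_tokens: int):
        # During CUDA graph capture/replay, hand out slices of the same storage
        # instead of calling torch.empty() per forward, so the pointers the
        # graph recorded stay valid.
        return (
            self.gate_up[:num_tokens],
            self.down_in[:num_tokens],
            self.out[:num_tokens],
        )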
- Add ReplicatedLinearWithLoRA for fused_qkv_a_proj_with_mqa, applying
  LoRA B via two separate sgemm calls for unequal output partitions
  (q_a_proj=1536 vs kv_a_proj_with_mqa=576). B slices are precomputed
  in set_lora_info to avoid per-forward allocation.
- Add normalize_fused_qkv_a_proj to fuse q_a_proj + kv_a_proj_with_mqa
  adapter weights into a single stacked entry.
- Add stack_num parameter to run_lora_a_sgemm across all 3 backends.
- Fix o_proj hidden dim to use v_head_dim for MLA models.
- Fix gate_up_proj/down_proj hidden dim to use per-layer shared expert
  intermediate size on MoE layers.
- Exclude ReplicatedLinear from TP sharding in memory pool allocation.

Made-with: Cursor
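
A minimal sketch of the split-B application described in the commit above, using plain matmuls in place of the backend sgemm kernels; the dimensions follow the commit (q_a_proj=1536, kv_a_proj_with_mqa=576) and the function name is illustrative.

import torch

Q_A_DIM, KV_A_DIM = 1536, 576  # q_lora_rank vs kv_lora_rank + qk_rope_head_dim

def fused_qkv_a_lora_delta(x, lora_a, lora_b_q, lora_b_kv, scaling=1.0):
    # x: [num_tokens, hidden], lora_a: [hidden, r]
    # lora_b_q: [r, Q_A_DIM] and lora_b_kv: [r, KV_A_DIM] are the precomputed
    # slices of LoRA B (set_lora_info does the slicing once, not per forward).
    a_out = x @ lora_a                              # one shared "A" sgemm
    delta_q = (a_out @ lora_b_q) * scaling          # first "B" sgemm
    delta_kv = (a_out @ lora_b_kv) * scaling        # second "B" sgemm
    # Concatenate to match the fused base projection's output layout.
    return torch.cat((delta_q, delta_kv), dim=-1)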
Copilot AI review requested due to automatic review settings on April 8, 2026
@gemini-code-assist (Contributor)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors LoRA quant-info handling and adds DeepSeek V3 MLA LoRA support, while also improving LoRA behavior under CUDA Graph by introducing early MoE buffer preallocation and capture-mode execution changes.

Changes:

  • Add DeepSeek MLA fused projection LoRA support (new normalized target module, hidden-dim logic, and adapter weight normalization).
  • Rework LoRA + CUDA Graph integration (two-phase init, capture-mode gating, and shared MoE buffer preallocation).
  • Improve CLI/test defaults around LoRA and CUDA Graph (e.g., --[no-]experts-shared-outer-loras, tests no longer forcing disable_cuda_graph=True).
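
One way to expose such a paired --flag / --no-flag option in argparse is BooleanOptionalAction; this is an assumption about the mechanism (and the default value), not necessarily how server_args.py implements it.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--experts-shared-outer-loras",
    action=argparse.BooleanOptionalAction,
    default=True,  # illustrative default
    help="Share the outer LoRA weights across experts in MoE LoRA layers.",
)

args = parser.parse_args(["--no-experts-shared-outer-loras"])
print(args.experts_shared_outer_loras)  # False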

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Summary per file:

  • python/sglang/srt/lora/lora.py: Adds fused-weight normalization for DeepSeek MLA-style adapters.
  • python/sglang/srt/lora/utils.py: Extends target-module normalization and dimension inference for new fused modules and embeddings.
  • python/sglang/srt/lora/mem_pool.py: Adjusts TP sharding rules for replicated LoRA modules and tightens validation for shared-expert outer-LoRA loads.
  • python/sglang/srt/lora/layers.py: Integrates MoE LoRA runner changes, CUDA-graph adapter masks/buffers, and caches Triton quant info for MoE LoRA.
  • python/sglang/srt/lora/lora_moe_runners.py: Updates MoE runner control flow for capture mode and optional CUDA-graph buffer reuse.
  • python/sglang/srt/lora/backend/base_backend.py: Introduces backend-agnostic MoE CUDA-graph buffer preallocation used by the Triton LoRA MoE path.
  • python/sglang/srt/model_executor/model_runner.py: Adds "Phase 1" LoRA CUDA-graph init to pre-allocate MoE buffers before memory profiling/pool init.
  • python/sglang/srt/model_executor/cuda_graph_runner.py: Documents "Phase 2" LoRA CUDA-graph init for dense batch metadata.
  • python/sglang/srt/server_args.py: Switches --experts-shared-outer-loras to a boolean optional flag with --no-... support.
  • python/sglang/srt/lora/triton_ops/fused_moe_lora_kernel.py: Tweaks kernel math to align operand dtypes for tl.dot.
  • test/registered/lora/test_lora_qwen3_vl_30b_a3b_instruct_logprob_diff.py: Stops forcing disable_cuda_graph=True in the LoRA logprob regression.
  • test/registered/lora/test_lora_qwen3_8b_logprob_diff.py: Same as above.
  • test/registered/lora/test_lora_qwen3_30b_a3b_instruct_2507_logprob_diff.py: Same as above.
  • test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py: Same as above.

Comments suppressed due to low confidence (1)

python/sglang/srt/lora/layers.py:855

  • In CUDA-graph mode, adapter_enabled is always populated via index_fill_ from batch_info.weight_indices, even when batch_info.has_active_lora is False (base-only). But TritonRunnerCoreWithLoRA forces the LoRA path during capture, so this ends up enabling at least the base slot and prevents the intended “all-zeros adapter_enabled” early-exit behavior during capture. Consider skipping index_fill_ (leave adapter_enabled all zeros) when has_active_lora is False (and/or in capture mode) so capture records the kernels without doing full work for the base adapter.
            adapter_enabled = cg_buffers["adapter_enabled"]
            adapter_enabled.zero_()
            idx_buf = cg_buffers["weight_indices_long"]
            idx_buf[: batch_info.bs] = batch_info.weight_indices[: batch_info.bs]
            adapter_enabled.index_fill_(0, idx_buf[: batch_info.bs], 1)
        else:
            adapter_enabled = torch.zeros(
                len(lora_ranks), dtype=torch.int32, device=lora_ranks.device
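
A sketch of what the suggested fix could look like, reusing the names from the excerpt above (this is not the merged implementation): populate adapter_enabled only when there is an active LoRA, so graph capture records the all-zeros early-exit path.

def fill_adapter_enabled(cg_buffers, batch_info):
    adapter_enabled = cg_buffers["adapter_enabled"]
    adapter_enabled.zero_()
    if batch_info.has_active_lora:
        idx_buf = cg_buffers["weight_indices_long"]
        idx_buf[: batch_info.bs] = batch_info.weight_indices[: batch_info.bs]
        adapter_enabled.index_fill_(0, idx_buf[: batch_info.bs], 1)
    return adapter_enabled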


Comment on lines +262 to +271
kv_a_weight = (
    weights[kv_a_name]
    if kv_a_name in weights
    else torch.zeros_like(weights[q_a_name])
)

weights[fused_name] = torch.cat((weights[q_a_name], kv_a_weight), dim=0)
weights.pop(q_a_name)
if kv_a_name in weights:
    weights.pop(kv_a_name)

Copilot AI commented on Apr 8, 2026

normalize_fused_qkv_a_proj falls back to torch.zeros_like(q_a) when kv_a_proj_with_mqa weights are missing. For LoRA B this is very likely the wrong shape because q_a_proj and kv_a_proj_with_mqa have different output dims (q_lora_rank vs kv_lora_rank+qk_rope_head_dim), which can lead to silently incorrect fused weight shapes or downstream shape/assert failures. Consider requiring both weights to be present (raise a clear error) or constructing a correctly-shaped zero tensor for the missing side based on base_hf_config dims and the existing weight’s rank/input dims.

Suggested change

# Original:
kv_a_weight = (
    weights[kv_a_name]
    if kv_a_name in weights
    else torch.zeros_like(weights[q_a_name])
)
weights[fused_name] = torch.cat((weights[q_a_name], kv_a_weight), dim=0)
weights.pop(q_a_name)
if kv_a_name in weights:
    weights.pop(kv_a_name)

# Suggested:
if q_a_name not in weights or kv_a_name not in weights:
    missing_weights = []
    if q_a_name not in weights:
        missing_weights.append(q_a_name)
    if kv_a_name not in weights:
        missing_weights.append(kv_a_name)
    raise ValueError(
        "Cannot fuse LoRA qkv_a_proj weights: expected both "
        f"'{q_a_name}' and '{kv_a_name}' to be present when "
        f"building '{fused_name}', but missing {missing_weights}."
    )
weights[fused_name] = torch.cat((weights[q_a_name], weights[kv_a_name]), dim=0)
weights.pop(q_a_name)
weights.pop(kv_a_name)

Comment on lines 820 to 823
# Removed:
# Pre-compute quant info for efficiency (weights don't change during inference)
self._quant_info = TritonMoeQuantInfo(
    w13_weight=base_layer.w13_weight,
    w2_weight=base_layer.w2_weight,
    b13=getattr(base_layer, "w13_weight_bias", None),
    b2=getattr(base_layer, "w2_weight_bias", None),
)
# Added:
self._quant_info = base_layer.quant_method.get_triton_quant_info(base_layer)

def set_lora_info(

Copilot AI commented on Apr 8, 2026

FusedMoEWithLoRA caches self._quant_info = base_layer.quant_method.get_triton_quant_info(base_layer), but some MoE quant methods (e.g. ModelOptFp8MoEMethod in layers/quantization/modelopt_quant.py) construct a non-default TritonMoeQuantInfo inside apply() and do not override get_triton_quant_info(). In that case LoRA MoE will run Triton kernels with the default unquantized descriptor, which can produce incorrect outputs or crashes when weights/scales are actually FP8/NVFP4/etc. Please ensure every MoE quant method that can be used with the Triton MoE runner overrides get_triton_quant_info() to return the same descriptor it uses in apply(), or add a guard here to fail fast if get_triton_quant_info() is not implemented for the active quant method.
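
A sketch of the fail-fast guard this comment asks for, written as a standalone helper; the comparison against the base class default is an assumption about how "not overridden" could be detected, not the merged code.

def get_lora_moe_quant_info(base_layer, base_method_cls):
    # Refuse to silently fall back to the unquantized descriptor when the
    # active quant method never overrides get_triton_quant_info().
    quant_method = base_layer.quant_method
    if type(quant_method).get_triton_quant_info is base_method_cls.get_triton_quant_info:
        raise NotImplementedError(
            f"{type(quant_method).__name__} does not override get_triton_quant_info(); "
            "LoRA on this quantized MoE layer would run with default quant info."
        )
    return quant_method.get_triton_quant_info(base_layer)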

@yushengsu-thu changed the title from "[Lora] Lora quat info re-factor and support deep seek v3 mla lora" to "[Lora] Lora quat info re-factor and support deepseekv3 mla lora" on Apr 8, 2026
Comment thread on python/sglang/srt/layers/quantization/base_config.py (Outdated)
Comment thread on test/registered/lora/test_lora_deepseek_v3_base_logprob_diff.py (Outdated)
@Fridge003 (Collaborator)

/tag-and-rerun-ci

@yushengsu-thu (Collaborator, Author)

/rerun-failed-ci

@yushengsu-thu (Collaborator, Author)

stage-b-test-4-gpu-b200

[Screenshot: stage-b-test-4-gpu-b200 CI results, 2026-04-09 11:35 AM]

@Fridge003 merged commit 28ef6de into sgl-project:main on Apr 9, 2026
27 of 44 checks passed
