[Lora] LoRA quant info refactor and support DeepSeek-V3 MLA LoRA #22323
Fridge003 merged 35 commits into sgl-project:main
Conversation
When adapter_config.json uses PEFT shorthands like "all-linear" or "all", SGLang previously required users to explicitly specify --lora-target-modules on the CLI. This change adds a model-scanning approach that inspects the loaded base model to discover all LoRA-compatible linear modules automatically (see the sketch below).

Changes:
- utils.py: add auto_detect_lora_target_modules() that walks the model graph, collects LinearBase/FusedMoE/ParallelLMHead module suffixes, normalizes them, and filters to the set supported by get_hidden_dim and init_buffers.
- lora_manager.py: in init_lora_shapes(), resolve "all-linear"/"all" via model scanning instead of raising ValueError when CLI target modules are not provided. In init_lora_modules(), guard against modules outside decoder layers (layer_id is None) to prevent TypeError on non-layer modules.

Made-with: Cursor
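A minimal sketch of the scanning idea. The helper below uses plain nn.Linear and an illustrative allow-list; the real implementation checks SGLang's LinearBase/FusedMoE/ParallelLMHead types and filters to the set accepted by get_hidden_dim/init_buffers:

```python
import torch.nn as nn

# Illustrative allow-list standing in for the suffixes that
# get_hidden_dim/init_buffers actually support.
SUPPORTED_SUFFIXES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
}

def auto_detect_lora_target_modules(model: nn.Module) -> set[str]:
    targets = set()
    for name, module in model.named_modules():
        # The real code checks LinearBase/FusedMoE/ParallelLMHead here.
        if isinstance(module, nn.Linear):
            # "layers.0.self_attn.q_proj" -> "q_proj"
            suffix = name.rsplit(".", 1)[-1]
            if suffix in SUPPORTED_SUFFIXES:
                targets.add(suffix)
    return targets
```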
…fallbacks
1. layers.py: fix RowParallelLinearWithLoRA bias handling to pass bias into quant_method.apply(), matching base RowParallelLinear behavior; add interleaved gate/up layout support in FusedMoEWithLoRA for models using gemm1_alpha (e.g. gpt-oss-20b).
2. mem_pool.py: zero-initialize all LoRA buffers (torch.empty -> torch.zeros) to prevent garbage values in unused slots.
3. utils.py: fall back to config.intermediate_size when moe_intermediate_size is not available in get_hidden_dim (supports GptOss, Mixtral, OLMoE, PhiMoE, GraniteMoE, Grok, etc.; see the fallback sketch below); accept PEFT shorthand "all-linear" in get_normalized_target_modules; fix isinstance order in auto_detect_lora_target_modules so ParallelLMHead is checked before VocabParallelEmbedding.
4. gpt_oss.py: add should_apply_lora() to GptOssForCausalLM for explicit LoRA module filtering, consistent with Qwen3VLMoe.

Made-with: Cursor
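The intermediate-size fallback in item 3 amounts to a short getattr chain; a rough sketch, with config attribute names following the HF conventions mentioned above:

```python
def moe_intermediate_size(config) -> int:
    # Prefer the MoE-specific field; configs like GptOss or Mixtral only
    # define the dense intermediate_size, so fall back to that.
    size = getattr(config, "moe_intermediate_size", None)
    return size if size is not None else config.intermediate_size
```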
Regression test comparing SGLang LoRA logprobs against reference training logprobs (KL threshold 1e-2). Uses 8-GPU H200 suite with triton MoE runner and shared outer LoRA mode. Adapter checkpoint: yushengsu/lora-diff-gpt-oss-20b Made-with: Cursor
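A hedged sketch of what a KL-thresholded logprob comparison can look like, assuming both sides expose per-token logprobs for the same token ids; the actual test harness and estimator may differ:

```python
import numpy as np

KL_THRESHOLD = 1e-2  # threshold quoted in the commit message above

def approx_kl(ref_logprobs: np.ndarray, test_logprobs: np.ndarray) -> float:
    # Monte-Carlo estimate of KL(ref || test): treating the observed tokens
    # as samples from the reference, E_ref[log p_ref - log p_test] is
    # estimated by the mean logprob gap.
    return float(np.mean(ref_logprobs - test_logprobs))

# Example: two nearly identical logprob traces pass the check.
ref = np.array([-1.20, -0.35, -2.10])
test = np.array([-1.21, -0.36, -2.08])
assert approx_kl(ref, test) < KL_THRESHOLD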
Pre-allocate MoE intermediate buffers before memory profiling so KV cache sizing accounts for them. Reuse fixed buffers during capture/replay instead of dynamic torch.empty() allocations.
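A rough sketch of the fixed-buffer pattern, assuming illustrative names and shapes (not SGLang's actual API): allocate once before memory profiling so KV-cache sizing sees the buffers, then hand out views during capture/replay.

```python
import torch

class MoeLoraBuffers:
    """Scratch buffers allocated once, before memory profiling."""

    def __init__(self, max_tokens: int, topk: int, hidden: int, device: str = "cuda"):
        self.gateup_out = torch.zeros(max_tokens * topk, hidden, device=device)
        self.down_out = torch.zeros(max_tokens * topk, hidden, device=device)

    def slice(self, num_tokens: int, topk: int):
        # CUDA Graph replay requires stable storage: views into the fixed
        # buffers are fine, fresh torch.empty() allocations are not.
        n = num_tokens * topk
        return self.gateup_out[:n], self.down_out[:n]
```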
Extract get_triton_quant_info() into FusedMoEMethodBase and each quant method (Fp8, W8A8Fp8, W8A8Int8, BlockInt8, MoeWNA16, Unquantized) so FusedMoEWithLoRA uses the polymorphic method instead of hardcoding TritonMoeQuantInfo. Enables LoRA on quantized MoE models. Made-with: Cursor
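A minimal sketch of the polymorphic pattern this describes; the field set and class names are simplified stand-ins for the real classes under python/sglang/srt/layers/quantization/:

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class TritonMoeQuantInfo:
    w13_weight: torch.Tensor
    w2_weight: torch.Tensor
    w13_scale: Optional[torch.Tensor] = None
    w2_scale: Optional[torch.Tensor] = None
    use_fp8: bool = False

class FusedMoEMethodBase:
    def get_triton_quant_info(self, layer) -> TritonMoeQuantInfo:
        # Default: unquantized weights, no scales.
        return TritonMoeQuantInfo(layer.w13_weight, layer.w2_weight)

class Fp8MoEMethod(FusedMoEMethodBase):
    def get_triton_quant_info(self, layer) -> TritonMoeQuantInfo:
        # FP8 override: attach the scales that apply() already uses, so the
        # LoRA MoE path sees the same descriptor as the base path.
        return TritonMoeQuantInfo(
            layer.w13_weight, layer.w2_weight,
            w13_scale=layer.w13_weight_scale,
            w2_scale=layer.w2_weight_scale,
            use_fp8=True,
        )
```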
- Add ReplicatedLinearWithLoRA for fused_qkv_a_proj_with_mqa, applying LoRA B via two separate sgemm calls for unequal output partitions (q_a_proj=1536 vs kv_a_proj_with_mqa=576); see the sketch after this list. B slices are precomputed in set_lora_info to avoid per-forward allocation.
- Add normalize_fused_qkv_a_proj to fuse q_a_proj + kv_a_proj_with_mqa adapter weights into a single stacked entry.
- Add a stack_num parameter to run_lora_a_sgemm across all 3 backends.
- Fix o_proj hidden dim to use v_head_dim for MLA models.
- Fix gate_up_proj/down_proj hidden dim to use the per-layer shared expert intermediate size on MoE layers.
- Exclude ReplicatedLinear from TP sharding in memory pool allocation.

Made-with: Cursor
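A rough sketch of the split LoRA B application for unequal output partitions. Tensor names and the plain matmuls are illustrative; the real path goes through the backend's sgemm kernels, with the B slices precomputed in set_lora_info:

```python
import torch

def apply_lora_b_split(
    lora_a_out: torch.Tensor,  # (num_tokens, rank): activations after LoRA A
    b_q: torch.Tensor,         # (q_out_dim, rank), e.g. q_a_proj -> 1536
    b_kv: torch.Tensor,        # (kv_out_dim, rank), e.g. kv_a_proj_with_mqa -> 576
    base_out: torch.Tensor,    # (num_tokens, q_out_dim + kv_out_dim)
    scaling: float,
) -> torch.Tensor:
    # Two separate matmuls because the fused output is partitioned into
    # sub-projections of different widths.
    q_dim = b_q.shape[0]
    base_out[:, :q_dim] += scaling * (lora_a_out @ b_q.T)
    base_out[:, q_dim:] += scaling * (lora_a_out @ b_kv.T)
    return base_out
```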
Pull request overview
This PR refactors LoRA quant-info handling and adds DeepSeek V3 MLA LoRA support, while also improving LoRA behavior under CUDA Graph by introducing early MoE buffer preallocation and capture-mode execution changes.
Changes:
- Add DeepSeek MLA fused projection LoRA support (new normalized target module, hidden-dim logic, and adapter weight normalization).
- Rework LoRA + CUDA Graph integration (two-phase init, capture-mode gating, and shared MoE buffer preallocation).
- Improve CLI/test defaults around LoRA and CUDA Graph (e.g., --[no-]experts-shared-outer-loras, tests no longer forcing disable_cuda_graph=True); see the flag sketch below.
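One way to get the paired `--flag`/`--no-flag` behavior is argparse's built-in BooleanOptionalAction (Python 3.9+); a minimal sketch, noting that sglang's server_args may implement the flag differently:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--experts-shared-outer-loras",
    # Automatically generates the matching --no-experts-shared-outer-loras flag.
    action=argparse.BooleanOptionalAction,
    default=True,
)
args = parser.parse_args(["--no-experts-shared-outer-loras"])
assert args.experts_shared_outer_loras is False
```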
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/sglang/srt/lora/lora.py | Adds fused-weight normalization for DeepSeek MLA-style adapters. |
| python/sglang/srt/lora/utils.py | Extends target-module normalization and dimension inference for new fused modules and embeddings. |
| python/sglang/srt/lora/mem_pool.py | Adjusts TP sharding rules for replicated LoRA modules and tightens validation for shared-expert outer-LoRA loads. |
| python/sglang/srt/lora/layers.py | Integrates MoE LoRA runner changes, CUDA-graph adapter masks/buffers, and caches Triton quant info for MoE LoRA. |
| python/sglang/srt/lora/lora_moe_runners.py | Updates MoE runner control flow for capture mode and optional CUDA-graph buffer reuse. |
| python/sglang/srt/lora/backend/base_backend.py | Introduces backend-agnostic MoE CUDA-graph buffer preallocation used by the Triton LoRA MoE path. |
| python/sglang/srt/model_executor/model_runner.py | Adds “Phase 1” LoRA CUDA-graph init to pre-allocate MoE buffers before memory profiling/pool init. |
| python/sglang/srt/model_executor/cuda_graph_runner.py | Documents “Phase 2” LoRA CUDA-graph init for dense batch metadata. |
| python/sglang/srt/server_args.py | Switches --experts-shared-outer-loras to a boolean optional flag with --no-... support. |
| python/sglang/srt/lora/triton_ops/fused_moe_lora_kernel.py | Tweaks kernel math to align operand dtypes for tl.dot. |
| test/registered/lora/test_lora_qwen3_vl_30b_a3b_instruct_logprob_diff.py | Stops forcing disable_cuda_graph=True in the LoRA logprob regression. |
| test/registered/lora/test_lora_qwen3_8b_logprob_diff.py | Same as above. |
| test/registered/lora/test_lora_qwen3_30b_a3b_instruct_2507_logprob_diff.py | Same as above. |
| test/registered/lora/test_lora_gpt_oss_20b_logprob_diff.py | Same as above. |
Comments suppressed due to low confidence (1)
python/sglang/srt/lora/layers.py:855
- In CUDA-graph mode, adapter_enabled is always populated via index_fill_ from batch_info.weight_indices, even when batch_info.has_active_lora is False (base-only). But TritonRunnerCoreWithLoRA forces the LoRA path during capture, so this ends up enabling at least the base slot and prevents the intended “all-zeros adapter_enabled” early-exit behavior during capture. Consider skipping index_fill_ (leave adapter_enabled all zeros) when has_active_lora is False (and/or in capture mode) so capture records the kernels without doing full work for the base adapter.
```python
    adapter_enabled = cg_buffers["adapter_enabled"]
    adapter_enabled.zero_()
    idx_buf = cg_buffers["weight_indices_long"]
    idx_buf[: batch_info.bs] = batch_info.weight_indices[: batch_info.bs]
    adapter_enabled.index_fill_(0, idx_buf[: batch_info.bs], 1)
else:
    adapter_enabled = torch.zeros(
        len(lora_ranks), dtype=torch.int32, device=lora_ranks.device
    )
```
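A minimal sketch of the suggested gating, reusing the names from the snippet above (not self-contained; it replaces the unconditional index_fill_ in the CUDA-graph branch):

```python
adapter_enabled = cg_buffers["adapter_enabled"]
adapter_enabled.zero_()
if batch_info.has_active_lora:
    # Base-only batches (has_active_lora == False) keep adapter_enabled
    # all zeros, preserving the kernel's early-exit behavior during capture.
    idx_buf = cg_buffers["weight_indices_long"]
    idx_buf[: batch_info.bs] = batch_info.weight_indices[: batch_info.bs]
    adapter_enabled.index_fill_(0, idx_buf[: batch_info.bs], 1)
```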
```python
kv_a_weight = (
    weights[kv_a_name]
    if kv_a_name in weights
    else torch.zeros_like(weights[q_a_name])
)

weights[fused_name] = torch.cat((weights[q_a_name], kv_a_weight), dim=0)
weights.pop(q_a_name)
if kv_a_name in weights:
    weights.pop(kv_a_name)
```
normalize_fused_qkv_a_proj falls back to torch.zeros_like(q_a) when kv_a_proj_with_mqa weights are missing. For LoRA B this is very likely the wrong shape because q_a_proj and kv_a_proj_with_mqa have different output dims (q_lora_rank vs kv_lora_rank+qk_rope_head_dim), which can lead to silently incorrect fused weight shapes or downstream shape/assert failures. Consider requiring both weights to be present (raise a clear error) or constructing a correctly-shaped zero tensor for the missing side based on base_hf_config dims and the existing weight’s rank/input dims.
Current:

```python
kv_a_weight = (
    weights[kv_a_name]
    if kv_a_name in weights
    else torch.zeros_like(weights[q_a_name])
)
weights[fused_name] = torch.cat((weights[q_a_name], kv_a_weight), dim=0)
weights.pop(q_a_name)
if kv_a_name in weights:
    weights.pop(kv_a_name)
```

Suggested change:

```python
if q_a_name not in weights or kv_a_name not in weights:
    missing_weights = []
    if q_a_name not in weights:
        missing_weights.append(q_a_name)
    if kv_a_name not in weights:
        missing_weights.append(kv_a_name)
    raise ValueError(
        "Cannot fuse LoRA qkv_a_proj weights: expected both "
        f"'{q_a_name}' and '{kv_a_name}' to be present when "
        f"building '{fused_name}', but missing {missing_weights}."
    )
weights[fused_name] = torch.cat((weights[q_a_name], weights[kv_a_name]), dim=0)
weights.pop(q_a_name)
weights.pop(kv_a_name)
```
```diff
         # Pre-compute quant info for efficiency (weights don't change during inference)
-        self._quant_info = TritonMoeQuantInfo(
-            w13_weight=base_layer.w13_weight,
-            w2_weight=base_layer.w2_weight,
-            b13=getattr(base_layer, "w13_weight_bias", None),
-            b2=getattr(base_layer, "w2_weight_bias", None),
-        )
+        self._quant_info = base_layer.quant_method.get_triton_quant_info(base_layer)

     def set_lora_info(
```
FusedMoEWithLoRA caches self._quant_info = base_layer.quant_method.get_triton_quant_info(base_layer), but some MoE quant methods (e.g. ModelOptFp8MoEMethod in layers/quantization/modelopt_quant.py) construct a non-default TritonMoeQuantInfo inside apply() and do not override get_triton_quant_info(). In that case LoRA MoE will run Triton kernels with the default unquantized descriptor, which can produce incorrect outputs or crashes when weights/scales are actually FP8/NVFP4/etc. Please ensure every MoE quant method that can be used with the Triton MoE runner overrides get_triton_quant_info() to return the same descriptor it uses in apply(), or add a guard here to fail fast if get_triton_quant_info() is not implemented for the active quant method.
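A hedged sketch of the suggested fail-fast guard, meant to sit where the cached descriptor is built. Names follow the snippet above; the exact base-class contract may differ in the real code:

```python
quant_method = base_layer.quant_method
quant_cls = type(quant_method)
# Detect a non-overridden method by identity against the base implementation.
if quant_cls.get_triton_quant_info is FusedMoEMethodBase.get_triton_quant_info:
    # Refuse to run LoRA MoE with the default (unquantized) descriptor on a
    # layer whose apply() actually uses quantized weights/scales.
    raise NotImplementedError(
        f"{quant_cls.__name__} must override get_triton_quant_info() "
        "to be used with the Triton LoRA MoE path."
    )
self._quant_info = quant_method.get_triton_quant_info(base_layer)
```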
/tag-and-rerun-ci |
/rerun-failed-ci |

Motivation
Modifications
MoE quant-info refactor
Extracts get_triton_quant_info() into each quantization method (FP8, INT8, WNA16, etc.) so FusedMoEWithLoRA correctly receives quantization scales/flags, enabling LoRA on quantized MoE models.
DeepSeek-V3 MLA LoRA support
Adds ReplicatedLinearWithLoRA to handle the fused q_a_proj + kv_a_proj_with_mqa projection unique to MLA. Since the two sub-projections have unequal output dimensions, LoRA B is split and applied via two separate sgemm calls. Includes weight fusion, hidden-dim fixes (v_head_dim for o_proj, shared-expert sizes for MoE layers), and TP exclusion for replicated layers.
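A rough sketch of the v_head_dim fix for o_proj, with attribute names borrowed from common MLA configs; the real logic lives in get_hidden_dim in python/sglang/srt/lora/utils.py:

```python
def o_proj_input_dim(config) -> int:
    # MLA attention: o_proj consumes num_heads * v_head_dim, which generally
    # differs from a hidden_size // num_heads-style head dim.
    v_head_dim = getattr(config, "v_head_dim", None)
    if v_head_dim is not None:
        return config.num_attention_heads * v_head_dim
    # Non-MLA models: o_proj's input width matches the hidden size.
    return config.hidden_size
```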
CI: Adds a 5-GPU GB200 logprob accuracy test against DeepSeek-V3.1-Base.
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci