[Lora] Lora Kimi support #22381
Conversation
When adapter_config.json uses PEFT shorthands like `"all-linear"` or `"all"`, SGLang previously required users to explicitly specify `--lora-target-modules` on the CLI. This change adds a model-scanning approach that inspects the loaded base model to discover all LoRA-compatible linear modules automatically.

Changes:
- utils.py: add `auto_detect_lora_target_modules()`, which walks the model graph, collects LinearBase/FusedMoE/ParallelLMHead module suffixes, normalizes them, and filters to the set supported by `get_hidden_dim` and `init_buffers`.
- lora_manager.py: in `init_lora_shapes()`, resolve `"all-linear"`/`"all"` via model scanning instead of raising a ValueError when CLI target modules are not provided. In `init_lora_modules()`, guard against modules outside decoder layers (`layer_id is None`) to prevent a TypeError on non-layer modules.

Made-with: Cursor
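For context, a minimal sketch of what such a scanner could look like, matching on class names over plain `torch.nn` modules; the real SGLang helper uses `isinstance` checks against its own layer classes and applies additional normalization and filtering:

```python
import torch.nn as nn

def auto_detect_lora_target_modules(model: nn.Module) -> set[str]:
    """Collect suffixes of LoRA-compatible modules by walking the graph."""
    targets = set()
    for name, module in model.named_modules():
        # Matching on class names here for illustration; SGLang checks
        # isinstance against LinearBase / FusedMoE / ParallelLMHead.
        if type(module).__name__ in ("LinearBase", "FusedMoE", "ParallelLMHead"):
            # Keep only the last path component, e.g.
            # "model.layers.0.self_attn.qkv_proj" -> "qkv_proj".
            targets.add(name.split(".")[-1])
    return targets
```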
…fallbacks

1. layers.py: fix `RowParallelLinearWithLoRA` bias handling to pass `bias` into `quant_method.apply()`, matching base RowParallelLinear behavior; add interleaved gate/up layout support in `FusedMoEWithLoRA` for models using `gemm1_alpha` (e.g. gpt-oss-20b).
2. mem_pool.py: zero-initialize all LoRA buffers (`torch.empty` -> `torch.zeros`) to prevent garbage values in unused slots.
3. utils.py: fall back to `config.intermediate_size` when `moe_intermediate_size` is not available in `get_hidden_dim` (supports GptOss, Mixtral, OLMoE, PhiMoE, GraniteMoE, Grok, etc.; see the sketch below); accept the PEFT shorthand `"all-linear"` in `get_normalized_target_modules`; fix the isinstance order in `auto_detect_lora_target_modules` so ParallelLMHead is checked before VocabParallelEmbedding.
4. gpt_oss.py: add `should_apply_lora()` to GptOssForCausalLM for explicit LoRA module filtering, consistent with Qwen3VLMoe.

Made-with: Cursor
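A hedged sketch of the fallback in item 3; the helper name is hypothetical and SGLang's actual `get_hidden_dim` handles many more module types:

```python
# Hypothetical helper, not SGLang's actual get_hidden_dim.
def moe_intermediate_dim(config) -> int:
    # MoE configs such as GptOss or Mixtral may not define
    # moe_intermediate_size; fall back to the dense intermediate_size.
    size = getattr(config, "moe_intermediate_size", None)
    return size if size is not None else config.intermediate_size
```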
Regression test comparing SGLang LoRA logprobs against reference training logprobs (KL threshold 1e-2). Uses the 8-GPU H200 suite with the triton MoE runner and shared outer LoRA mode. Adapter checkpoint: yushengsu/lora-diff-gpt-oss-20b.

Made-with: Cursor
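A sketch of the kind of check such a regression test performs, assuming per-token logprobs for tokens sampled under the reference policy (so the mean logprob gap is a Monte Carlo estimate of the KL); names are illustrative, not the test's actual code:

```python
import torch

def assert_logprob_close(actual_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         threshold: float = 1e-2) -> None:
    # With tokens sampled under the reference policy,
    # E[log p_ref - log p_actual] estimates KL(ref || actual).
    kl_estimate = (ref_logprobs - actual_logprobs).mean().item()
    assert kl_estimate < threshold, (
        f"KL estimate {kl_estimate:.4f} exceeds threshold {threshold}"
    )
```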
Pre-allocate MoE intermediate buffers before memory profiling so KV cache sizing accounts for them. Reuse fixed buffers during capture/replay instead of dynamic torch.empty() allocations.
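An illustrative pattern for this change, with buffers sized for the worst case up front; the class and field names are hypothetical, not SGLang's:

```python
import torch

class MoeScratchBuffers:
    """Hypothetical fixed-size scratch buffers, not SGLang's actual class."""

    def __init__(self, max_tokens: int, hidden: int, inter: int,
                 device: str = "cuda", dtype=torch.bfloat16):
        # Allocated once, before memory profiling, and sized for the worst
        # case so KV cache sizing accounts for the true activation peak.
        self.gate_up = torch.empty(max_tokens, 2 * inter, device=device, dtype=dtype)
        self.down = torch.empty(max_tokens, hidden, device=device, dtype=dtype)

    def views(self, num_tokens: int):
        # Hand out slices of the fixed buffers during capture/replay;
        # no torch.empty() call happens per forward pass.
        return self.gate_up[:num_tokens], self.down[:num_tokens]
```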
Extract `get_triton_quant_info()` into `FusedMoEMethodBase` and override it in each quant method (Fp8, W8A8Fp8, W8A8Int8, BlockInt8, MoeWNA16, Unquantized) so that `FusedMoEWithLoRA` uses the polymorphic method instead of hardcoding `TritonMoeQuantInfo`. This enables LoRA on quantized MoE models.

Made-with: Cursor
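A minimal sketch of the hook-based refactor, with a stand-in dataclass for `TritonMoeQuantInfo` (its field names follow the diff further down in this thread); the override shown is the unquantized case, and the exact signatures are assumptions:

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class TritonMoeQuantInfo:  # stand-in; field names mirror the diff below
    w13_weight: torch.Tensor
    w2_weight: torch.Tensor
    b13: Optional[torch.Tensor] = None
    b2: Optional[torch.Tensor] = None

class FusedMoEMethodBase:
    def get_triton_quant_info(self, layer) -> TritonMoeQuantInfo:
        # Each quant method overrides this; the base raises so that
        # incompatible methods fail loudly instead of guessing.
        raise NotImplementedError(
            f"{type(self).__name__} does not provide Triton MoE quant info"
        )

class UnquantizedFusedMoEMethod(FusedMoEMethodBase):
    def get_triton_quant_info(self, layer) -> TritonMoeQuantInfo:
        # Unquantized case: plain weights plus optional biases, no scales.
        return TritonMoeQuantInfo(
            w13_weight=layer.w13_weight,
            w2_weight=layer.w2_weight,
            b13=getattr(layer, "w13_weight_bias", None),
            b2=getattr(layer, "w2_weight_bias", None),
        )
```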
- Add `ReplicatedLinearWithLoRA` for `fused_qkv_a_proj_with_mqa`, applying LoRA B via two separate sgemm calls for unequal output partitions (q_a_proj=1536 vs kv_a_proj_with_mqa=576); B slices are precomputed in `set_lora_info` to avoid per-forward allocation (see the sketch after this list).
- Add `normalize_fused_qkv_a_proj` to fuse q_a_proj + kv_a_proj_with_mqa adapter weights into a single stacked entry.
- Add a `stack_num` parameter to `run_lora_a_sgemm` across all 3 backends.
- Fix o_proj hidden dim to use `v_head_dim` for MLA models.
- Fix gate_up_proj/down_proj hidden dim to use the per-layer shared expert intermediate size on MoE layers.
- Exclude ReplicatedLinear from TP sharding in memory pool allocation.

Made-with: Cursor
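As noted in the first item, here is a sketch of the two-sgemm LoRA-B expand over unequal partitions, assuming the stacked shrink output has shape `(num_tokens, 2*r)` (stack_num=2) and that the precomputed B slices are `(1536, r)` and `(576, r)`; plain matmuls stand in for the backend sgemm kernels:

```python
import torch

def apply_fused_lora_b(lora_a_out: torch.Tensor,  # (num_tokens, 2 * r)
                       b_q: torch.Tensor,          # (1536, r), precomputed slice
                       b_kv: torch.Tensor,         # (576, r), precomputed slice
                       base_out: torch.Tensor,     # (num_tokens, 1536 + 576)
                       r: int) -> torch.Tensor:
    # Two separate GEMMs: a single fused expand kernel assumes equal output
    # slice sizes, but q_a (1536) and kv_a_with_mqa (576) differ.
    base_out[:, :1536] += lora_a_out[:, :r] @ b_q.t()
    base_out[:, 1536:] += lora_a_out[:, r:] @ b_kv.t()
    return base_out
```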
Made-with: Cursor
- Force triton-compatible MoE weights when LoRA is enabled for compressed-tensors quantized models (avoid the Marlin path, which is incompatible).
- Refactor `get_triton_quant_info` into a reusable method for the LoRA MoE runner.
- Make MoE LoRA runner backend detection robust with a `hasattr` fallback.
- Handle multi-modal model configs via `get_text_config()` in LoRAManager.
- Add a CI test for Kimi-K2.5 LoRA logprob accuracy.

Made-with: Cursor
Pull request overview
This PR extends SGLang’s LoRA support to cover Kimi-K2.5 / DeepSeek-style MLA fused projections and improves MoE+LoRA compatibility across multiple quantization backends, with new CI-registered regression tests validating LoRA logprob accuracy.
Changes:
- Add LoRA handling for the DeepSeek MLA fused projection (`fused_qkv_a_proj_with_mqa`), including target-module normalization, buffer sizing, weight normalization/fusion, and a new `ReplicatedLinearWithLoRA` wrapper.
- Refactor MoE quantization info plumbing via a new `get_triton_quant_info()` hook so the LoRA MoE runner can consume correct quant metadata across quant methods.
- Add registered regression tests for Kimi-K2.5 and DeepSeek-V3.1-Base LoRA logprob accuracy vs reference datasets.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/registered/lora/test_lora_kimi_k25_logprob_diff.py | New CI-registered Kimi-K2.5 LoRA logprob regression test using HF dataset reference. |
| test/registered/lora/test_lora_deepseek_v3_base_logprob_diff.py | New CI-registered DeepSeek-V3.1-Base LoRA logprob regression test with input/output normalization guards. |
| python/sglang/srt/lora/utils.py | Extend target-module normalization and hidden-dim logic for MLA fused projections and MoE shared-expert dims. |
| python/sglang/srt/lora/mem_pool.py | Adjust TP sharding rules for replicated fused projections; improve shared-outer MoE buffer zeroing behavior. |
| python/sglang/srt/lora/lora.py | Add fusion/normalization step to combine q_a + kv_a LoRA weights into fused MLA layout. |
| python/sglang/srt/lora/lora_manager.py | Use get_text_config() for VLM configs; initialize fused MLA LoRA modules with partition boundary metadata. |
| python/sglang/srt/lora/layers.py | Add ReplicatedLinearWithLoRA; refactor MoE LoRA runner init to use get_triton_quant_info() and handle missing runner fields. |
| python/sglang/srt/lora/backend/triton_backend.py | Add stack_num parameter passthrough for LoRA-A SGEMM. |
| python/sglang/srt/lora/backend/torch_backend.py | Add stack_num support wired into num_slices for LoRA-A ops. |
| python/sglang/srt/lora/backend/chunked_backend.py | Add stack_num support wired into chunked shrink op num_slices. |
| python/sglang/srt/layers/quantization/base_config.py | Introduce default FusedMoEMethodBase.get_triton_quant_info() API. |
| python/sglang/srt/layers/quantization/w8a8_int8.py | Factor Triton quant-info construction into get_triton_quant_info(). |
| python/sglang/srt/layers/quantization/w8a8_fp8.py | Factor Triton quant-info construction into get_triton_quant_info(). |
| python/sglang/srt/layers/quantization/unquant.py | Use get_triton_quant_info() in Triton MoE path (XPU). |
| python/sglang/srt/layers/quantization/moe_wna16.py | Add get_triton_quant_info() and reuse it in apply(). |
| python/sglang/srt/layers/quantization/fp8.py | Add get_triton_quant_info() and reuse it in Triton MoE path. |
| python/sglang/srt/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16_moe.py | Add get_triton_quant_info() and reuse it in apply_weights(). |
| python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py | When LoRA is enabled, force Triton-compatible WNA16 MoE scheme; expose get_triton_quant_info() passthrough. |
| python/sglang/srt/layers/quantization/blockwise_int8.py | Add get_triton_quant_info() and reuse it in apply(). |
```python
kv_a_weight = (
    weights[kv_a_name]
    if kv_a_name in weights
    else torch.zeros_like(weights[q_a_name])
)

weights[fused_name] = torch.cat((weights[q_a_name], kv_a_weight), dim=0)
```
In normalize_fused_qkv_a_proj, the fallback for missing kv_a_proj_with_mqa uses torch.zeros_like(weights[q_a_name]). This is only safe for LoRA A (where q/kv LoRA-A shapes match), but for LoRA B the q_a and kv_a output dims differ, so zeros_like will produce the wrong shape and the subsequent torch.cat will create a fused B with an incorrect output dimension (leading to buffer shape mismatches or silent misalignment). Consider handling lora_A vs lora_B separately: for lora_B, either (a) require kv_a_name to exist and raise a clear error if missing, or (b) allocate zeros with the correct kv output dim derived from base_hf_config (kv_lora_rank + qk_rope_head_dim) and the adapter rank.
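One possible shape-safe fallback along the lines of option (b), treating LoRA A and LoRA B separately; the function name and the `kv_out_dim` plumbing are assumptions, not the PR's code:

```python
import torch

def kv_a_fallback(weights: dict, q_a_name: str,
                  is_lora_b: bool, kv_out_dim: int) -> torch.Tensor:
    q_a_weight = weights[q_a_name]
    if not is_lora_b:
        # LoRA A: q_a and kv_a shrink weights share shape (r, hidden_size),
        # so mirroring the q_a shape with zeros is safe.
        return torch.zeros_like(q_a_weight)
    # LoRA B: output dims differ (q_a=1536 vs kv_a=576 in this model), so
    # size the zeros from kv_out_dim = kv_lora_rank + qk_rope_head_dim.
    rank = q_a_weight.shape[1]  # LoRA-B weight is (out_dim, r)
    return torch.zeros(kv_out_dim, rank,
                       dtype=q_a_weight.dtype, device=q_a_weight.device)
```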
```diff
+ qm = base_layer.quant_method
+ if hasattr(qm, "runner") and qm.runner is not None:
+     runner_backend = qm.runner.runner_backend
+ else:
+     runner_backend = MoeRunnerBackend.TRITON
+
  self._lora_runner = MoeRunner(
-     base_layer.quant_method.runner.runner_backend,
+     runner_backend,
      base_layer.moe_runner_config,
      lora_enabled=True,
  )

  # Pre-compute quant info for efficiency (weights don't change during inference)
- self._quant_info = TritonMoeQuantInfo(
-     w13_weight=base_layer.w13_weight,
-     w2_weight=base_layer.w2_weight,
-     b13=getattr(base_layer, "w13_weight_bias", None),
-     b2=getattr(base_layer, "w2_weight_bias", None),
- )
+ self._quant_info = base_layer.quant_method.get_triton_quant_info(base_layer)
```
FusedMoEWithLoRA currently falls back to MoeRunnerBackend.TRITON when the quant method has no runner, and then always builds _quant_info via quant_method.get_triton_quant_info(). For quant methods whose MoE weights are not Triton-compatible (e.g., BitsAndBytesMoEMethod stores packed uint8 weights and doesn’t create a runner), this change can silently route execution into the Triton MoE runner with an invalid TritonMoeQuantInfo, likely producing incorrect results or runtime failures. Please add an explicit compatibility check here (e.g., require qm to expose a supported runner backend or a dedicated flag indicating triton-kernel compatibility) and raise a clear error when LoRA+MoE is requested with an unsupported quant method.
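A sketch of the explicit guard this comment asks for, reusing `MoeRunnerBackend` from the diff above; the `supports_triton_moe` flag is a hypothetical opt-in, not an existing SGLang attribute:

```python
def resolve_lora_moe_backend(quant_method):
    runner = getattr(quant_method, "runner", None)
    if runner is not None:
        return runner.runner_backend
    # No runner: only default to Triton when the quant method opts in,
    # instead of silently assuming its weights are Triton-compatible.
    if getattr(quant_method, "supports_triton_moe", False):
        return MoeRunnerBackend.TRITON
    raise ValueError(
        f"LoRA on MoE layers is not supported with "
        f"{type(quant_method).__name__}: no Triton-compatible MoE weights."
    )
```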
/tag-run-ci-label

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

Motivation
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci