[lora][moe] Virtual experts for LoRA MoE #22122

yushengsu-thu merged 8 commits into sgl-project:main
Conversation
Refactor LoRA MoE runner from per-backend subclass (TritonRunnerCoreWithLoRA) to a generic hook-based injection pattern, decoupling LoRA logic from the base MoE backend. Add Marlin int4/int8 MoE backend for LoRA.
Add virtual expert computation for LoRA+MoE: treats (adapter, expert) pairs as a flat virtual_num_experts space, allowing LoRA deltas to be computed by reusing existing fused MoE kernels. Includes split-K support for better GPU utilization. Enabled via --lora-use-virtual-experts flag.
Code Review
This pull request introduces support for LoRA injection in MoE models using virtual experts, enabling more efficient LoRA integration across different backends including Triton and Marlin. It adds hook-based injection points in the MoE pipeline, updates the runner infrastructure to support these hooks, and includes a new Marlin-based runner core. The changes also introduce a virtual expert routing mechanism to handle LoRA adapters and provide comprehensive tests for correctness.
```python
assert (
    not fuse_sum_all_reduce
), "fuse_add_to_output and fuse_sum_all_reduce are mutually exclusive"
assert (
    add_output_mask is not None
), "add_output_mask required when fuse_add_to_output=True"
```
The assertion `assert add_output_mask is not None` is redundant because `add_output_mask` is already type-hinted as `Optional[torch.Tensor]` and the function signature defaults it to `None`. If `fuse_add_to_output` is `True`, it is better to handle the missing mask gracefully, or to raise a more descriptive error if the tensor is required for the kernel logic.
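For illustration, a minimal sketch of the suggested explicit validation. The function name and full signature here are hypothetical; only the three parameters quoted in the diff are taken from the PR:

```python
from typing import Optional

import torch


def fused_moe_lora_epilogue(
    fuse_add_to_output: bool = False,
    fuse_sum_all_reduce: bool = False,
    add_output_mask: Optional[torch.Tensor] = None,
) -> None:
    # Hypothetical wrapper; only the validation pattern is the point here.
    if fuse_add_to_output:
        if fuse_sum_all_reduce:
            raise ValueError(
                "fuse_add_to_output and fuse_sum_all_reduce are mutually exclusive"
            )
        if add_output_mask is None:
            raise ValueError(
                "add_output_mask is required when fuse_add_to_output=True; "
                "the kernel uses it to skip tokens without a LoRA adapter"
            )
```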
```python
if lora_info is None or lora_info.max_lora_rank == 0:
    return LoRAHooks()
```
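The empty-hooks early return above suggests a null-object pattern for the hook-based injection described in the PR summary. A minimal sketch of what such a hooks container could look like; the field names are assumptions, since only `LoRAHooks` and `max_lora_rank` appear in the quoted diff:

```python
from dataclasses import dataclass
from typing import Callable, Optional

import torch


@dataclass
class LoRAHooks:
    # Optional injection points; None means the base MoE backend runs unchanged.
    before_experts: Optional[Callable[[torch.Tensor], torch.Tensor]] = None
    after_combine: Optional[Callable[[torch.Tensor], torch.Tensor]] = None


def build_lora_hooks(lora_info) -> LoRAHooks:
    # No active adapters (or rank 0): return empty hooks so the backend
    # executes its original code path with zero LoRA overhead.
    if lora_info is None or lora_info.max_lora_rank == 0:
        return LoRAHooks()
    ...
```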
```python
global _MARLIN_WORKSPACE
from sglang.srt.layers.moe.token_dispatcher.standard import StandardCombineInput
```

```python
assert hooks is not None, "hooks must be provided for MarlinLoraRunnerCore"
```
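The `global _MARLIN_WORKSPACE` statement suggests a lazily allocated, module-level scratch buffer reused across kernel calls. A minimal sketch of that pattern; the buffer size, dtype, and helper name are assumptions, not the PR's actual code:

```python
from typing import Optional

import torch

_MARLIN_WORKSPACE: Optional[torch.Tensor] = None


def _get_marlin_workspace(device: torch.device) -> torch.Tensor:
    # Allocate once and reuse: Marlin-style kernels take a scratch workspace
    # tensor, and reallocating it on every forward pass would waste time.
    global _MARLIN_WORKSPACE
    if _MARLIN_WORKSPACE is None:
        _MARLIN_WORKSPACE = torch.zeros(
            1024 * 1024, dtype=torch.int32, device=device
        )
    return _MARLIN_WORKSPACE
```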
|
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Motivation
NOTE: depends on the hooks-based architecture in #21858.
This PR introduces virtual expert computation for LoRA+MoE: instead of iterating over each LoRA adapter separately (one alignment + kernel call per adapter), we treat `[num_loras, num_experts]` weight combinations as a flat `[virtual_num_experts]` space. This allows LoRA deltas to be computed in a single fused MoE kernel call by reusing the existing `invoke_fused_moe_kernel` infrastructure, significantly reducing kernel launch overhead for multi-adapter serving.

Enabled via `--lora-use-virtual-experts`.
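The flattening amounts to simple index arithmetic. A minimal sketch of one plausible layout (row-major, so a plain `reshape` aligns the weights); the tensor names are illustrative, not the PR's actual variables:

```python
import torch

num_loras, num_experts = 4, 8
virtual_num_experts = num_loras * num_experts

# Per-token routing: which adapter serves each token, and which expert it picked.
token_lora_idx = torch.tensor([0, 2, 1])    # adapter index per token
token_expert_idx = torch.tensor([5, 3, 7])  # expert index per token

# Each (lora, expert) pair becomes one flat virtual expert ID, so a single
# fused MoE kernel call can cover every adapter at once.
virtual_expert_ids = token_lora_idx * num_experts + token_expert_idx  # [5, 19, 15]

# LoRA A weights flatten the same way: [max_loras, num_experts, r, hidden]
# becomes [max_loras * num_experts, r, hidden] as a view, with no copy.
lora_a = torch.randn(num_loras, num_experts, 16, 128)
lora_a_flat = lora_a.reshape(virtual_num_experts, 16, 128)
```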
Modifications

- New virtual experts module (`lora/triton_ops/virtual_experts.py`) that maps `(lora_adapter, expert)` pairs into virtual expert IDs, flattens LoRA weights from `[max_loras, num_experts, ...]` to `[max_loras * num_experts, ...]`, and runs fused MoE for LoRA A and B in a single pass. `_compute_token_lora_mapping` maps each token to its adapter index for the virtual routing.
- `fused_moe_triton_kernels.py`: added `lora_num_experts_override` to allow virtual experts to override the expert count in the align kernel, and `fuse_add_to_output` / `add_output_mask` for masked in-place addition (tokens with no LoRA adapter are skipped; see the sketch after this list).
- Added the `--lora-use-virtual-experts` flag in `server_args.py`, propagated through `lora_manager.py` and `layers.py` to `LoRAInfo`.
- The virtual experts op is registered via `direct_register_custom_op`; the `routing_cache` dict (not supported by `torch.library.infer_schema`) is handled by a thin wrapper.
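The masked in-place addition (`fuse_add_to_output` + `add_output_mask`) has straightforward reference semantics. A sketch of the non-fused equivalent in plain PyTorch, with illustrative names:

```python
import torch

hidden = torch.randn(6, 128)      # base MoE output for 6 tokens
lora_delta = torch.randn(6, 128)  # LoRA delta from the virtual experts pass
has_lora = torch.tensor([1, 1, 0, 1, 0, 1], dtype=torch.bool)  # add_output_mask

# Reference semantics of fuse_add_to_output: add the delta in place, but only
# for tokens that actually have a LoRA adapter; the rest pass through untouched.
hidden[has_lora] += lora_delta[has_lora]
```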
Accuracy Tests

All 16 `test_lora_moe_runner_virtual_experts` parametrized configs pass; each verifies that the virtual experts path produces the same LoRA delta as the per-adapter baseline.

Checklist