[lora][moe] Virtual experts for LoRA MoE #22122

yushengsu-thu merged 8 commits into sgl-project:main
Conversation
Refactor LoRA MoE runner from per-backend subclass (TritonRunnerCoreWithLoRA) to a generic hook-based injection pattern, decoupling LoRA logic from the base MoE backend. Add Marlin int4/int8 MoE backend for LoRA.
Add virtual expert computation for LoRA+MoE: treats (adapter, expert) pairs as a flat virtual_num_experts space, allowing LoRA deltas to be computed by reusing existing fused MoE kernels. Includes split-K support for better GPU utilization. Enabled via --lora-use-virtual-experts flag.
Code Review
This pull request introduces support for LoRA injection in MoE models using virtual experts, enabling more efficient LoRA integration across different backends including Triton and Marlin. It adds hook-based injection points in the MoE pipeline, updates the runner infrastructure to support these hooks, and includes a new Marlin-based runner core. The changes also introduce a virtual expert routing mechanism to handle LoRA adapters and provide comprehensive tests for correctness.
```python
assert (
    not fuse_sum_all_reduce
), "fuse_add_to_output and fuse_sum_all_reduce are mutually exclusive"
assert (
    add_output_mask is not None
), "add_output_mask required when fuse_add_to_output=True"
```
The assertion `assert add_output_mask is not None` is redundant because `add_output_mask` is already type-hinted as `Optional[torch.Tensor]` and the function signature defaults it to `None`. If `fuse_add_to_output` is `True`, it is better to handle the missing mask gracefully, or to raise a more descriptive error if the tensor is required for the kernel logic.
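For illustration, a minimal sketch of the suggested explicit validation. The function name and full signature here are hypothetical; only the three parameters quoted in the diff are taken from the PR:

```python
from typing import Optional

import torch


def fused_moe_lora_epilogue(
    fuse_add_to_output: bool = False,
    fuse_sum_all_reduce: bool = False,
    add_output_mask: Optional[torch.Tensor] = None,
) -> None:
    # Hypothetical wrapper; only the validation pattern is the point here.
    if fuse_add_to_output:
        if fuse_sum_all_reduce:
            raise ValueError(
                "fuse_add_to_output and fuse_sum_all_reduce are mutually exclusive"
            )
        if add_output_mask is None:
            raise ValueError(
                "add_output_mask is required when fuse_add_to_output=True; "
                "the kernel uses it to skip tokens without a LoRA adapter"
            )
```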
```python
if lora_info is None or lora_info.max_lora_rank == 0:
    return LoRAHooks()
```
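The empty-hooks early return above suggests a null-object pattern for the hook-based injection described in the PR summary. A minimal sketch of what such a hooks container could look like; the field names are assumptions, since only `LoRAHooks` and `max_lora_rank` appear in the quoted diff:

```python
from dataclasses import dataclass
from typing import Callable, Optional

import torch


@dataclass
class LoRAHooks:
    # Optional injection points; None means the base MoE backend runs unchanged.
    before_experts: Optional[Callable[[torch.Tensor], torch.Tensor]] = None
    after_combine: Optional[Callable[[torch.Tensor], torch.Tensor]] = None


def build_lora_hooks(lora_info) -> LoRAHooks:
    # No active adapters (or rank 0): return empty hooks so the backend
    # executes its original code path with zero LoRA overhead.
    if lora_info is None or lora_info.max_lora_rank == 0:
        return LoRAHooks()
    ...
```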
```python
global _MARLIN_WORKSPACE
from sglang.srt.layers.moe.token_dispatcher.standard import StandardCombineInput
```

```python
assert hooks is not None, "hooks must be provided for MarlinLoraRunnerCore"
```
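The `global _MARLIN_WORKSPACE` statement suggests a lazily allocated, module-level scratch buffer reused across kernel calls. A minimal sketch of that pattern; the buffer size, dtype, and helper name are assumptions, not the PR's actual code:

```python
from typing import Optional

import torch

_MARLIN_WORKSPACE: Optional[torch.Tensor] = None


def _get_marlin_workspace(device: torch.device) -> torch.Tensor:
    # Allocate once and reuse: Marlin-style kernels take a scratch workspace
    # tensor, and reallocating it on every forward pass would waste time.
    global _MARLIN_WORKSPACE
    if _MARLIN_WORKSPACE is None:
        _MARLIN_WORKSPACE = torch.zeros(
            1024 * 1024, dtype=torch.int32, device=device
        )
    return _MARLIN_WORKSPACE
```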
|
Co-authored-by: Yusheng Su <yushengsu.thu@gmail.com>
Motivation
NOTE: depends on the hooks-based architecture in #21858.
This PR introduces virtual expert computation for LoRA+MoE: instead of iterating over each LoRA adapter separately (one alignment + kernel call per adapter), we treat `[num_loras, num_experts]` weight combinations as a flat `[virtual_num_experts]` space. This allows LoRA deltas to be computed in a single fused MoE kernel call by reusing the existing `invoke_fused_moe_kernel` infrastructure, significantly reducing kernel launch overhead for multi-adapter serving.

Enabled via `--lora-use-virtual-experts`.
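The flattening amounts to simple index arithmetic. A minimal sketch of one plausible layout (row-major, so a plain `reshape` aligns the weights); the tensor names are illustrative, not the PR's actual variables:

```python
import torch

num_loras, num_experts = 4, 8
virtual_num_experts = num_loras * num_experts

# Per-token routing: which adapter serves each token, and which expert it picked.
token_lora_idx = torch.tensor([0, 2, 1])    # adapter index per token
token_expert_idx = torch.tensor([5, 3, 7])  # expert index per token

# Each (lora, expert) pair becomes one flat virtual expert ID, so a single
# fused MoE kernel call can cover every adapter at once.
virtual_expert_ids = token_lora_idx * num_experts + token_expert_idx  # [5, 19, 15]

# LoRA A weights flatten the same way: [max_loras, num_experts, r, hidden]
# becomes [max_loras * num_experts, r, hidden] as a view, with no copy.
lora_a = torch.randn(num_loras, num_experts, 16, 128)
lora_a_flat = lora_a.reshape(virtual_num_experts, 16, 128)
```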
Modifications

- New virtual experts module (`lora/triton_ops/virtual_experts.py`) that maps `(lora_adapter, expert)` pairs into virtual expert IDs, flattens LoRA weights from `[max_loras, num_experts, ...]` to `[max_loras * num_experts, ...]`, and runs fused MoE for LoRA A and B in a single pass. `_compute_token_lora_mapping` maps each token to its adapter index for the virtual routing.
- `fused_moe_triton_kernels.py`: added `lora_num_experts_override` to allow virtual experts to override the expert count in the align kernel, and `fuse_add_to_output` / `add_output_mask` for masked in-place addition (tokens with no LoRA adapter are skipped; see the sketch after this list).
- Added the `--lora-use-virtual-experts` flag in `server_args.py`, propagated through `lora_manager.py` and `layers.py` to `LoRAInfo`.
- The virtual experts op is registered via `direct_register_custom_op`; the `routing_cache` dict (not supported by `torch.library.infer_schema`) is handled by a thin wrapper.
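The masked in-place addition (`fuse_add_to_output` + `add_output_mask`) has straightforward reference semantics. A sketch of the non-fused equivalent in plain PyTorch, with illustrative names:

```python
import torch

hidden = torch.randn(6, 128)      # base MoE output for 6 tokens
lora_delta = torch.randn(6, 128)  # LoRA delta from the virtual experts pass
has_lora = torch.tensor([1, 1, 0, 1, 0, 1], dtype=torch.bool)  # add_output_mask

# Reference semantics of fuse_add_to_output: add the delta in place, but only
# for tokens that actually have a LoRA adapter; the rest pass through untouched.
hidden[has_lora] += lora_delta[has_lora]
```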
Accuracy Tests

All 16 `test_lora_moe_runner_virtual_experts` parametrized configs pass; each verifies that the virtual experts path produces the same LoRA delta as the per-adapter baseline.

Checklist