Add dedicated FlashInferCuteDslMoE layer for standard-path FP4 MoE #21339
Merged
ch-wan merged 34 commits into sgl-project:main on Apr 10, 2026
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
…oe_runner Dissolve FlashInferCuteDslMoE into the moe_runner pattern, aligning with the MoE refactor roadmap (sgl-project#8715). CuteDSL FP4 now flows through FusedMoE -> StandardDispatcher -> MoeRunner -> @register_fused_func, matching the flashinfer_trtllm integration and unblocking future A2A backend support.
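For readers unfamiliar with the registration flow that commit describes, here is a minimal sketch of the pattern. The registry, decorator internals, and dataclass fields are illustrative assumptions, not the actual sglang code; only the names `register_fused_func("none", "flashinfer_cutedsl")`, `CuteDslFp4MoeQuantInfo`, and `CuteDslMoEWrapper.run()` come from the PR text.

```python
# Simplified sketch of the moe_runner registration pattern, not the real
# sglang implementation: a fused MoE function is registered under an
# (a2a_backend, runner_backend) key and later looked up by the MoeRunner.
from dataclasses import dataclass
import torch

_FUSED_FUNCS = {}

def register_fused_func(a2a_backend: str, runner_backend: str):
    """Register a fused MoE function under an (a2a, runner) backend pair."""
    def wrap(fn):
        _FUSED_FUNCS[(a2a_backend, runner_backend)] = fn
        return fn
    return wrap

@dataclass
class CuteDslFp4MoeQuantInfo:
    # Per the PR description, this carries weights, scales, and the CuteDSL wrapper.
    w13_weight: torch.Tensor
    w2_weight: torch.Tensor

@register_fused_func("none", "flashinfer_cutedsl")
def cutedsl_fp4_moe(dispatch_output, quant_info: CuteDslFp4MoeQuantInfo, config):
    # Placeholder body: quantize activations to FP4, then invoke the CuteDSL
    # wrapper (CuteDslMoEWrapper.run() in the real integration).
    raise NotImplementedError
```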
Collaborator
@ch-wan could you review this? Thanks!
ch-wan reviewed Apr 7, 2026
```python
    self, layer: torch.nn.Module, moe_runner_config: MoeRunnerConfig
):
    self.moe_runner_config = moe_runner_config
    if self.enable_flashinfer_cutedsl_moe:
```
Collaborator
Should we provide a default runner when it is auto (or not defined in server args)?
Collaborator
Author
Added FLASHINFER_TRTLLM as the default runner for auto: edf3e16
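As a rough illustration of that fallback (the enum and helper names here are hypothetical, not the actual sglang code; the real change is in commit edf3e16):

```python
# Hypothetical sketch of defaulting an "auto" MoE runner backend to
# FLASHINFER_TRTLLM, as discussed above. Names are illustrative only.
from enum import Enum

class MoeRunnerBackend(Enum):
    AUTO = "auto"
    FLASHINFER_TRTLLM = "flashinfer_trtllm"
    FLASHINFER_CUTEDSL = "flashinfer_cutedsl"

def resolve_moe_runner_backend(requested: MoeRunnerBackend) -> MoeRunnerBackend:
    # When the server arg is left as auto (or unset), fall back to the
    # TRT-LLM runner rather than failing.
    if requested is MoeRunnerBackend.AUTO:
        return MoeRunnerBackend.FLASHINFER_TRTLLM
    return requested
```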
Collaborator
/tag-and-rerun-ci
Contributor
/tag-and-rerun-ci (just move test forward, thx~)
Contributor
Hi @leejnau, could you please fix the failed CI case? Thanks! https://github.com/sgl-project/sglang/actions/runs/24141598046/job/70546144580?pr=21339#step:7:2420
ch-wan approved these changes Apr 10, 2026
Fridge003 pushed a commit that referenced this pull request on Apr 11, 2026
pyc96 pushed a commit to pyc96/sglang that referenced this pull request on Apr 14, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request on Apr 22, 2026
Motivation
We want the option of a more standard `--moe-runner-backend flashinfer_cutedsl` backend that is not specific to DeepEP. This PR integrates the Wrapper API exposed here: flashinfer-ai/flashinfer#2398.

Modifications
Server

- Adds `flashinfer_cutedsl` as a modular `moe_runner` backend for `--moe-runner-backend flashinfer_cutedsl` with `modelopt_fp4` quantization on the standard path (`moe_a2a_backend=none`). Implements the `moe_runner` pattern established by `flashinfer_trtllm.py`: a `CuteDslFp4MoeQuantInfo` dataclass carries weights/scales/wrapper, and a `@register_fused_func("none", "flashinfer_cutedsl")` function handles FP4 quantization + `CuteDslMoEWrapper.run()`, aligning with the MoE refactor roadmap ([Roadmap] MoE Refactor #8715).
- Routes through the `FusedMoE` -> `StandardDispatcher` -> `MoeRunner` pipeline instead of a standalone layer class, unblocking future `moe_a2a_backend=flashinfer` support.
- `moe_sort` handles EP with global expert IDs internally, so `skip_local_expert_mapping` is enabled.
- Switched `torch.inference_mode()` to `torch.no_grad()`, since CuteDSL lazily allocates persistent CUDA graph buffers during the first forward pass (see the sketch after this list).
- In `process_weights_after_loading`, a scale-resolution helper (`_resolve_cutedsl_standard_scales`) derives correct per-expert GEMM alphas from scalarized activation scales, handling EP slicing and multiple checkpoint formats.
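To make the `inference_mode` -> `no_grad` point concrete, here is a minimal, self-contained sketch; it is not sglang code, and the buffer allocation merely stands in for CuteDSL's lazy CUDA-graph buffer allocation.

```python
# Minimal illustration of why persistent buffers allocated lazily during the
# first forward pass are safer under no_grad() than inference_mode().
import torch

def lazily_alloc_persistent_buffer() -> torch.Tensor:
    # Stand-in for CuteDSL allocating a persistent CUDA-graph buffer on the
    # first forward pass.
    return torch.zeros(8)

with torch.inference_mode():
    buf = lazily_alloc_persistent_buffer()
# Buffers created here are "inference tensors"; reusing or mutating them in
# later forward passes that run outside inference mode can raise RuntimeErrors.
print(buf.is_inference())  # True

with torch.no_grad():
    buf = lazily_alloc_persistent_buffer()
# Buffers created here are ordinary tensors and remain usable across
# subsequent forward passes.
print(buf.is_inference())  # False
```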
Tests

- `test/registered/moe/test_cutedsl_moe.py`: unit tests for wrapper accuracy vs. a PyTorch reference, CUDA graph parity, and EP-sharded all-reduce correctness (a sketch of this parity check follows the list).
- `test/registered/backends/test_deepseek_v3_fp4_cutedsl_moe.py`: end-to-end GPQA accuracy on DeepSeek-V3 FP4 for both EP=1 and EP=4 configurations (nightly, 4 GPU B200).
- `flashinfer_trtllm` and `flashinfer_cutlass` backends.
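A hypothetical sketch of the reference-parity style of check used in such unit tests; this is not the actual test file, and the weight layouts, gated-MLP reference, and tolerances below are assumptions.

```python
# Hypothetical parity-check sketch: compare a fused MoE output against a
# naive PyTorch reference within a loose tolerance suited to lossy FP4.
import torch
import torch.nn.functional as F

def reference_moe(x, w13, w2, topk_weights, topk_ids):
    # Naive per-token expert loop used only as a correctness oracle.
    # Assumed shapes: w13 [experts, 2, inter, hidden], w2 [experts, hidden, inter].
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for weight, expert in zip(topk_weights[t], topk_ids[t]):
            gate = F.silu(x[t] @ w13[expert, 0].T)
            up = x[t] @ w13[expert, 1].T
            out[t] += weight * ((gate * up) @ w2[expert].T)
    return out

def assert_parity(fused_out, x, w13, w2, topk_weights, topk_ids):
    ref = reference_moe(x, w13, w2, topk_weights, topk_ids)
    # FP4 quantization is lossy, so the tolerance is deliberately loose.
    torch.testing.assert_close(fused_out, ref, rtol=2e-1, atol=2e-1)
```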
Accuracy Tests

server (EP1):
server (EP8):
client:
results (EP1):
results (EP8):
Benchmarking and Profiling
Generally this PR outperforms `--moe-runner flashinfer_cutlass` but not `--moe-runner flashinfer_trtllm`.

Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`