Add dedicated FlashInferCuteDslMoE layer for standard-path FP4 MoE #21339
Merged
ch-wan merged 34 commits into sgl-project:main on Apr 10, 2026
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
…oe_runner Dissolve FlashInferCuteDslMoE into the moe_runner pattern, aligning with the MoE refactor roadmap (sgl-project#8715). CuteDSL FP4 now flows through FusedMoE -> StandardDispatcher -> MoeRunner -> @register_fused_func, matching the flashinfer_trtllm integration and unblocking future A2A backend support.
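For readers unfamiliar with the registration flow that commit describes, here is a minimal sketch of the pattern. The registry, decorator internals, and dataclass fields are illustrative assumptions, not the actual sglang code; only the names `register_fused_func("none", "flashinfer_cutedsl")`, `CuteDslFp4MoeQuantInfo`, and `CuteDslMoEWrapper.run()` come from the PR text.

```python
# Simplified sketch of the moe_runner registration pattern, not the real
# sglang implementation: a fused MoE function is registered under an
# (a2a_backend, runner_backend) key and later looked up by the MoeRunner.
from dataclasses import dataclass
import torch

_FUSED_FUNCS = {}

def register_fused_func(a2a_backend: str, runner_backend: str):
    """Register a fused MoE function under an (a2a, runner) backend pair."""
    def wrap(fn):
        _FUSED_FUNCS[(a2a_backend, runner_backend)] = fn
        return fn
    return wrap

@dataclass
class CuteDslFp4MoeQuantInfo:
    # Per the PR description, this carries weights, scales, and the CuteDSL wrapper.
    w13_weight: torch.Tensor
    w2_weight: torch.Tensor

@register_fused_func("none", "flashinfer_cutedsl")
def cutedsl_fp4_moe(dispatch_output, quant_info: CuteDslFp4MoeQuantInfo, config):
    # Placeholder body: quantize activations to FP4, then invoke the CuteDSL
    # wrapper (CuteDslMoEWrapper.run() in the real integration).
    raise NotImplementedError
```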
Collaborator
@ch-wan could you review this? Thanks!
ch-wan reviewed Apr 7, 2026
```python
    self, layer: torch.nn.Module, moe_runner_config: MoeRunnerConfig
):
    self.moe_runner_config = moe_runner_config
    if self.enable_flashinfer_cutedsl_moe:
```
Collaborator
Should we provide a default runner when it is auto (or not defined in server args)?
Collaborator
Author
Added FLASHINFER_TRTLLM as the default runner for auto: edf3e16
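As a rough illustration of that fallback (the enum and helper names here are hypothetical, not the actual sglang code; the real change is in commit edf3e16):

```python
# Hypothetical sketch of defaulting an "auto" MoE runner backend to
# FLASHINFER_TRTLLM, as discussed above. Names are illustrative only.
from enum import Enum

class MoeRunnerBackend(Enum):
    AUTO = "auto"
    FLASHINFER_TRTLLM = "flashinfer_trtllm"
    FLASHINFER_CUTEDSL = "flashinfer_cutedsl"

def resolve_moe_runner_backend(requested: MoeRunnerBackend) -> MoeRunnerBackend:
    # When the server arg is left as auto (or unset), fall back to the
    # TRT-LLM runner rather than failing.
    if requested is MoeRunnerBackend.AUTO:
        return MoeRunnerBackend.FLASHINFER_TRTLLM
    return requested
```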
Collaborator
/tag-and-rerun-ci
Contributor
/tag-and-rerun-ci (just move test forward, thx~)
Contributor
Hi @leejnau, could you please fix the failed CI case? Thanks! https://github.com/sgl-project/sglang/actions/runs/24141598046/job/70546144580?pr=21339#step:7:2420
ch-wan approved these changes Apr 10, 2026
Fridge003 pushed a commit that referenced this pull request on Apr 11, 2026
pyc96 pushed a commit to pyc96/sglang that referenced this pull request on Apr 14, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request on Apr 22, 2026
Motivation
We want the option of a more standard `--moe-runner-backend flashinfer_cutedsl` backend that is not specific to DeepEP. This PR integrates the Wrapper API exposed here: flashinfer-ai/flashinfer#2398.

Modifications
Server

- Adds `flashinfer_cutedsl` as a modular `moe_runner` backend for `--moe-runner-backend flashinfer_cutedsl` with `modelopt_fp4` quantization on the standard path (`moe_a2a_backend=none`). Implements the `moe_runner` pattern established by `flashinfer_trtllm.py`: a `CuteDslFp4MoeQuantInfo` dataclass carries weights/scales/wrapper, and a `@register_fused_func("none", "flashinfer_cutedsl")` function handles FP4 quantization + `CuteDslMoEWrapper.run()`, aligning with the MoE refactor roadmap ([Roadmap] MoE Refactor #8715).
- Routes through the `FusedMoE` -> `StandardDispatcher` -> `MoeRunner` pipeline instead of a standalone layer class, unblocking future `moe_a2a_backend=flashinfer` support.
- `moe_sort` handles EP with global expert IDs internally, so `skip_local_expert_mapping` is enabled.
- Switched `torch.inference_mode()` to `torch.no_grad()`, since CuteDSL lazily allocates persistent CUDA graph buffers during the first forward pass (see the sketch after this list).
- In `process_weights_after_loading`, a scale-resolution helper (`_resolve_cutedsl_standard_scales`) derives correct per-expert GEMM alphas from scalarized activation scales, handling EP slicing and multiple checkpoint formats.
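To make the `inference_mode` -> `no_grad` point concrete, here is a minimal, self-contained sketch; it is not sglang code, and the buffer allocation merely stands in for CuteDSL's lazy CUDA-graph buffer allocation.

```python
# Minimal illustration of why persistent buffers allocated lazily during the
# first forward pass are safer under no_grad() than inference_mode().
import torch

def lazily_alloc_persistent_buffer() -> torch.Tensor:
    # Stand-in for CuteDSL allocating a persistent CUDA-graph buffer on the
    # first forward pass.
    return torch.zeros(8)

with torch.inference_mode():
    buf = lazily_alloc_persistent_buffer()
# Buffers created here are "inference tensors"; reusing or mutating them in
# later forward passes that run outside inference mode can raise RuntimeErrors.
print(buf.is_inference())  # True

with torch.no_grad():
    buf = lazily_alloc_persistent_buffer()
# Buffers created here are ordinary tensors and remain usable across
# subsequent forward passes.
print(buf.is_inference())  # False
```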
Tests

- `test/registered/moe/test_cutedsl_moe.py`: unit tests for wrapper accuracy vs. a PyTorch reference, CUDA graph parity, and EP-sharded all-reduce correctness (a sketch of this parity check follows the list).
- `test/registered/backends/test_deepseek_v3_fp4_cutedsl_moe.py`: end-to-end GPQA accuracy on DeepSeek-V3 FP4 for both EP=1 and EP=4 configurations (nightly, 4 GPU B200).
- `flashinfer_trtllm` and `flashinfer_cutlass` backends.
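A hypothetical sketch of the reference-parity style of check used in such unit tests; this is not the actual test file, and the weight layouts, gated-MLP reference, and tolerances below are assumptions.

```python
# Hypothetical parity-check sketch: compare a fused MoE output against a
# naive PyTorch reference within a loose tolerance suited to lossy FP4.
import torch
import torch.nn.functional as F

def reference_moe(x, w13, w2, topk_weights, topk_ids):
    # Naive per-token expert loop used only as a correctness oracle.
    # Assumed shapes: w13 [experts, 2, inter, hidden], w2 [experts, hidden, inter].
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for weight, expert in zip(topk_weights[t], topk_ids[t]):
            gate = F.silu(x[t] @ w13[expert, 0].T)
            up = x[t] @ w13[expert, 1].T
            out[t] += weight * ((gate * up) @ w2[expert].T)
    return out

def assert_parity(fused_out, x, w13, w2, topk_weights, topk_ids):
    ref = reference_moe(x, w13, w2, topk_weights, topk_ids)
    # FP4 quantization is lossy, so the tolerance is deliberately loose.
    torch.testing.assert_close(fused_out, ref, rtol=2e-1, atol=2e-1)
```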
Accuracy Tests

server (EP1):
server (EP8):
client:
results (EP1):
results (EP8):
Benchmarking and Profiling
Generally this PR outperforms `--moe-runner flashinfer_cutlass` but not `--moe-runner flashinfer_trtllm`.

Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`