
[Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) #13761

Merged
Fridge003 merged 13 commits into sgl-project:main from samuellees:trtllm-moe-nvfp4
Nov 27, 2025

Conversation

@samuellees
Contributor

PR Dependency

Motivation

Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) on Blackwell GPUs

TODO

  • Add unit test
  • Add command to reproduce accuracy results
  • Refactor this PR (co-work with @kaixih's PR #13556)

Accuracy Tests

# Qwen3-Next: NVFP4 linear + NVFP4 MoE + FP8 Attention
export SGL_ENABLE_JIT_DEEPGEMM=false
python3 -m sglang.launch_server --model-path qwen3-next-80b-a3b-instruct-nvfp4-all --chunked-prefill-size 16384 --max-prefill-tokens 16384 --max-running-requests 512 --tp-size 4 --ep-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 --disable-radix-cache --log-level info --host 0.0.0.0 --port 8001 --random-seed 0 --quantization modelopt_fp4 --kv-cache-dtype fp8_e4m3 --moe-runner-backend flashinfer_trtllm --attention-backend trtllm_mha --mamba-ssm-dtype bfloat16

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 8      | exact_match | 0.9583 | ± 0.0055 |
| gsm8k | 3       | strict-match     | 8      | exact_match | 0.7324 | ± 0.0122 |
# Qwen3-Next: FP8 linear + FP8 MoE + FP8 Attention
export SGL_ENABLE_JIT_DEEPGEMM=false
python3 -m sglang.launch_server --model-path Qwen3-Next/Qwen3-Next-80B-A3B-Instruct-FP8 --chunked-prefill-size 16384 --max-prefill-tokens 16384 --max-running-requests 512 --tp-size 4 --ep-size 4 --mem-fraction-static 0.7 --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 --disable-radix-cache --log-level info --host 0.0.0.0 --port 8001 --random-seed 0 --quantization fp8 --kv-cache-dtype fp8_e4m3 --moe-runner-backend flashinfer_trtllm --attention-backend trtllm_mha --mamba-ssm-dtype bfloat16

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 8      | exact_match | 0.9575 | ± 0.0056 |
| gsm8k | 3       | strict-match     | 8      | exact_match | 0.8370 | ± 0.0102 |
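
The gsm8k tables above are in lm-evaluation-harness output format. One plausible way to reproduce them against the running server is sketched below; this command is not taken from the PR, and the endpoint path, model name, and concurrency settings are assumptions.

```bash
# Hypothetical reproduction sketch (not from this PR): run lm-evaluation-harness gsm8k
# (8-shot) against the sglang server launched above. Model name, endpoint, and
# concurrency are assumptions.
lm_eval --model local-completions \
  --model_args model=qwen3-next-80b-a3b-instruct-nvfp4-all,base_url=http://127.0.0.1:8001/v1/completions,num_concurrent=64,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 8
```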

I also reproduced the Qwen3-30B-A3B NVFP4 accuracy result shown in PR #13556:

Accuracy: 0.899
Invalid: 0.000
Latency: 29.658 s
Output throughput: 5636.438 token/s
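
The Accuracy / Invalid / Latency / Output throughput lines above match the output format of sglang's built-in few-shot gsm8k benchmark script; a plausible invocation is sketched below. The module path, flags, and question count are my assumptions and were not stated in the PR.

```bash
# Assumed invocation (not confirmed by this PR) of sglang's built-in few-shot gsm8k
# benchmark, whose output format matches the Accuracy/Invalid/Latency lines above.
python3 -m sglang.test.few_shot_gsm8k --num-questions 1319 --parallel 1319
```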

Modifications

Benchmarking and Profiling

Checklist

samuellees and others added 3 commits November 21, 2025 21:45
Co-authored-by: Sam Li <lsam@nvidia.com>
Co-authored-by: Kaixi Hou <kaixih@nvidia.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @samuellees, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for NVFP4 Mixture-of-Experts (MoE) for Qwen series models, optimized for Blackwell GPUs. The changes involve adapting the MoE layer to dynamically configure routing parameters and updating the server's argument parsing to correctly handle modelopt_fp4 quantization. A new unit test has been added to ensure the stability and accuracy of this new configuration, demonstrating its readiness for deployment.

Highlights

  • NVFP4 MoE Support: Enabled NVFP4 Mixture-of-Experts (MoE) for Qwen series models, specifically targeting Blackwell GPUs, to enhance performance and efficiency.
  • Dynamic Routing Configuration: Modified the MoE layer to dynamically handle routing methods and correction biases, removing hardcoded values and improving flexibility.
  • Quantization Backend Integration: Updated server arguments to correctly recognize modelopt_fp4 quantization for automatic MoE runner backend selection, ensuring proper configuration.
  • New Unit Test and Validation: Introduced a dedicated unit test to validate the NVFP4 MoE functionality for Qwen3-30B-A3B models, including accuracy verification, and integrated it into the nightly test suite.


@gemini-code-assist Bot left a comment


Code Review

This pull request enables NVFP4 MoE for Qwen series models on Blackwell GPUs. The changes involve updating server arguments to recognize modelopt_fp4 quantization for MoE backend selection and generalizing the MoE layer to support different routing methods. A new nightly test is added to verify the functionality and accuracy. The changes appear correct and well-tested. I've identified a minor code duplication issue in server_args.py that could be refactored for better maintainability.
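
For readers unfamiliar with the routing change mentioned above, the sketch below illustrates the idea of selecting the routing method from the model's MoE configuration instead of hardcoding the DeepSeek-style grouped top-k. This is an illustrative sketch only; the enum and function names are hypothetical stand-ins, not identifiers from the PR.

```python
# Illustrative sketch only; names are hypothetical and do not match the PR's code.
from enum import Enum, auto
from typing import Any, Optional


class RoutingMethod(Enum):
    DEEPSEEK_V3 = auto()   # grouped top-k with an expert-score correction bias
    RENORMALIZE = auto()   # plain softmax top-k, renormalized (Qwen-style routing)


def select_routing_method(use_grouped_topk: bool, correction_bias: Optional[Any]) -> RoutingMethod:
    """Pick the routing method per model config instead of assuming DeepSeek-style routing."""
    if use_grouped_topk and correction_bias is not None:
        return RoutingMethod.DEEPSEEK_V3
    return RoutingMethod.RENORMALIZE


# Qwen3-Next MoE layers route with softmax top-k and no correction bias:
assert select_routing_method(False, None) is RoutingMethod.RENORMALIZE
```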

@samuellees changed the title from "Trtllm moe nvfp4" to "[Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) #13427" on Nov 22, 2025
@samuellees changed the title from "[Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) #13427" to "[Feat][NVFP4] Enable NVFP4 MoE for Qwen series models (eg. Qwen3-Next) #13761" on Nov 22, 2025
Comment thread on python/sglang/srt/server_args.py (outdated):

  self.quantization = quant_method
  if (
-     self.quantization == "fp8"
+     (self.quantization == "fp8" or self.quantization == "modelopt_fp4")

Collaborator:

nit: maybe self.quantization in ("fp8", "modelopt_fp4")?

Contributor Author:

Done
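
For reference, a minimal standalone illustration of the reviewer's suggestion is shown below. The helper name is hypothetical; in the PR, the membership test sits inside a larger condition in server_args.py.

```python
# Minimal illustration of the membership-test style suggested in the review nit above.
# The helper name is a hypothetical stand-in; the real change is inline in server_args.py.
def uses_trtllm_capable_quant(quantization: str) -> bool:
    # equivalent to: quantization == "fp8" or quantization == "modelopt_fp4"
    return quantization in ("fp8", "modelopt_fp4")


assert uses_trtllm_capable_quant("modelopt_fp4")
assert not uses_trtllm_capable_quant("awq")
```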

@b8zhong b8zhong added the run-ci label Nov 23, 2025
@Fridge003 Fridge003 merged commit 91e8dc3 into sgl-project:main Nov 27, 2025
108 of 128 checks passed
@samuellees samuellees deleted the trtllm-moe-nvfp4 branch November 27, 2025 01:24
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025
@samuellees samuellees mentioned this pull request Feb 11, 2026