
[Model] Support Nemotron 3 Super NVFP4#20407

Merged
Fridge003 merged 2 commits into sgl-project:main from mmangkad-dev:support-nemotron-3-super-nvfp4
Mar 14, 2026

Conversation

Contributor

@mmangkad mmangkad commented Mar 12, 2026

Summary

Support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 via modelopt_mixed

Fixes #20472

Accuracy Tests

Without MTP

sglang serve --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --tensor-parallel-size 2 --reasoning-parser nemotron_3 --tool-call-parser qwen3_coder --trust-remote-code --disable-radix-cache
python -m sglang.test.run_eval --base-url http://localhost:30000 --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --eval-name gsm8k --num-examples 1319 --repeat 1 --num-threads 512 --num-shots 5 --max-tokens 16000 --temperature 1.0 --top-p 0.95 --chat-template-kwargs '{"enable_thinking": true}'

ChatCompletionSampler initialized with self.system_message=None self.temperature=1.0 self.max_tokens=16000 self.reasoning_effort=None self.extra_body={'chat_template_kwargs': {'enable_thinking': True}}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1314/1314 [03:03<00:00,  7.15it/s]
Total latency: 183.893 s
Score: 0.969
[METRIC] gsm8k_score=0.9687975646879756 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
[METRIC] gsm8k_latency=183.893038809 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
Writing report to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.html
{'score:std': np.float64(0.17386443955744169), 'score': np.float64(0.9687975646879756)}
Writing results to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.json
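As a side note, the reported `score:std` is consistent with the score being the mean of 1314 binary pass/fail results: for a pass rate p, the population standard deviation (numpy's default, ddof=0) of a 0/1 array is sqrt(p * (1 - p)). A quick check, under that assumption:

```python
import math

# Reported gsm8k score (without MTP) from the log above.
p = 0.9687975646879756

# For binary per-question scores, the population std is sqrt(p * (1 - p)).
std = math.sqrt(p * (1 - p))
print(f"{std:.17f}")  # matches the reported score:std of 0.17386443955744169
```

This is only a consistency check on the harness output, not part of the eval itself.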

With MTP

sglang serve --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --tensor-parallel-size 2 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --reasoning-parser nemotron_3 --tool-call-parser qwen3_coder --trust-remote-code --disable-radix-cache
python -m sglang.test.run_eval --base-url http://localhost:30000 --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --eval-name gsm8k --num-examples 1319 --repeat 1 --num-threads 48 --num-shots 5 --max-tokens 16000 --temperature 1.0 --top-p 0.95 --chat-template-kwargs '{"enable_thinking": true}'

ChatCompletionSampler initialized with self.system_message=None self.temperature=1.0 self.max_tokens=16000 self.reasoning_effort=None self.extra_body={'chat_template_kwargs': {'enable_thinking': True}}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1314/1314 [05:08<00:00,  4.25it/s]
Total latency: 308.988 s
Score: 0.971
[METRIC] gsm8k_score=0.9710806697108066 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
[METRIC] gsm8k_latency=308.98772795 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
Writing report to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.html
{'score:std': np.float64(0.1675798395536225), 'score': np.float64(0.9710806697108066)}
Writing results to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.json
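The EAGLE flags in the MTP command follow a commonly used sizing relation for speculative decoding (assumed here as a rule of thumb, not quoted from sglang documentation): with top-k 1, each draft step proposes one token and the verify batch also covers the root token, so the draft-token budget is `num_steps * topk + 1`. A minimal sanity check of the values used above:

```python
# Flags from the MTP serve command above.
speculative_num_steps = 3
speculative_eagle_topk = 1
speculative_num_draft_tokens = 4

# Assumed sizing rule: draft chain length times branching, plus the root token.
assert speculative_num_steps * speculative_eagle_topk + 1 == speculative_num_draft_tokens
print("draft-token budget consistent:", speculative_num_draft_tokens)
```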


@github-actions github-actions Bot added the quant LLM Quantization label Mar 12, 2026
Contributor Author

mmangkad commented Mar 12, 2026

/rerun-failed-ci

Collaborator

@b8zhong b8zhong left a comment

QQ: by w4afp8 do we mean nvfp4 x fp8? Usually it means weights in int4 and activations in fp8, so it would be good to clarify.

@mmangkad
Contributor Author

QQ: by w4afp8 do we mean nvfp4 x fp8? Usually it means weights in int4 and activations in fp8, so it would be good to clarify.

The existing w4afp8 path uses int4 weights + fp8 activations; modelopt_mixed is per-layer mixed precision (fp4 weights + fp4 activations, or fp8 weights + fp8 activations, chosen per layer).
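The distinction above can be sketched as follows. This is an illustrative toy, with hypothetical names (`modelopt_mixed_scheme`, `fp8_layers`) that are not sglang's actual API: the point is that w4afp8 is one global scheme, while a mixed checkpoint carries a (weight, activation) dtype pair per layer.

```python
# w4afp8: a single global scheme applied to every quantized layer.
W4AFP8 = {"weight": "int4", "activation": "fp8"}

def modelopt_mixed_scheme(layer_name: str, fp8_layers: set) -> dict:
    """Hypothetical per-layer selection: fp8 x fp8 for layers the checkpoint
    marks as fp8, fp4 x fp4 (NVFP4) for everything else."""
    if layer_name in fp8_layers:
        return {"weight": "fp8", "activation": "fp8"}
    return {"weight": "fp4", "activation": "fp4"}

# Hypothetical checkpoint metadata listing the fp8 layers.
fp8_layers = {"model.layers.0.mixer"}
print(modelopt_mixed_scheme("model.layers.0.mixer", fp8_layers))   # fp8 x fp8
print(modelopt_mixed_scheme("model.layers.65.mixer", fp8_layers))  # fp4 x fp4
```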

@nvpohanh
Collaborator

This fixes #20472

@Fridge003 Fridge003 disabled auto-merge March 14, 2026 07:56
@Fridge003 Fridge003 merged commit 75a7879 into sgl-project:main Mar 14, 2026
327 of 361 checks passed
@mmangkad mmangkad deleted the support-nemotron-3-super-nvfp4 branch March 14, 2026 10:52
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Mar 15, 2026
@puppetm4st3r
Hi, I need to understand something: why does the run with MTP show roughly 1.7x higher total latency?
Total latency: 183.893 s
vs
Total latency: 308.988 s
Regards!

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

quant LLM Quantization run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] sm12X NVIDIA-Nemotron-3-Super-NVFP4 - KeyError: 'model.layers.65.mixer.experts.w2_weight_scale'

5 participants