
[Model] Support Nemotron 3 Super NVFP4#20407

Merged
Fridge003 merged 2 commits into sgl-project:main from mmangkad-dev:support-nemotron-3-super-nvfp4
Mar 14, 2026

Conversation

Contributor

@mmangkad mmangkad commented Mar 12, 2026

Summary

Support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 via modelopt_mixed

Fixes #20472

Accuracy Tests

Without MTP

sglang serve --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --tensor-parallel-size 2 --reasoning-parser nemotron_3 --tool-call-parser qwen3_coder --trust-remote-code --disable-radix-cache
python -m sglang.test.run_eval --base-url http://localhost:30000 --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --eval-name gsm8k --num-examples 1319 --repeat 1 --num-threads 512 --num-shots 5 --max-tokens 16000 --temperature 1.0 --top-p 0.95 --chat-template-kwargs '{"enable_thinking": true}'

ChatCompletionSampler initialized with self.system_message=None self.temperature=1.0 self.max_tokens=16000 self.reasoning_effort=None self.extra_body={'chat_template_kwargs': {'enable_thinking': True}}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1314/1314 [03:03<00:00,  7.15it/s]
Total latency: 183.893 s
Score: 0.969
[METRIC] gsm8k_score=0.9687975646879756 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
[METRIC] gsm8k_latency=183.893038809 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
Writing report to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.html
{'score:std': np.float64(0.17386443955744169), 'score': np.float64(0.9687975646879756)}
Writing results to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.json
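As a side note, the reported `score:std` is consistent with the score being the mean of 1314 binary pass/fail results: for a pass rate p, the population standard deviation (numpy's default, ddof=0) of a 0/1 array is sqrt(p * (1 - p)). A quick check, under that assumption:

```python
import math

# Reported gsm8k score (without MTP) from the log above.
p = 0.9687975646879756

# For binary per-question scores, the population std is sqrt(p * (1 - p)).
std = math.sqrt(p * (1 - p))
print(f"{std:.17f}")  # matches the reported score:std of 0.17386443955744169
```

This is only a consistency check on the harness output, not part of the eval itself.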

With MTP

sglang serve --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --tensor-parallel-size 2 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --reasoning-parser nemotron_3 --tool-call-parser qwen3_coder --trust-remote-code --disable-radix-cache
python -m sglang.test.run_eval --base-url http://localhost:30000 --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --eval-name gsm8k --num-examples 1319 --repeat 1 --num-threads 48 --num-shots 5 --max-tokens 16000 --temperature 1.0 --top-p 0.95 --chat-template-kwargs '{"enable_thinking": true}'

ChatCompletionSampler initialized with self.system_message=None self.temperature=1.0 self.max_tokens=16000 self.reasoning_effort=None self.extra_body={'chat_template_kwargs': {'enable_thinking': True}}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1314/1314 [05:08<00:00,  4.25it/s]
Total latency: 308.988 s
Score: 0.971
[METRIC] gsm8k_score=0.9710806697108066 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
[METRIC] gsm8k_latency=308.98772795 labels={"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", "eval": "gsm8k"}
Writing report to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.html
{'score:std': np.float64(0.1675798395536225), 'score': np.float64(0.9710806697108066)}
Writing results to /tmp/gsm8k_nvidia_NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4.json
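The EAGLE flags in the MTP command follow a commonly used sizing relation for speculative decoding (assumed here as a rule of thumb, not quoted from sglang documentation): with top-k 1, each draft step proposes one token and the verify batch also covers the root token, so the draft-token budget is `num_steps * topk + 1`. A minimal sanity check of the values used above:

```python
# Flags from the MTP serve command above.
speculative_num_steps = 3
speculative_eagle_topk = 1
speculative_num_draft_tokens = 4

# Assumed sizing rule: draft chain length times branching, plus the root token.
assert speculative_num_steps * speculative_eagle_topk + 1 == speculative_num_draft_tokens
print("draft-token budget consistent:", speculative_num_draft_tokens)
```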


@github-actions github-actions Bot added the quant LLM Quantization label Mar 12, 2026
Contributor Author

mmangkad commented Mar 12, 2026

/rerun-failed-ci

Collaborator

@b8zhong b8zhong left a comment

QQ: by w4afp8 do we mean nvfp4 x fp8? Usually it means weights in int4 and activations in fp8, so it would be good to clarify.

@mmangkad
Contributor Author

QQ: by w4afp8 do we mean nvfp4 x fp8? Usually it means weights in int4 and activations in fp8, so it would be good to clarify.

The existing w4afp8 path uses int4 weights + fp8 activations; modelopt_mixed is per-layer mixed precision (fp4 weights + fp4 activations, or fp8 weights + fp8 activations, chosen per layer).
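The distinction above can be sketched as follows. This is an illustrative toy, with hypothetical names (`modelopt_mixed_scheme`, `fp8_layers`) that are not sglang's actual API: the point is that w4afp8 is one global scheme, while a mixed checkpoint carries a (weight, activation) dtype pair per layer.

```python
# w4afp8: a single global scheme applied to every quantized layer.
W4AFP8 = {"weight": "int4", "activation": "fp8"}

def modelopt_mixed_scheme(layer_name: str, fp8_layers: set) -> dict:
    """Hypothetical per-layer selection: fp8 x fp8 for layers the checkpoint
    marks as fp8, fp4 x fp4 (NVFP4) for everything else."""
    if layer_name in fp8_layers:
        return {"weight": "fp8", "activation": "fp8"}
    return {"weight": "fp4", "activation": "fp4"}

# Hypothetical checkpoint metadata listing the fp8 layers.
fp8_layers = {"model.layers.0.mixer"}
print(modelopt_mixed_scheme("model.layers.0.mixer", fp8_layers))   # fp8 x fp8
print(modelopt_mixed_scheme("model.layers.65.mixer", fp8_layers))  # fp4 x fp4
```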

@nvpohanh
Collaborator

This fixes #20472

@Fridge003 Fridge003 disabled auto-merge March 14, 2026 07:56
@Fridge003 Fridge003 merged commit 75a7879 into sgl-project:main Mar 14, 2026
327 of 361 checks passed
@mmangkad mmangkad deleted the support-nemotron-3-super-nvfp4 branch March 14, 2026 10:52
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Mar 15, 2026
@puppetm4st3r
Hi, I need to understand something: why does the run with MTP show roughly 1.7x higher total latency?
Total latency: 183.893 s
vs
Total latency: 308.988 s
Regards!

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

quant LLM Quantization run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] sm12X NVIDIA-Nemotron-3-Super-NVFP4 - KeyError: 'model.layers.65.mixer.experts.w2_weight_scale'

5 participants