
[nvidia] Gemma4 nvfp4 fix #22079

Merged
ispobock merged 6 commits into sgl-project:main from wenscarl:gemma4-nvfp4-fix
Apr 10, 2026

Conversation

@wenscarl
Collaborator

wenscarl commented Apr 3, 2026

Based on #21952 and depends on flashinfer-ai/flashinfer#2959

Motivation

Gemma 4 NVFP4 checkpoints do not work on GB200 for the following reasons:

Triton attention kernel — PTX register exhaustion

When running Gemma4 with the triton attention backend on GB200, the engine crashes during prefill:

triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
ptxas fatal: Register allocation failed with register count of '255'.

Root cause: _get_block_sizes_for_extend_attention had no dedicated branch for CUDA_CAPABILITY[0] == 10 (GB200/B200/sm_100a). sm_100a fell into the >= 9 Hopper catch-all, selecting BLOCK_M=32, BLOCK_N=64, num_warps=8 for Lq > 256. Gemma4 uses a global head dim of 512, so this config is always hit for global attention layers.

The crash is specifically triggered when the KV cache dtype is fp8 — which Gemma4-NVFP4 enables automatically via quant_config.kv_cache_quant_algo = "FP8". The fp8 dequantization instructions in the kernel body increase register pressure enough to push over sm_100a's ptxas allocation limit. The same crash reproduces with any bf16 model that explicitly sets kv_cache_dtype=fp8_e4m3 on GB200.
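For reference, a minimal repro sketch of the bf16 + fp8-KV-cache variant mentioned above. The model path is a placeholder, and the Engine keyword arguments are assumed to mirror the --attention-backend / --kv-cache-dtype server flags; adjust to your SGLang version.

```python
import sglang as sgl

# Placeholder checkpoint: any bf16 model with large head dims, per the note above.
llm = sgl.Engine(
    model_path="<bf16-model-with-large-head-dim>",
    attention_backend="triton",
    kv_cache_dtype="fp8_e4m3",  # the setting Gemma4-NVFP4 enables automatically
)

# Before this fix, prefill on GB200 aborted with the PTXASError shown above.
llm.generate("hello", {"max_new_tokens": 8})
llm.shutdown()
```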

Modifications

In extend_attention.py: Add a dedicated CUDA_CAPABILITY[0] == 10 branch before the >= 9 Hopper catch-all with smaller tile sizes (BLOCK_M=16, BLOCK_N=64 for Lq > 256) to stay within the sm_100a register budget.
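For illustration, a simplified sketch of the branch ordering this introduces. The helper name and signature are hypothetical; the real _get_block_sizes_for_extend_attention also chooses num_warps and covers more head-dim, dtype, and architecture cases.

```python
def pick_extend_attention_blocks(Lq: int, cuda_capability: tuple) -> dict:
    """Hypothetical, simplified sketch of the tile-size selection."""
    if cuda_capability[0] == 10 and Lq > 256:
        # Dedicated sm_100a (GB200/B200) branch, checked before the Hopper
        # catch-all: the smaller BLOCK_M keeps the kernel under ptxas'
        # 255-register budget even with fp8 KV-cache dequantization in the
        # kernel body.
        return {"BLOCK_M": 16, "BLOCK_N": 64}
    if cuda_capability[0] >= 9 and Lq > 256:
        # Hopper catch-all that sm_100a previously fell into.
        return {"BLOCK_M": 32, "BLOCK_N": 64, "num_warps": 8}
    raise NotImplementedError("remaining cases omitted from this sketch")
```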

Accuracy Tests

Tested on GB200 with nvidia/Gemma-4-31B-IT-NVFP4 + triton attention backend. Script completes without exception and produces correct output.
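A hedged sketch of that kind of check (not the exact test script; Engine kwargs are assumed to mirror the server CLI flags, so adjust to your SGLang build):

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="nvidia/Gemma-4-31B-IT-NVFP4",
    attention_backend="triton",  # exercises the extend-attention Triton kernel on sm_100a
)

# Pre-fix this crashed during prefill with the PTXASError; post-fix it should
# return a sensible completion.
out = llm.generate(
    "The capital of France is",
    {"temperature": 0.0, "max_new_tokens": 16},
)
print(out["text"])
llm.shutdown()
```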

Speed Tests and Profiling

cc. @nvpohanh

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@kpham-sgl
Collaborator

/tag-and-rerun-ci

github-actions bot added the run-ci label Apr 7, 2026
kpham-sgl self-assigned this Apr 7, 2026
@kpham-sgl
Collaborator

kpham-sgl commented Apr 8, 2026

/tag-and-rerun-ci again

Comment thread on python/sglang/srt/layers/attention/triton_ops/extend_attention.py
@kpham-sgl
Collaborator

kpham-sgl commented Apr 8, 2026

/rerun-failed-ci again

@kpham-sgl
Collaborator

kpham-sgl commented Apr 9, 2026

/rerun-failed-ci one

@jeremylea

Any reason this isn't handling sm_120a (RTX 6000)?

ispobock merged commit 5638d40 into sgl-project:main Apr 10, 2026
194 of 233 checks passed
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
@baoskee

baoskee commented May 7, 2026

hey, did you guys test this with the docker images? It does not work for:

docker pull lmsysorg/sglang:cu13-gemma4 # CUDA 13

I'm getting:

File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 115, in forward return self.forward_extend( ^^^^^^^^^^^^^^^^^^^^ File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 936, in forward_extend self.extend_attention_fwd( File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/extend_attention.py", line 609, in extend_attention_fwd _fwd_kernel[grid]( File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 419, in <lambda> return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 733, in run kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 861, in _do_compile kernel = self.compile(src, target=target, options=options.__dict__) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 320, in compile next_module = compile_ir(module, metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 520, in <lambda> stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.target.arch) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py", line 503, in make_cubin raise PTXASError(error) triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error ptxas stderr: ptxas fatal : (C7600) Register allocation failed with register count of '255'. Compile the program with a higher register target ptxas fatal : Ptx assembly aborted due to errors Repro command: /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_100a /tmp/tmp8ca_n_mf.ptx -o /tmp/tmp8ca_n_mf.ptx.o

@nvpohanh
Collaborator

nvpohanh commented May 7, 2026

@baoskee that container is quite old. could you try the latest dev-cu13 container?

