[nvidia] Gemma4 nvfp4 fix #22079
Merged
ispobock merged 6 commits into sgl-project:main on Apr 10, 2026
Conversation
Force-pushed 0fc27e8 to 9d27ff0
Force-pushed 9d27ff0 to be6a1d0
… large head dims and default to trtllm_mha on sm100.
Force-pushed be6a1d0 to 6f8beef
kpham-sgl approved these changes on Apr 7, 2026
Collaborator: /tag-and-rerun-ci
Collaborator: /tag-and-rerun-ci again
ispobock approved these changes on Apr 8, 2026
alexnails reviewed on Apr 8, 2026
alexnails approved these changes on Apr 8, 2026
Collaborator: /rerun-failed-ci again
Collaborator: /rerun-failed-ci one
Any reason this isn't handling sm_120a (RTX 6000)?
Fridge003 pushed a commit that referenced this pull request on Apr 11, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request on Apr 22, 2026
Hey, did you guys test this with the Docker images? It does not work for: I'm getting:
Collaborator: @baoskee that container is quite old. Could you try the latest
Based on #21952 and depends on flashinfer-ai/flashinfer#2959
Motivation
Gemma 4 NVFP4 checkpoints do not work on GB200 for the following reasons:
Triton attention kernel — PTX register exhaustion
When running Gemma4 with the triton attention backend on GB200, the engine crashes during prefill:
Root cause:
_get_block_sizes_for_extend_attention had no dedicated branch for CUDA_CAPABILITY[0] == 10 (GB200/B200/sm_100a). sm_100a fell into the >= 9 Hopper catch-all, selecting BLOCK_M=32, BLOCK_N=64, num_warps=8 for Lq > 256. Gemma4 uses a global head dim of 512, so this config is always hit for global attention layers.
The crash is specifically triggered when the KV cache dtype is fp8, which Gemma4-NVFP4 enables automatically via quant_config.kv_cache_quant_algo = "FP8". The fp8 dequantization instructions in the kernel body increase register pressure enough to push over sm_100a's ptxas allocation limit. The same crash reproduces with any bf16 model that explicitly sets kv_cache_dtype=fp8_e4m3 on GB200.
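For concreteness, a minimal repro sketch of that second trigger. It assumes sglang's offline `Engine` accepts the same `attention_backend` and `kv_cache_dtype` options as the server CLI; the model name is just an example of a bf16 checkpoint and is not taken from this PR:

```python
# Hedged repro sketch: on a pre-fix GB200 build, forcing the fp8 KV cache
# together with the triton backend pushes prefill over the sm_100a register
# limit in ptxas (per the description above, any bf16 model reproduces this).
import sglang as sgl

llm = sgl.Engine(
    model_path="google/gemma-3-27b-it",  # assumption: any bf16 checkpoint works here
    attention_backend="triton",
    kv_cache_dtype="fp8_e4m3",           # what Gemma4-NVFP4 enables automatically
)
print(llm.generate("Hello from GB200", {"max_new_tokens": 8}))
llm.shutdown()
```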
Modifications
In extend_attention.py: add a dedicated CUDA_CAPABILITY[0] == 10 branch before the >= 9 Hopper catch-all with smaller tile sizes (BLOCK_M=16, BLOCK_N=64 for Lq > 256) to stay within the sm_100a register budget.
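A minimal sketch of the resulting dispatch, assuming a simplified signature; only the two Lq > 256 tile configs below come from this PR, and the real helper also chooses num_warps and covers many more cases:

```python
def pick_extend_tiles(capability_major: int, lq: int) -> tuple[int, int]:
    """Simplified sketch of the (BLOCK_M, BLOCK_N) selection in extend_attention.py."""
    if capability_major == 10 and lq > 256:
        # New dedicated sm_100a branch: the smaller BLOCK_M keeps the kernel
        # within the register budget even with fp8 KV dequant in the body.
        return 16, 64
    if capability_major >= 9 and lq > 256:
        # Hopper catch-all (num_warps=8 in the real helper) that sm_100a
        # previously fell into, exhausting registers under ptxas.
        return 32, 64
    raise NotImplementedError("remaining capability/head-dim cases elided in this sketch")
```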
Accuracy Tests
Tested on GB200 with nvidia/Gemma-4-31B-IT-NVFP4 and the triton attention backend. The test script completes without exception and produces correct output.
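For reference, a hedged sketch of how such a check might be driven, reusing the assumed Engine API from the repro sketch above; the NVFP4 checkpoint turns on the fp8 KV cache by itself, so no explicit kv_cache_dtype is needed:

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="nvidia/Gemma-4-31B-IT-NVFP4",
    attention_backend="triton",
)
# With the dedicated sm_100a branch, this prefill stays inside the register budget.
print(llm.generate("The capital of France is", {"max_new_tokens": 8}))
llm.shutdown()
```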
Speed Tests and Profiling
cc @nvpohanh
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci