[NVIDIA] [GDN] Enable FlashInfer MTP verify on SM100+ (Blackwell)#23273
Merged
mmangkad
added a commit
to mmangkad/sglang
that referenced
this pull request
Apr 28, 2026
…fy on SM100+ (Blackwell)
Resolved conflicts with PR sgl-project#22921:
- gdn_flashinfer.py: combined module and class docstrings to reflect that SM100+ now supports decode, prefill, and MTP verify.
- gdn_flashinfer.py target_verify: dropped the SM100+ NotImplementedError guard so the pool-API MTP path runs on both SM90 and SM100+.
- server_args.py: kept the bf16 gate from sgl-project#22921 and removed the speculative_algorithm gate now that MTP verify is supported on SM100+.
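The `server_args.py` gating described in this commit message can be illustrated with a minimal sketch. The function names, the `sm_major` parameter, and the Triton fallback are assumptions made for illustration and are not the actual SGLang code; only the bf16 gate and the removed `speculative_algorithm is None` condition come from the PR.

```python
from typing import Optional


def default_backend_before(
    sm_major: int, ssm_dtype: str, speculative_algorithm: Optional[str]
) -> str:
    """Hypothetical auto-default before this PR: FlashInfer only without MTP."""
    is_sm100plus = sm_major >= 10  # SM100 corresponds to compute capability 10.x
    if is_sm100plus and ssm_dtype == "bfloat16" and speculative_algorithm is None:
        return "flashinfer"
    return "triton"  # assumed fallback for every other case


def default_backend_after(
    sm_major: int, ssm_dtype: str, speculative_algorithm: Optional[str]
) -> str:
    """Hypothetical auto-default after this PR: the speculative gate is gone."""
    is_sm100plus = sm_major >= 10
    # The bf16 gate is kept; speculative_algorithm no longer affects the choice.
    if is_sm100plus and ssm_dtype == "bfloat16":
        return "flashinfer"
    return "triton"  # assumed fallback for every other case


if __name__ == "__main__":
    for spec in (None, "NEXTN"):
        print(
            spec,
            default_backend_before(10, "bfloat16", spec),
            default_backend_after(10, "bfloat16", spec),
        )
```

Under these assumptions, the only input whose result changes is SM100+ with bf16 state and a speculative algorithm set, which is the case the existing Triton MTP test has to pin explicitly.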
mmangkad
added a commit
to mmangkad/sglang
that referenced
this pull request
Apr 28, 2026
PR sgl-project#22921 renamed the SM-gating attribute from use_state_pool to is_sm100plus (updating all existing call sites). PR sgl-project#23273 was authored against the older name and added a new reference in the bf16 MTP adapter setup. The git auto-merge picked up sgl-project#22921's renames and sgl-project#23273's new block, leaving a single dangling use_state_pool access that crashed at FlashInferGDNKernel.__init__. Rename the one remaining reference to is_sm100plus to match the rest of the class.
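The failure mode described in this commit message, a renamed attribute with one stale reference surviving a clean auto-merge, can be reproduced in isolation. The class below is a hypothetical stand-in and not `FlashInferGDNKernel` itself; `adapter_flag` and the `use_stale_name` switch exist only for the demonstration.

```python
class GDNKernelSketch:
    """Hypothetical stand-in for a kernel wrapper whose gating attribute was renamed."""

    def __init__(self, sm_major: int, use_stale_name: bool = False):
        # After the rename, only is_sm100plus is ever assigned.
        self.is_sm100plus = sm_major >= 10
        if use_stale_name:
            # Stale reference left by the merge: the attribute no longer exists,
            # so Python raises AttributeError during construction.
            self.adapter_flag = self.use_state_pool
        else:
            # The fix: read the renamed attribute instead.
            self.adapter_flag = self.is_sm100plus


if __name__ == "__main__":
    try:
        GDNKernelSketch(10, use_stale_name=True)
    except AttributeError as exc:
        print("crash at __init__:", exc)
    print("fixed:", GDNKernelSketch(10).adapter_flag)
```

Because the merge applied without textual conflicts, nothing flags the stale name until the constructor actually runs, which is why the crash surfaced at initialization.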
willhu-jpg
added a commit
to modal-labs/sglang
that referenced
this pull request
May 15, 2026
…fy on SM100+ (Blackwell)
Collaborator
@wenscarl Could you rebase so that I can trigger CI? Thanks.
Collaborator
/tag-and-rerun-ci
Collaborator
https://github.com/sgl-project/sglang/actions/runs/26224310402/job/77302047639?pr=23273 Seeing this repeatedly 🤔
Collaborator
This has been fixed by #25958. @wenscarl let's merge the latest main. Thanks.
Fridge003
reviewed
May 27, 2026
YAMY1234
added a commit
to wenscarl/sglang
that referenced
this pull request
May 29, 2026
…backend Locks down the existing Qwen3.5 NVFP4 MTP test to Triton backend so the Triton coverage is preserved after this PR removes the `speculative_algorithm is None` guard from the SM100+ FlashInfer auto-default, and adds a parallel test class that explicitly exercises the new FlashInfer GDN MTP verify path. Addresses reviewer comment on PR sgl-project#23273. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
/rerun-failed-ci
Collaborator
The H20 failure is a known issue fixed by #26883. @Fridge003 could we merge this?
Fridge003
approved these changes
Jun 2, 2026
xjpang
pushed a commit
to xjpang/sglang
that referenced
this pull request
Jun 2, 2026
…l-project#23273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mqhc2020
pushed a commit
to mqhc2020/sglang
that referenced
this pull request
Jun 2, 2026
…l-project#23273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hanming-lu
pushed a commit
that referenced
this pull request
Jun 3, 2026
…3273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
willhu-jpg
pushed a commit
to modal-projects/sglang
that referenced
this pull request
Jun 3, 2026
…l-project#23273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alphabetc1
pushed a commit
to alphabetc1/sglang
that referenced
this pull request
Jun 4, 2026
…l-project#23273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeynmann
pushed a commit
to jeynmann/sglang
that referenced
this pull request
Jun 4, 2026
…l-project#23273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
edwingao28
pushed a commit
to edwingao28/sglang
that referenced
this pull request
Jun 7, 2026
…l-project#23273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
monkeyLoveding
pushed a commit
to monkeyLoveding/sglang_open
that referenced
this pull request
Jun 9, 2026
…l-project#23273) Co-authored-by: Yangmin Li <yangminl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[GDN] Enable FlashInfer MTP verify on SM100+ (Blackwell)
co-authored by @YAMY1234 (main contributor)
Summary
Enables FlashInfer GDN MTP (speculative decoding) verify on SM100+ (Blackwell) hardware, which previously raised `NotImplementedError`. SM90 (Hopper) MTP was already supported; this PR completes SM100+ coverage.
Root cause: `target_verify` guarded on `use_state_pool`, blocking SM100+ even though the FlashInfer `gated_delta_rule_mtp` kernel already accepts `initial_state_indices` (pool API), the same API used by the SM90 path.
Changes (2 files, ~15 lines):
- `gdn_flashinfer.py`: remove the `use_state_pool` guard in `target_verify`; unify SM90 + SM100+ into a single pool-API path; add an `A_log.detach().float()` cast (matches the SM100+ decode path, no-op on SM90).
- `server_args.py`: remove `and self.speculative_algorithm is None` from the SM100+ FlashInfer auto-default. FlashInfer is now safe to default on SM100+ regardless of whether MTP is enabled.
Accuracy (Qwen3.5-397B-A17B-NVFP4, B200)
gsm8k (TODO: examples, baseline threshold: 0.95)
GPQA Diamond (TODO: examples, repeat=8, temperature=0.6)
Throughput Benchmark (GB200, Qwen3.5-397B-A17B-NVFP4, TP=4)
Focus: long output sequence length (OSL), where per-step GDN state-update cost is most significant.
Server settings:
- `--tp-size 4 --max-running-requests 128`
- `--mamba-ssm-dtype bfloat16 --mamba-scheduler-strategy no_buffer --mamba-track-interval 128`
- `--attention-backend trtllm_mha --linear-attn-decode-backend <triton|flashinfer>`
- `--speculative-algorithm NEXTN` (MTP runs)
- `--disable-radix-cache --quantization modelopt_fp4`
Benchmark settings:
- `--dataset-name random --random-input-len 32 --random-output-len <512|1024|2048|4096>`
- `--num-prompts <varied> --request-rate inf`
Decode throughput (w/ MTP), output throughput (tok/s), ISL=32
acc len: 3.13-3.22
num_prompts: 256
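To reproduce a run like the one above, the server and benchmark settings can be assembled programmatically. This is a minimal sketch: the flags are copied from the settings listed in this description, while `MODEL_PATH`, the helper names, and the use of the `sglang.launch_server` and `sglang.bench_serving` entry points are assumptions about how the runs were launched, since the PR does not show the full commands.

```python
import shlex

# Placeholder: the PR names the model but not the path used to load it.
MODEL_PATH = "<path-to-Qwen3.5-397B-A17B-NVFP4>"


def server_cmd(gdn_backend: str, mtp: bool = True) -> list[str]:
    """Build a launch command for one GDN decode backend (hypothetical helper)."""
    if gdn_backend not in ("triton", "flashinfer"):
        raise ValueError(f"unknown backend: {gdn_backend}")
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", MODEL_PATH,
        "--tp-size", "4",
        "--max-running-requests", "128",
        "--mamba-ssm-dtype", "bfloat16",
        "--mamba-scheduler-strategy", "no_buffer",
        "--mamba-track-interval", "128",
        "--attention-backend", "trtllm_mha",
        "--linear-attn-decode-backend", gdn_backend,
        "--disable-radix-cache",
        "--quantization", "modelopt_fp4",
    ]
    if mtp:
        # Only the MTP runs enable speculative decoding.
        cmd += ["--speculative-algorithm", "NEXTN"]
    return cmd


def bench_cmd(output_len: int, num_prompts: int = 256) -> list[str]:
    """Build the random-dataset benchmark command with ISL fixed at 32."""
    return [
        "python", "-m", "sglang.bench_serving",
        "--dataset-name", "random",
        "--random-input-len", "32",
        "--random-output-len", str(output_len),
        "--num-prompts", str(num_prompts),
        "--request-rate", "inf",
    ]


if __name__ == "__main__":
    for backend in ("triton", "flashinfer"):
        print(shlex.join(server_cmd(backend)))
    for osl in (512, 1024, 2048, 4096):
        print(shlex.join(bench_cmd(osl)))
```

Keeping every flag except `--linear-attn-decode-backend` identical across the two server commands isolates the GDN backend as the only variable in the comparison.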
Mean TPOT (ms/tok), ISL=32, OSL=512
with flashinfer-ai/flashinfer#3147 and
Requirements
The traces were collected at ISL=32, OSL=512, CC=64.
FlashInfer: [trace image]
Triton: [trace image]

CI States
Latest PR Test (Base): ❌ Run #26703045174
Latest PR Test (Extra): ❌ Run #26703045140