[NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) by kaixih · Pull Request #22921 · sgl-project/sglang

kaixih · 2026-04-16T03:09:32Z

[GDN] Add FlashInfer prefill support for SM100+ (Blackwell)

Summary

Extends FlashInfer GDN kernel support to the prefill/extend path on SM100+
(Blackwell) hardware, which previously raised NotImplementedError. SM90 (Hopper)
prefill was already supported; this PR completes SM100+ coverage.

Accuracy (Qwen3.5-397B-A17B-NVFP4, B200)

gsm8k (200 examples, baseline threshold: 0.95)

| Backend | Score |
| --- | --- |
| Triton (prefill + decode) | 0.985 |
| FlashInfer (prefill + decode) | 0.985 |

GPQA diamond (198 examples, repeat=8, temperature=0.6)

| Backend | Scores | Mean |
| --- | --- | --- |
| FlashInfer (prefill + decode) | 0.848, 0.879, 0.904, 0.879, 0.848, 0.864, 0.869, 0.869 | 0.870 |
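The reported mean can be reproduced from the eight per-run scores. A minimal sketch; the variable names are illustrative and not part of the PR:

```python
from statistics import mean

# Per-run GPQA diamond scores copied from the table above (repeat=8).
gpqa_scores = [0.848, 0.879, 0.904, 0.879, 0.848, 0.864, 0.869, 0.869]

gpqa_mean = mean(gpqa_scores)
print(f"mean = {gpqa_mean:.3f}")
print(f"min = {min(gpqa_scores)}, max = {max(gpqa_scores)}")
```

The spread between the lowest and highest run gives a rough sense of how much run-to-run variation to expect at temperature 0.6.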

Throughput Benchmark (B200, Qwen3.5-397B-A17B-NVFP4, TP=8)

More detailed perf numbers in the PR comments below.

Server settings:

--tp-size 8 --max-running-requests 256 --chunked-prefill-size 163840
--mamba-ssm-dtype bfloat16 --mamba-scheduler-strategy no_buffer --mamba-track-interval 128
--attention-backend trtllm_mha --linear-attn-decode-backend flashinfer
--linear-attn-prefill-backend <triton|flashinfer> (varied per run)
--disable-radix-cache --quantization modelopt_fp4

Benchmark settings:

--dataset-name random --random-input-len 8192 --random-output-len 128
--max-concurrency 256 --num-prompts 512

| Metric | Triton prefill | FlashInfer prefill | Speedup |
| --- | --- | --- | --- |
| Benchmark duration (s) | 53.27 | 50.87 | 1.05x |
| Input throughput (tok/s) | 78,734 | 82,445 | 1.05x |
| Total throughput (tok/s) | 79,964 | 83,733 | 1.05x |
| Mean TTFT (ms) | 12,742 | 12,042 | 1.06x |
| Mean TPOT (ms) | 109.08 | 105.14 | 1.04x |
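The Speedup column follows from the two measurement columns. For latency-style metrics (duration, TTFT, TPOT) lower is better, so the ratio is Triton over FlashInfer; for throughput it is the reverse. A small sketch that recomputes it; the `speedup` helper and the `rows` layout are illustrative only:

```python
def speedup(triton: float, flashinfer: float, lower_is_better: bool) -> float:
    """Return how many times better FlashInfer is than Triton for one metric."""
    return triton / flashinfer if lower_is_better else flashinfer / triton

# (Triton value, FlashInfer value, lower_is_better), copied from the table above.
rows = {
    "Benchmark duration (s)": (53.27, 50.87, True),
    "Input throughput (tok/s)": (78734, 82445, False),
    "Total throughput (tok/s)": (79964, 83733, False),
    "Mean TTFT (ms)": (12742, 12042, True),
    "Mean TPOT (ms)": (109.08, 105.14, True),
}

for name, (triton, flashinfer, lower) in rows.items():
    print(f"{name}: {speedup(triton, flashinfer, lower):.2f}x")
```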

Requirements

FlashInfer >= 0.6.8 (for chunk_gated_delta_rule SM100 path)
nvidia-cutlass-dsl[cu13] >= 4.4.2 (SM100+ only)
CUDA 13 (SM100+ path requires _cuda_major >= 13)
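The requirements above can be expressed as a simple gate. The sketch below is not the check used in the PR; `unmet_sm100_requirements` and `parse_version` are hypothetical helpers, and the caller is assumed to supply the compute-capability major version (10 for SM100), the CUDA major version, and the installed package versions as strings:

```python
import re

def parse_version(version: str) -> tuple:
    """Parse the leading numeric fields of a version string, e.g. '0.6.8' -> (0, 6, 8)."""
    nums = []
    for part in version.split(".")[:3]:
        match = re.match(r"\d+", part)
        nums.append(int(match.group()) if match else 0)
    return tuple(nums)

def unmet_sm100_requirements(sm_major: int, cuda_major: int,
                             flashinfer_version: str,
                             cutlass_dsl_version: str) -> list:
    """Return the list of unmet requirements; an empty list means all are met."""
    problems = []
    if sm_major < 10:
        problems.append("device is not SM100+")
    if cuda_major < 13:
        problems.append("CUDA major version must be >= 13")
    if parse_version(flashinfer_version) < (0, 6, 8):
        problems.append("FlashInfer must be >= 0.6.8")
    if parse_version(cutlass_dsl_version) < (4, 4, 2):
        problems.append("nvidia-cutlass-dsl must be >= 4.4.2")
    return problems

print(unmet_sm100_requirements(10, 13, "0.6.8", "4.4.2"))
print(unmet_sm100_requirements(10, 12, "0.6.7", "4.4.2"))
```

Returning the list of unmet items rather than a bare boolean makes it easier to report which requirement is missing.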

CI States

Latest PR Test (Base): ✅ Run #26271103993
Latest PR Test (Extra): ❌ Run #26271103852

kaixih · 2026-04-16T03:10:56Z

cc @hlu1 @YAMY1234 @wenscarl

kaixih · 2026-04-16T18:32:19Z

The model has a repeated block pattern of 3× linear attention (GDN) + 1× full attention.
Profiling one such block during prefill:

| Backend | Block wall time | GDN prefill (3 layers) | GDN per layer | Kernels/layer |
| --- | --- | --- | --- | --- |
| Triton | 12,784 µs | 1,518 µs (506×3) | 506 µs | 12 |
| FlashInfer | 12,379 µs | 1,275 µs (425×3) | 425 µs | 11 |
| Speedup | 1.03x | 1.19x | 1.19x | |

The GDN kernel itself is ~19% faster with FlashInfer; the modest system-level gain (~5%)
reflects that GDN is a small fraction of the total forward pass (MoE GEMM, attention,
all-reduce account for the rest).
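The relationship between the kernel-level and block-level numbers can be checked with an Amdahl's-law style estimate using only the figures in the table. This is a rough sketch with illustrative variable names, not a calculation from the PR:

```python
# Values copied from the profiling table above (microseconds).
block_triton_us = 12784.0
block_flashinfer_us = 12379.0
gdn_triton_us = 1518.0      # 3 GDN layers
gdn_flashinfer_us = 1275.0  # 3 GDN layers

# Fraction of the block spent in GDN prefill under the Triton backend.
gdn_fraction = gdn_triton_us / block_triton_us
# Speedup of the GDN portion alone.
kernel_speedup = gdn_triton_us / gdn_flashinfer_us

# Amdahl's law: only the GDN fraction gets faster.
predicted_block_speedup = 1.0 / ((1.0 - gdn_fraction) + gdn_fraction / kernel_speedup)
observed_block_speedup = block_triton_us / block_flashinfer_us

print(f"GDN share of block time: {gdn_fraction:.1%}")
print(f"GDN speedup:             {kernel_speedup:.2f}x")
print(f"predicted block speedup: {predicted_block_speedup:.2f}x")
print(f"observed block speedup:  {observed_block_speedup:.2f}x")
```

With GDN at roughly 12% of the block, a 1.19x kernel speedup predicts a block-level gain of about 2%. The measured block gain is about 3%; the profile above does not attribute the remaining difference.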

FlashInfer GDN prefill — kernel breakdown (per layer, 11 launches)

| Kernel | Calls | Time |
| --- | --- | --- |
| `GatedDeltaNetChunkedKernel` (fused main) | 1 | 328.2 µs |
| `elementwise_kernel` (bf16 contiguity copy, packed QKV) | 3 | 58.2 µs (19.4 µs each) |
| `l2norm_fwd_kernel` | 2 | 7.5 µs (3.7 µs each) |
| `index_elementwise_kernel` (index_copy scatter) | 1 | 2.9 µs |
| `vectorized_gather_kernel` (state gather) | 1 | 2.5 µs |
| `vectorized_elementwise_kernel` (exp) | 1 | 2.4 µs |
| `unrolled_elementwise_kernel` (int64 cast for index_copy) | 1 | 2.2 µs |
| `vectorized_elementwise_kernel` (clamp) | 1 | 2.0 µs |
| Total | 11 | ≈406 µs (wall: 425 µs) |

Triton GDN prefill — kernel breakdown (per layer, 12 launches)

| Kernel | Calls | Time |
| --- | --- | --- |
| `chunk_gated_delta_rule_fwd_kernel_h_blockdim64` (main recurrence) | 1 | 257.9 µs |
| `chunk_fwd_kernel_o` (output projection) | 1 | 63.5 µs |
| `elementwise_kernel` (bf16 contiguity copy, packed QKV) | 3 | 56.8 µs (18.9 µs each) |
| `chunk_gated_delta_rule_fwd_kkt_solve_kernel` | 1 | 42.2 µs |
| `recompute_w_u_fwd_kernel` | 1 | 34.2 µs |
| `vectorized_elementwise_kernel` (fill bf16) | 2 | 15.6 µs (7.8 µs each) |
| `l2norm_fwd_kernel` | 2 | 9.0 µs (4.5 µs each) |
| `chunk_local_cumsum_scalar_kernel` | 1 | 4.8 µs |
| Total | 12 | ≈484 µs (wall: 506 µs) |

The ~20 µs gap between summed kernel times and wall time (≈406 vs. 425 µs for
FlashInfer, ≈484 vs. 506 µs for Triton) reflects Python-level kernel launch overhead
(gaps between dispatches). The FlashInfer overhead items above (packed QKV copies,
gather/scatter, l2norm, exp, cast, clamp; ~78 µs in total) are candidates for
elimination via upstream improvements.
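Summing the per-kernel times from the two breakdown tables gives the totals, the difference from wall time, and the FlashInfer time spent outside the fused main kernel. A small sketch; the dictionary keys are shortened labels, not exact kernel names:

```python
# Per-layer kernel times in microseconds, copied from the tables above.
flashinfer_us = {
    "fused main": 328.2,
    "packed QKV copies (x3)": 58.2,
    "l2norm (x2)": 7.5,
    "index_copy scatter": 2.9,
    "state gather": 2.5,
    "exp": 2.4,
    "int64 cast": 2.2,
    "clamp": 2.0,
}
triton_us = {
    "main recurrence": 257.9,
    "output kernel": 63.5,
    "packed QKV copies (x3)": 56.8,
    "kkt solve": 42.2,
    "recompute w/u": 34.2,
    "fill bf16 (x2)": 15.6,
    "l2norm (x2)": 9.0,
    "local cumsum": 4.8,
}
flashinfer_wall_us, triton_wall_us = 425.0, 506.0

fi_total = sum(flashinfer_us.values())
tr_total = sum(triton_us.values())
# Everything in the FlashInfer path other than the fused main kernel.
fi_overhead = fi_total - flashinfer_us["fused main"]

print(f"FlashInfer: kernels {fi_total:.1f} µs, wall gap {flashinfer_wall_us - fi_total:.1f} µs")
print(f"Triton:     kernels {tr_total:.1f} µs, wall gap {triton_wall_us - tr_total:.1f} µs")
print(f"FlashInfer non-main kernels: {fi_overhead:.1f} µs")
```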

kaixih · 2026-04-16T18:37:26Z

This PR is ready for review.

hlu1 · 2026-04-16T20:44:56Z

The CuteDSL kernel performance is limited by low parallelism when batch size and number of heads are small, which is clearly shown by the kernel benchmark in flashinfer-ai/flashinfer#3001

Depending on how the prefill benchmark is configured, the e2e speedup will vary a lot. For example, with 1k or 8k ISL, --chunked-prefill-size 163840, and TP4, you get an effective batch size of 160 or 20, respectively, and will hit the higher end of the speedup. But if you set --chunked-prefill-size 8192, the effective batch size will be smaller and you will hit the lower end of the speedup. In practice, the real speedup will depend on the actual ISL of the workloads, and we likely won't see much speedup for long-ISL workloads.
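The effective batch sizes quoted above follow from dividing the chunked-prefill budget by the input sequence length. A minimal sketch, assuming every request has the same ISL and each prefill chunk is filled completely; `effective_batch_size` is an illustrative helper, not an SGLang function:

```python
def effective_batch_size(chunked_prefill_size: int, isl: int) -> int:
    """Number of equal-length requests that fit in one prefill chunk (at least 1)."""
    return max(1, chunked_prefill_size // isl)

for chunk in (163840, 8192):
    for isl in (1024, 8192):
        print(f"chunked-prefill-size={chunk}, ISL={isl}: "
              f"effective batch size {effective_batch_size(chunk, isl)}")
```

With a chunk size of 8192, an 8k-ISL request fills the whole chunk by itself, which is the low-parallelism regime described above.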

hlu1 · 2026-04-21T22:07:23Z

        q_fi = l2norm_fwd(q[0].contiguous())
        k_fi = l2norm_fwd(k[0].contiguous())


We can modify the Triton l2norm_fwd kernel to support strided inputs, which would eliminate the contiguous() calls.

ispobock · 2026-04-22T16:20:35Z

/tag-and-rerun-ci

yuan-luo · 2026-04-22T16:38:08Z

/rerun-failed-ci

yuan-luo · 2026-04-23T14:58:06Z

/rerun-failed-ci

yuan-luo · 2026-04-27T17:40:26Z

/rerun-failed-ci

…fy on SM100+ (Blackwell)

Resolved conflicts with PR sgl-project#22921:
- gdn_flashinfer.py: combined module and class docstrings to reflect that SM100+ now supports decode, prefill, and MTP verify.
- gdn_flashinfer.py target_verify: dropped the SM100+ NotImplementedError guard so the pool-API MTP path runs on both SM90 and SM100+.
- server_args.py: kept the bf16 gate from sgl-project#22921 and removed the speculative_algorithm gate now that MTP verify is supported on SM100+.

PR sgl-project#22921 renamed the SM-gating attribute from use_state_pool to is_sm100plus (updating all existing call sites). PR sgl-project#23273 was authored against the older name and added a new reference in the bf16 MTP adapter setup. The git auto-merge picked up sgl-project#22921's renames and sgl-project#23273's new block, leaving a single dangling use_state_pool access that crashed at FlashInferGDNKernel.__init__. Rename the one remaining reference to is_sm100plus to match the rest of the class.
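The failure mode described above, where a clean textual merge leaves one reference to a renamed attribute, can be illustrated with a toy class. `GDNKernelSketch` and `needs_adapter` are hypothetical stand-ins and do not reproduce the real `FlashInferGDNKernel`:

```python
class GDNKernelSketch:
    """Hypothetical stand-in for a kernel wrapper; not the real class."""

    def __init__(self, sm_major: int, stale_reference: bool = False):
        # Attribute name after the rename (previously use_state_pool).
        self.is_sm100plus = sm_major >= 10
        if stale_reference:
            # A block written against the old name fails here, inside __init__,
            # because nothing assigns use_state_pool any more.
            self.needs_adapter = self.use_state_pool
        else:
            self.needs_adapter = self.is_sm100plus

try:
    GDNKernelSketch(10, stale_reference=True)
except AttributeError as exc:
    print("crash at construction:", exc)

print("after the fix:", GDNKernelSketch(10).needs_adapter)
```

Because the two changes touched different lines, git reports no conflict; the problem only surfaces when the constructor runs.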

kaixih · 2026-05-07T05:00:42Z

ping @yizhang2077

samuellees · 2026-05-11T12:30:16Z

/rerun-failed-ci +

yuan-luo · 2026-05-13T02:55:37Z

/rerun-failed-ci

nvpohanh · 2026-05-20T06:05:13Z

@kaixih Please rebase and resolve the conflicts. thanks!

nvpohanh · 2026-05-22T01:37:03Z

/rerun-failed-ci

nvpohanh · 2026-05-22T01:52:22Z

Traceback (most recent call last):
  File "/actions-runner/_work/sglang/sglang/test/run_suite.py", line 421, in <module>
    main()
  File "/actions-runner/_work/sglang/sglang/test/run_suite.py", line 416, in main
    exit_code = run_a_suite(args)
  File "/actions-runner/_work/sglang/sglang/test/run_suite.py", line 295, in run_a_suite
    validate_all_suites(all_tests)
  File "/actions-runner/_work/sglang/sglang/test/run_suite.py", line 171, in validate_all_suites
    raise ValueError("Tests registered to invalid suites:\n" + "\n".join(errors))
ValueError: Tests registered to invalid suites:
  /actions-runner/_work/sglang/sglang/test/registered/4-gpu-models/test_qwen35_fp4_flashinfer.py: backend=CUDA, suite='stage-c-test-4-gpu-b200'

https://github.com/sgl-project/sglang/actions/runs/26260336698/job/77301401879?pr=22921
@kaixih Could you fix this?

kaixih · 2026-05-22T16:45:05Z

@nvpohanh Thanks for flagging this. The CI registration was using an invalid CUDA suite name, so I updated it to stage="base-c", runner_config="4-gpu-b200", which resolves to base-c-test-4-gpu-b200.

I also manually ran the target test on a 4x B200 node in the latest SGLang dev container; it passed with GSM8K accuracy 0.980 vs the 0.950 baseline. The new red checks look unrelated: the other B200 shards were fast-failed due to a root failure in base-c-test-4-gpu-h100 (0).
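The naming rule mentioned above, where a stage and a runner config combine into a suite name, can be sketched as follows. The helper names and the `VALID_SUITES` set are assumptions for illustration (the set contains only the two suite names that appear in this thread); the real logic lives in test/run_suite.py and may differ:

```python
# Hypothetical set of known suites; only names mentioned in this thread.
VALID_SUITES = {"base-c-test-4-gpu-b200", "base-c-test-4-gpu-h100"}

def resolve_suite(stage: str, runner_config: str) -> str:
    """Combine stage and runner config into a suite name."""
    return f"{stage}-test-{runner_config}"

def validate_suites(registrations: dict) -> None:
    """Raise ValueError listing every test registered to an unknown suite."""
    errors = [
        f"  {path}: suite={suite!r}"
        for path, suite in registrations.items()
        if suite not in VALID_SUITES
    ]
    if errors:
        raise ValueError("Tests registered to invalid suites:\n" + "\n".join(errors))

test_path = "test/registered/4-gpu-models/test_qwen35_fp4_flashinfer.py"

# Old registration: rejected, matching the error in the traceback above.
try:
    validate_suites({test_path: "stage-c-test-4-gpu-b200"})
except ValueError as exc:
    print(exc)

# New registration: resolves to a known suite and passes validation.
validate_suites({test_path: resolve_suite("base-c", "4-gpu-b200")})
print("ok:", resolve_suite("base-c", "4-gpu-b200"))
```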

nvpohanh · 2026-05-25T01:02:54Z

/tag-and-rerun-ci

nvpohanh · 2026-05-26T04:47:08Z

All the NV CI has passed. @yuan-luo @Fridge003 could we merge this? Thanks!

nvpohanh · 2026-05-27T00:31:53Z

@ispobock Could you also help to review this GDN PR? Thanks!

kaixih requested review from Fridge003, HaiShaw, Qiaolin-Yu, hebiao064, ishandhanani, ispobock, merrymercy and yctseng0211 as code owners April 16, 2026 03:09

kaixih changed the title ~~[NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell)~~ [Draft] [NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) Apr 16, 2026

kaixih changed the title ~~[Draft] [NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell)~~ [NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) Apr 16, 2026

hlu1 approved these changes Apr 16, 2026

Comment thread python/sglang/srt/layers/attention/linear/kernels/gdn_flashinfer.py Outdated

Comment thread python/sglang/srt/server_args.py Outdated

hlu1 reviewed Apr 16, 2026

Comment thread python/sglang/srt/layers/attention/linear/kernels/gdn_flashinfer.py

nvpohanh mentioned this pull request Apr 17, 2026

[Tracking] Qwen3.5-397B (G)B200 Functional Support and Optimizations #20024

Open

kaixih closed this Apr 17, 2026

kaixih reopened this Apr 17, 2026

kaixih force-pushed the add_flashinfer_gdn_prefill branch from 23b04c0 to b6c0d39 Compare April 17, 2026 18:00

hlu1 reviewed Apr 21, 2026

github-actions Bot added the run-ci label Apr 22, 2026

yuan-luo reviewed Apr 23, 2026

Comment thread python/sglang/srt/server_args.py

yuan-luo approved these changes Apr 23, 2026

arpera mentioned this pull request Apr 23, 2026

[GDN] Enable FI Blackwell GDN prefill kernel vllm-project/vllm#40717

Merged

4 tasks

kaixih mentioned this pull request Apr 27, 2026

[feat] add log gate and initial state pool support in blackwell gdn prefill flashinfer-ai/flashinfer#3167

Open

5 tasks

Kangyan-Zhou mentioned this pull request Apr 28, 2026

ci: clean up stale-CUDA mooncake variant in install_extra_deps #23960

Merged

2 tasks

kaixih force-pushed the add_flashinfer_gdn_prefill branch 2 times, most recently from 032c24c to 9496b9d Compare May 12, 2026 21:17

kaixih and others added 3 commits May 21, 2026 23:59

Add flashinfer GDN prefill

fb18035

chore: clarify padding-index clamp comment and clean up stale TODO

e8be1fe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix FlashInfer GDN SM100 prefill

241bc2c

kaixih force-pushed the add_flashinfer_gdn_prefill branch from 9496b9d to 241bc2c Compare May 22, 2026 00:02

Change CI test tag

f312314

Fridge003 approved these changes May 27, 2026

Fridge003 merged commit ddf0627 into sgl-project:main May 27, 2026
227 of 254 checks passed

mqhc2020 pushed a commit to mqhc2020/sglang that referenced this pull request Jun 2, 2026

[NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) (s…

880cf09

…gl-project#22921)

alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026

[NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) (s…

2258561

…gl-project#22921)

jeynmann pushed a commit to jeynmann/sglang that referenced this pull request Jun 4, 2026

[NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) (s…

033fac4

…gl-project#22921)

7 participants
