Support piecewise cuda graph for Qwen3-next #13081

ispobock merged 12 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @Chen-0210, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces initial support for piecewise CUDA graphs for the Qwen3-next model. By enabling specific attention and gated delta rule operations to be compiled into CUDA graphs, the changes aim to optimize the model's execution performance. The modifications involve extending the graph compilation backend, defining custom operations for efficient tensor handling, and adapting the Qwen3-next model's forward pass to integrate these graph-based optimizations.

Highlights
Code Review
This pull request introduces support for piecewise CUDA graphs for Qwen3-next models, which involves refactoring attention mechanisms and integrating custom operations. Key changes include modifying graph splitting logic to include sglang.gdn_with_output, refactoring Qwen3GatedDeltaNet's forward pass, and adding new custom operations for gated delta rule and GDN with output. The review identified a critical syntax error, potential performance implications from disabling dual-stream optimization, and some minor code cleanup opportunities.
Oasis-Git left a comment:
Leave the comment for revision before merge
LGTM. Will approve it after testing on my side.
/tag-and-rerun-ci
This fixes the 0% accuracy issue on H100 with TestQwen3NextPiecewiseCudaGraph. During piecewise CUDA graph execution, padded rows may not be written by the FLA kernels, leaving uninitialized garbage values that corrupt downstream computations.

Changes:
- fused_recurrent.py: use new_zeros instead of new_empty for the output tensor
- fused_sigmoid_gating_recurrent.py: use new_zeros instead of new_empty
- qwen3_next.py: use zeros_like instead of empty_like for the output tensor

This is similar to the fix applied in chunk_o.py in PR #13081.
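The zero-initialization fix can be illustrated with a small NumPy sketch. NumPy stands in for the torch allocators (`np.empty` for `new_empty`, `np.zeros` for `new_zeros`), and `gdn_kernel_writes_valid_rows` is a hypothetical stand-in for an FLA kernel that only writes the first `seq_len` rows of a padded output buffer:

```python
import numpy as np

def gdn_kernel_writes_valid_rows(out, seq_len):
    # Simulates a kernel that writes only the first seq_len rows;
    # padded rows beyond seq_len are left untouched.
    out[:seq_len] = 1.0

seq_len, padded_len, hidden = 3, 8, 4

# Buggy pattern: uninitialized buffer (np.empty standing in for new_empty).
# We fill it with NaN to make the uninitialized-memory hazard deterministic.
bad = np.empty((padded_len, hidden), dtype=np.float32)
bad[:] = np.nan
gdn_kernel_writes_valid_rows(bad, seq_len)
assert np.isnan(bad[seq_len:]).all()   # padding still holds garbage

# Fixed pattern: zero-initialized buffer (np.zeros standing in for new_zeros).
good = np.zeros((padded_len, hidden), dtype=np.float32)
gdn_kernel_writes_valid_rows(good, seq_len)
assert not np.isnan(good).any()        # padding is well-defined zeros
```

With zero-initialized buffers, the padded rows hold defined values regardless of what the kernel writes, so downstream consumers never read garbage.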
Motivation
Support piecewise cuda graph for Qwen3-next
#11490
Modifications
Because q/k/v.shape[0] is larger than the real seq_len, GDN does not write the padded rows, and empty_like leaves them uninitialized (often NaN). Passing such padded QKV to FlashInfer prefill is undefined input, so its behavior is not guaranteed: the kernel appears to load the NaNs from the padding, and the valid tokens become NaN as well. The fa3 backend does not exhibit this issue.

Accuracy Tests
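The NaN propagation described in the Modifications note can be reproduced with a toy NumPy sketch of an attention-style reduction. Shapes and values here are illustrative, not the real kernels: padded key rows are left as NaN, and the softmax normalization spreads them to every valid token.

```python
import numpy as np

seq_len, pad, d = 2, 2, 4
q = np.ones((seq_len, d), dtype=np.float32)      # valid queries
k = np.ones((seq_len + pad, d), dtype=np.float32)
k[seq_len:] = np.nan                             # unwritten padded key rows

scores = q @ k.T                                 # valid queries see padded keys
assert np.isnan(scores[:, seq_len:]).all()

# Softmax normalizes over ALL keys, so a single NaN score poisons the
# whole row of weights -- outputs for valid tokens become NaN too.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
assert np.isnan(weights).all()
```

This is why zero-initializing the padded rows (rather than masking after the fact) is sufficient: zeros keep every downstream arithmetic operation finite.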
Benchmarking and Profiling (TTFT on bs=1 and H200x2)
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input-len 4096 --random-output-len 1 --random-range-ratio 1 --max-concurrency 1

Checklist