Support piecewise cuda graph for Qwen3-next #13081

ispobock merged 12 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @Chen-0210, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces initial support for piecewise CUDA graphs for the Qwen3-next model. By enabling specific attention and gated delta rule operations to be compiled into CUDA graphs, the changes aim to optimize the model's execution performance. The modifications involve extending the graph compilation backend, defining custom operations for efficient tensor handling, and adapting the Qwen3-next model's forward pass to integrate these graph-based optimizations.

Highlights
Code Review
This pull request introduces support for piecewise CUDA graphs for Qwen3-next models, which involves refactoring attention mechanisms and integrating custom operations. Key changes include modifying graph splitting logic to include sglang.gdn_with_output, refactoring Qwen3GatedDeltaNet's forward pass, and adding new custom operations for gated delta rule and GDN with output. The review identified a critical syntax error, potential performance implications from disabling dual-stream optimization, and some minor code cleanup opportunities.
Oasis-Git left a comment:
Leave the comment for revision before merge
LGTM. Will approve it after testing on my side.
/tag-and-rerun-ci
This fixes the 0% accuracy issue on H100 with TestQwen3NextPiecewiseCudaGraph. During piecewise CUDA graph execution, padded rows may not be written by the FLA kernels, leaving uninitialized garbage values that corrupt downstream computations.

Changes:
- fused_recurrent.py: use new_zeros instead of new_empty for the output tensor
- fused_sigmoid_gating_recurrent.py: use new_zeros instead of new_empty
- qwen3_next.py: use zeros_like instead of empty_like for the output tensor

This is similar to the fix applied in chunk_o.py in PR #13081.
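The zero-initialization fix can be illustrated with a small NumPy sketch. NumPy stands in for the torch allocators (`np.empty` for `new_empty`, `np.zeros` for `new_zeros`), and `gdn_kernel_writes_valid_rows` is a hypothetical stand-in for an FLA kernel that only writes the first `seq_len` rows of a padded output buffer:

```python
import numpy as np

def gdn_kernel_writes_valid_rows(out, seq_len):
    # Simulates a kernel that writes only the first seq_len rows;
    # padded rows beyond seq_len are left untouched.
    out[:seq_len] = 1.0

seq_len, padded_len, hidden = 3, 8, 4

# Buggy pattern: uninitialized buffer (np.empty standing in for new_empty).
# We fill it with NaN to make the uninitialized-memory hazard deterministic.
bad = np.empty((padded_len, hidden), dtype=np.float32)
bad[:] = np.nan
gdn_kernel_writes_valid_rows(bad, seq_len)
assert np.isnan(bad[seq_len:]).all()   # padding still holds garbage

# Fixed pattern: zero-initialized buffer (np.zeros standing in for new_zeros).
good = np.zeros((padded_len, hidden), dtype=np.float32)
gdn_kernel_writes_valid_rows(good, seq_len)
assert not np.isnan(good).any()        # padding is well-defined zeros
```

With zero-initialized buffers, the padded rows hold defined values regardless of what the kernel writes, so downstream consumers never read garbage.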
Motivation
Support piecewise cuda graph for Qwen3-next
#11490
Modifications
Because q/k/v.shape[0] is larger than the real seq_len, GDN does not write the padded rows, and empty_like leaves them uninitialized (often NaN). Passing such padded QKV to FlashInfer prefill is undefined input, so its behavior is not guaranteed: the kernel appears to load the NaNs from the padding, and the valid tokens become NaN as well. The fa3 backend does not exhibit this issue.

Accuracy Tests
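The NaN propagation described in the Modifications note can be reproduced with a toy NumPy sketch of an attention-style reduction. Shapes and values here are illustrative, not the real kernels: padded key rows are left as NaN, and the softmax normalization spreads them to every valid token.

```python
import numpy as np

seq_len, pad, d = 2, 2, 4
q = np.ones((seq_len, d), dtype=np.float32)      # valid queries
k = np.ones((seq_len + pad, d), dtype=np.float32)
k[seq_len:] = np.nan                             # unwritten padded key rows

scores = q @ k.T                                 # valid queries see padded keys
assert np.isnan(scores[:, seq_len:]).all()

# Softmax normalizes over ALL keys, so a single NaN score poisons the
# whole row of weights -- outputs for valid tokens become NaN too.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
assert np.isnan(weights).all()
```

This is why zero-initializing the padded rows (rather than masking after the fact) is sufficient: zeros keep every downstream arithmetic operation finite.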
Benchmarking and Profiling (TTFT on bs=1 and H200x2)
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 50 --random-input-len 4096 --random-output-len 1 --random-range-ratio 1 --max-concurrency 1

Checklist