[Qwen3-Next] Fuse Qwen3-Next GDN's qkvz_proj and ba_proj#19321
BBuf merged 3 commits into sgl-project:main
Summary of Changes
This pull request introduces a significant performance optimization for the Qwen3-Next model.
Code Review
This PR aims to improve performance by fusing qkvz_proj and ba_proj in Qwen3-Next's Gated Delta Net (GDN) using MergedColumnParallelLinear. The changes in python/sglang/srt/layers/linear.py correctly add support for loading fused weights with tuple-based shard IDs. The fusion of ba_proj in python/sglang/srt/models/qwen3_next.py also appears correct. However, there is a critical issue in the implementation of in_proj_qkvz fusion. The create_qkvz_proj method defines a MergedColumnParallelLinear with only two output partitions, which contradicts the weight loading logic in load_weights that expects four partitions. This will lead to an IndexError at runtime. I've provided a comment with a suggested fix.
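The tuple-based shard ID idea described in the review can be sketched roughly as follows. This is a toy illustration, not the code in python/sglang/srt/layers/linear.py: the class `ToyMergedColumnWeight` and its `load` method are hypothetical names, tensor-parallel sharding is ignored, and weights are plain Python lists. It assumes a tuple such as `(0, 1, 2)` means "this checkpoint tensor covers several adjacent output partitions at once", and it shows why a layer declared with only two partitions fails when the loader addresses a fourth.

```python
from typing import List, Sequence, Tuple, Union

ShardId = Union[int, Tuple[int, ...]]


class ToyMergedColumnWeight:
    """Hypothetical stand-in for a fused weight (one row per output feature)."""

    def __init__(self, output_sizes: Sequence[int], in_features: int) -> None:
        self.output_sizes = list(output_sizes)
        self.rows: List[List[float]] = [
            [0.0] * in_features for _ in range(sum(self.output_sizes))
        ]

    def _offset(self, partition: int) -> int:
        # A shard ID naming a partition that was never declared fails here.
        if not 0 <= partition < len(self.output_sizes):
            raise IndexError(
                f"partition {partition} is outside the "
                f"{len(self.output_sizes)} declared partitions"
            )
        return sum(self.output_sizes[:partition])

    def load(self, loaded_rows: Sequence[Sequence[float]], shard_id: ShardId) -> None:
        # Normalize an int shard ID to a one-element tuple.
        ids = (shard_id,) if isinstance(shard_id, int) else tuple(shard_id)
        if not ids or any(b != a + 1 for a, b in zip(ids, ids[1:])):
            raise ValueError("tuple shard IDs must name adjacent partitions in order")
        start = self._offset(ids[0])
        self._offset(ids[-1])  # validate the last partition as well
        expected = sum(self.output_sizes[i] for i in ids)
        if len(loaded_rows) != expected:
            raise ValueError(f"expected {expected} rows, got {len(loaded_rows)}")
        # Copy the checkpoint rows into the matching slice of the fused weight.
        for r, row in enumerate(loaded_rows):
            self.rows[start + r] = list(row)
```

With four declared partitions (q, k, v, z), a checkpoint tensor holding q, k and v together can be loaded with `(0, 1, 2)` and z with `3`. With only two declared partitions, shard ID `3` raises `IndexError`, which is the kind of mismatch the review points out.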
cfae62a to 7d2f34d
I found that ./test/registered/models/test_qwen3_next_models_pcg.py failed in CI due to an accuracy drop.
1f36219 to d36e7b5
290f9fd to 8461701
ed019b7 to 4b111d2
Problem fixed. With PCG, the result is now correct.
zminglei left a comment
Thanks Yuan for this nice optimization! Could we also confirm that it is still on par with or better than the baseline in decode-heavy cases?
Thanks for the PR! A couple of questions to align my understanding:
For Qwen3-Next: The HF checkpoint already stores
For Qwen3.5: The HF checkpoint stores weights split as
@zminglei @yuan-luo I tested on A100 with tp = 4 in a decode-heavy scenario. Results for the main branch and the PR are each averaged over 3 runs (the first run served as warmup and was excluded). Median TTFT improved by 9.2% and median E2E by 1.8%.
Server:
Client:
Tested with the GSM8K few-shot test: 0.955 accuracy, no regression observed.
@kaixih I agree. The core value of this PR is introducing tuple shard_id support in the linear layer framework, which makes it possible for Qwen3.5 to fuse 4 GEMMs into 2 in the next step.
@kaixih I addressed the second point in #21019. Could you please help review it? Thanks.
…#19321) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Motivation
This PR fuses Qwen3-Next GDN's qkvz_proj and ba_proj with MergedColumnParallelLinear to improve performance.
TTFT improves by 2.6% (stable). E2E throughput increases by 2.6% (stable across several test runs).
We plan to fuse Qwen3.5 GDN's qkvz_proj and ba_proj in a follow-up PR.
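As a rough illustration of why fusing projections is safe, the sketch below checks that one matrix multiply against the stacked weights gives the same outputs as separate multiplies, once the result is split back by partition size. It is not the MergedColumnParallelLinear implementation: `matmul` and `fused_projection` are hypothetical helper names, the weights are plain Python lists with one row per output feature, and no tensor parallelism is modeled.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(x: Sequence[Sequence[float]], w: Sequence[Sequence[float]]) -> Matrix:
    """Compute x @ w^T, where each row of w is one output feature."""
    return [[sum(xi * wi for xi, wi in zip(x_row, w_row)) for w_row in w] for x_row in x]


def fused_projection(
    x: Sequence[Sequence[float]], weights: Sequence[Sequence[Sequence[float]]]
) -> List[Matrix]:
    """Run one GEMM over the stacked weights, then split the output per projection."""
    fused = [row for w in weights for row in w]  # stack along the output dimension
    out = matmul(x, fused)                       # a single multiply instead of len(weights)
    splits: List[Matrix] = []
    start = 0
    for w in weights:
        size = len(w)
        splits.append([row[start:start + size] for row in out])
        start += size
    return splits
```

For example, `b, a = fused_projection(x, [w_b, w_a])` returns the same values as `matmul(x, w_b)` and `matmul(x, w_a)` computed separately; the fused form simply issues one multiply where the unfused form issues two.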
Modifications
Accuracy Tests
GSM8K shows no accuracy drop.
Benchmarking and Profiling