[PERF] Decouple projections from GDN custom op #27512
simon-mo merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request refactors the Gated Delta Net (GDN) attention mechanism to improve torch.compile compatibility and performance. By decoupling the input/output projections from the core custom operator and introducing a native PyTorch RMSNormGated layer, the changes yield significant decode throughput improvements. The refactoring is well-executed and the code is clear. I have one high-severity suggestion regarding a local import in a performance-critical path, which should be moved to the top level of the module to adhere to best practices and avoid potential overhead.
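For context, the local-import pattern the review flags can be illustrated with a hedged toy example (the function names here are invented for illustration; this is not the actual vLLM code):

```python
def rms_eps_local(x: float) -> float:
    import math  # local import: re-runs the import statement on every call in a hot path
    return math.sqrt(x)

import math  # hoisted to module scope, as the review suggests

def rms_eps_toplevel(x: float) -> float:
    return math.sqrt(x)  # no per-call import machinery

# Both return the same value; only the per-call overhead differs.
assert rms_eps_local(2.0) == rms_eps_toplevel(2.0)
```

Python caches modules in `sys.modules`, so a repeated local import is not a full re-import, but it still pays a lookup on every call, which matters in a decode-loop hot path.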
💡 Codex Review
Here are some automated review suggestions for this pull request.
CC @heheda12345
@ALL
This pull request has merge conflicts that must be resolved before it can be merged.
ProExpertProg
left a comment
LGTM but I'll let someone more familiar with Qwen3 approve
Is there someone familiar with Qwen3-Next other than @sighingnow?
cc @tlrmchlsmth
@codex review
Codex Review: Didn't find any major issues. Delightful!
youkaichao
left a comment
Looks good! cc @zhiyuan1i, we should be able to do a similar optimization for Kimi Linear.
I have made a PR #27871 for Kimi.
Force-pushed from 39cd84b to 090a44b
Regarding the CI failures: they don't look related to this PR, and many recent commits show similar failures.
@codex review
Codex Review: Didn't find any major issues. Hooray!
Head branch was pushed to by a user without write access
Force-pushed from 70d2048 to c3197c6
CI results are weird. I can't reproduce them locally.
Can you fix the pre-commit and rebase your PR from main? There have been many CI fixes on the main branch these days.
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Force-pushed from 7f164be to f276f69
Rebased. Still the same, and I still can't reproduce it locally :(
…7512) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Purpose
This PR is a refactoring of GDN.
The main goal is to allow wider use of torch.compile. It adds an RMSNormGated class that implements gated RMSNorm in native PyTorch and uses it for GDN. torch.compile generates good code for RMSNormGated, even better than the custom Triton kernel used before.
Functional Test Result
lm_evalBefore
After
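As a hedged illustration of the RMSNormGated idea described in the Purpose section (the class below is a minimal sketch of a torch-native gated RMSNorm, not the exact vLLM implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNormGated(nn.Module):
    """Sketch of a torch-native gated RMSNorm: RMS-normalize x, scale by a
    learned weight, then gate with SiLU(gate). Because every step is a plain
    elementwise/reduction op, torch.compile can fuse the whole forward into
    a single kernel, avoiding a hand-written Triton kernel."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()  # compute the norm in float32 for numerical stability
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (self.weight * x.to(dtype)) * F.silu(gate)
```

With the projections decoupled from the custom op, a module like this sits in ordinary PyTorch code where torch.compile can see and fuse it, which is consistent with the decode throughput gains reported below.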
Perf Test Result
Server
Prefill
Before: Total Token throughput (tok/s): 104098.78
After: Total Token throughput (tok/s): 105270.70
Speedup: 1.1%
Decode1
Before: Output token throughput (tok/s): 19212.17
After: Output token throughput (tok/s): 22384.37
Speedup: 16.5%
Decode2
Before: Output token throughput (tok/s): 28821.37
After: Output token throughput (tok/s): 30298.90
Speedup: 5.1%
Decode3
Server
(without increasing --max_cudagraph_capture_size)
Before: Output token throughput (tok/s): 16586.93
After: Output token throughput (tok/s): 18953.92
Speedup: 14.3%