[PERF] Decouple projections from GDN custom op #27512
simon-mo merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request refactors the Gated Delta Net (GDN) attention mechanism to improve torch.compile compatibility and performance. By decoupling the input/output projections from the core custom operator and introducing a native PyTorch RMSNormGated layer, the changes yield significant decode throughput improvements. The refactoring is well-executed and the code is clear. I have one high-severity suggestion regarding a local import in a performance-critical path, which should be moved to the top level of the module to adhere to best practices and avoid potential overhead.
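For context, the local-import pattern the review flags can be illustrated with a hedged toy example (the function names here are invented for illustration; this is not the actual vLLM code):

```python
def rms_eps_local(x: float) -> float:
    import math  # local import: re-runs the import statement on every call in a hot path
    return math.sqrt(x)

import math  # hoisted to module scope, as the review suggests

def rms_eps_toplevel(x: float) -> float:
    return math.sqrt(x)  # no per-call import machinery

# Both return the same value; only the per-call overhead differs.
assert rms_eps_local(2.0) == rms_eps_toplevel(2.0)
```

Python caches modules in `sys.modules`, so a repeated local import is not a full re-import, but it still pays a lookup on every call, which matters in a decode-loop hot path.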
💡 Codex Review
Here are some automated review suggestions for this pull request.
CC @heheda12345
@ALL
This pull request has merge conflicts that must be resolved before it can be merged.
ProExpertProg
left a comment
LGTM but I'll let someone more familiar with Qwen3 approve
Is there someone familiar with Qwen3-Next other than @sighingnow?
cc @tlrmchlsmth
@codex review
Codex Review: Didn't find any major issues. Delightful!
youkaichao
left a comment
Looks good! cc @zhiyuan1i, we should be able to do a similar optimization for Kimi Linear.
I have made a PR #27871 for Kimi.
Force-pushed from 39cd84b to 090a44b
Regarding the CI failures: they don't look related to this PR, and many recent commits show similar failures.
@codex review
Codex Review: Didn't find any major issues. Hooray!
Head branch was pushed to by a user without write access
Force-pushed from 70d2048 to c3197c6
CI results are weird. I can't reproduce them locally.
Can you fix the pre-commit and rebase your PR from main? There have been many CI fixes on the main branch these days.
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Force-pushed from 7f164be to f276f69
Rebased. Still the same, and I still can't reproduce it locally :(
…7512) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Purpose
This PR is a refactoring of GDN.
The main goal is to allow wider use of torch.compile. It adds an RMSNormGated class that implements gated RMSNorm in native PyTorch and uses it for GDN. torch.compile generates good code for RMSNormGated, even better than the custom Triton kernel used before.
Functional Test Result
lm_evalBefore
After
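As a hedged illustration of the RMSNormGated idea described in the Purpose section (the class below is a minimal sketch of a torch-native gated RMSNorm, not the exact vLLM implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNormGated(nn.Module):
    """Sketch of a torch-native gated RMSNorm: RMS-normalize x, scale by a
    learned weight, then gate with SiLU(gate). Because every step is a plain
    elementwise/reduction op, torch.compile can fuse the whole forward into
    a single kernel, avoiding a hand-written Triton kernel."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()  # compute the norm in float32 for numerical stability
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (self.weight * x.to(dtype)) * F.silu(gate)
```

With the projections decoupled from the custom op, a module like this sits in ordinary PyTorch code where torch.compile can see and fuse it, which is consistent with the decode throughput gains reported below.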
Perf Test Result
Server
Prefill
Before: Total Token throughput (tok/s): 104098.78
After: Total Token throughput (tok/s): 105270.70
Speedup: 1.1%
Decode1
Before: Output token throughput (tok/s): 19212.17
After: Output token throughput (tok/s): 22384.37
Speedup: 16.5%
Decode2
Before: Output token throughput (tok/s): 28821.37
After: Output token throughput (tok/s): 30298.90
Speedup: 5.1%
Decode3
Server
(without increasing --max_cudagraph_capture_size)
Before: Output token throughput (tok/s): 16586.93
After: Output token throughput (tok/s): 18953.92
Speedup: 14.3%