
[amd][gptoss] Perf gain because of block alignment #28024

Merged
heheda12345 merged 3 commits into vllm-project:main from smitkadvani:export-D84643814
Nov 7, 2025

Conversation

@smitkadvani
Contributor

@smitkadvani smitkadvani commented Nov 4, 2025

Summary:

Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com>

The following patch is from Aliasger Zaidy (azaid) and Shucai Xiao (scxiao) from AMD, with the overall integration effort guided by Xiaozhu Meng (mxz297) from Meta. It boosts the performance of the fused_moe kernel:

- We pad to 128 for MI300 to avoid masked loads.
- We pad to 256 for MI355 because we use scale preshuffling on MI355, and padding to 256 is needed to enable the correct preshuffle arrangement.

A 10% performance boost is achieved for gpt-oss-120b on an AMD MI300 machine.
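As a back-of-the-envelope illustration of what the alignment does (the helper and the intermediate size below are hypothetical, not code from this patch), the block alignment simply rounds a padded MoE dimension up to the nearest multiple:

```python
def round_up(x: int, align: int) -> int:
    # Round x up to the nearest multiple of align.
    return ((x + align - 1) // align) * align

# Illustrative intermediate size; the actual model dims may differ.
n = 2880
print(round_up(n, 128))  # MI300-style alignment -> 2944
print(round_up(n, 256))  # MI355-style alignment -> 3072
```

With 128-alignment, every tile load along the padded dimension is a full, unmasked load on MI300; the coarser 256 multiple additionally satisfies the layout that MI355's scale preshuffling expects.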

Test Plan:

No eval regression is observed.

Eval on aime25

With patch:

| Effort Level | Score | Characters | Chars Std | Score Std |
|--------------|-------|------------|-----------|-----------|
| Low | 0.51 | 1577.26 | 1001.32 | 0.49 |
| Medium | 0.79 | 1991.975 | 785.68 | 0.40 |
| High | 0.916 | 2568 | 1029.9 | 0.28 |

Without patch:

| Effort Level | Score | Characters | Chars Std | Score Std |
|--------------|-------|------------|-----------|-----------|
| Low | 0.51 | 1570.26 | 1001.32 | 0.49 |
| Medium | 0.79 | 1990.975 | 780.68 | 0.40 |
| High | 0.916 | 2508 | 1020.9 | 0.28 |


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the fused_moe kernel on AMD GPUs by dynamically setting the padding alignment based on the GPU architecture. The changes replace a hardcoded padding value with a function that queries the hardware, which should improve performance as described. My review identifies a critical issue where the new utility function could cause a runtime crash if the optional triton package is not installed. I've provided a suggestion to make the code more robust by adding a check for Triton's availability.
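A minimal sketch of that robustness suggestion (the guard pattern and the fallback value are assumptions for illustration, not the PR's actual code): probe for Triton at import time and fall back to the conservative pre-existing default of 256 when the query cannot be made.

```python
try:
    import triton
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False


def get_padding_alignment() -> int:
    # Fall back to the old hardcoded default when Triton is missing.
    if not HAS_TRITON:
        return 256
    try:
        arch = triton.runtime.driver.active.get_current_target().arch
    except Exception:
        return 256  # no active GPU backend available
    # gfx950 (MI355): 256 for scale preshuffling; others (e.g. MI300): 128.
    return 256 if arch in ("gfx950",) else 128
```

Falling back to 256 keeps the pre-patch behavior on any platform where the architecture cannot be determined, so the optimization only activates when the hardware is positively identified.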

@smitkadvani smitkadvani force-pushed the export-D84643814 branch 2 times, most recently from 5a537a4 to 85f3030 on November 4, 2025 06:12
@smitkadvani smitkadvani changed the title Perf gain because of block alignment [amd][gptoss] Perf gain because of block alignment Nov 4, 2025
@mergify mergify bot added gpt-oss Related to GPT-OSS models rocm Related to AMD ROCm labels Nov 4, 2025
@mergify mergify bot added the v1 label Nov 4, 2025
def get_padding_alignment():
    return (
        256
        if triton.runtime.driver.active.get_current_target().arch in ("gfx950",)
        else 128
    )
Collaborator


The default value was 256. Will it be safer to only update MI300's alignment to 128? Or do you think 128 will be faster on other architectures?

Contributor Author


256 is needed to enable the correct preshuffle arrangement, and scale preshuffling is only used on MI355; that's why I believe 128 will be faster on other architectures.

Collaborator

@HAIAI HAIAI left a comment


@smitkadvani LGTM, thanks!

Collaborator

@heheda12345 heheda12345 left a comment


LGTM!

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Nov 5, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) November 5, 2025 21:40
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 5, 2025
@heheda12345 heheda12345 merged commit 11fd69d into vllm-project:main Nov 7, 2025
53 checks passed
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com>
Co-authored-by: Smit Shaileshbhai Kadvani <kadvani@meta.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com>
Co-authored-by: Smit Shaileshbhai Kadvani <kadvani@meta.com>
@Rohan138 Rohan138 mentioned this pull request Jan 14, 2026

Labels

gpt-oss (Related to GPT-OSS models), ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done
