[amd][gptoss] Perf gain because of block alignment #28024
heheda12345 merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a performance optimization for the fused_moe kernel on AMD GPUs by dynamically setting the padding alignment based on the GPU architecture. The changes replace a hardcoded padding value with a function that queries the hardware, which should improve performance as described. My review identifies a critical issue where the new utility function could cause a runtime crash if the optional triton package is not installed. I've provided a suggestion to make the code more robust by adding a check for Triton's availability.
Summary: The following patch is from Aliasger Zaidy (azaid) and Shucai Xiao (scxiao) from AMD; the overall integration effort was guided by Xiaozhu Meng (mxz297) from Meta. It boosts the performance of the fused_moe kernel. We pad to 128 for MI300 to avoid masked loads. We pad to 256 for MI355 because we use scale preshuffling on MI355, and padding to 256 is needed to enable the correct preshuffle arrangement. A 10% performance boost is achieved for gptoss120b on an AMD MI300 machine.

Test Plan: No eval regression is observed. Eval on aime25:

With patch:

| Effort Level | Score | Characters | Chars Std | Score Std |
|--------------|-------|------------|-----------|-----------|
| Low | 0.51 | 1577.26 | 1001.32 | 0.49 |
| Medium | 0.79 | 1991.975 | 785.68 | 0.40 |
| High | 0.916 | 2568 | 1029.9 | 0.28 |

Without patch:

| Effort Level | Score | Characters | Chars Std | Score Std |
|--------------|-------|------------|-----------|-----------|
| Low | 0.51 | 1570.26 | 1001.32 | 0.49 |
| Medium | 0.79 | 1990.975 | 780.68 | 0.40 |
| High | 0.916 | 2508 | 1020.9 | 0.28 |

Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com>
```python
def get_padding_alignment():
    return (
        256
        if triton.runtime.driver.active.get_current_target().arch in ("gfx950",)
```
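Following the review note that this could crash when the optional triton package is missing, a more defensive version might look like the sketch below. This is only an illustration, not the PR's final code; the `arch` parameter and the 128 fallback default are assumptions added here for testability.

```python
def get_padding_alignment(arch=None):
    """Pick the fused_moe block padding for the current GPU.

    256 enables the scale-preshuffle arrangement used on MI355 (gfx950);
    128 avoids masked loads on MI300 and other architectures.
    """
    if arch is None:
        try:
            import triton
            arch = triton.runtime.driver.active.get_current_target().arch
        except ImportError:
            # Triton not installed: fall back to the common alignment.
            return 128
    return 256 if arch in ("gfx950",) else 128
```

With this shape, `get_padding_alignment("gfx950")` yields 256 and any other architecture string yields 128, while an environment without Triton no longer raises at import-resolution time.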
The default value was 256. Would it be safer to only update MI300's alignment to 128? Or do you think 128 will be faster on other architectures too?
256 is needed to enable the correct preshuffle arrangement, and scale pre-shuffling is only used on MI355; that's why I believe 128 will be faster on other architectures.
HAIAI left a comment:
@smitkadvani LGTM, thanks!
Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com> Co-authored-by: Smit Shaileshbhai Kadvani <kadvani@meta.com>
Summary:
The following patch is from Aliasger Zaidy (azaid) and Shucai Xiao (scxiao) from AMD; the overall integration effort was guided by Xiaozhu Meng (mxz297) from Meta. It boosts the performance of the fused_moe kernel.
We pad to 128 for MI300 to avoid masked loads.
We pad to 256 for MI355 because we use scale preshuffling on MI355, and padding to 256 is needed to enable the correct preshuffle arrangement.
A 10% performance boost is achieved for gptoss120b on an AMD MI300 machine.
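The padding described above is the usual round-up to an alignment boundary; a minimal sketch of the arithmetic (the helper name is illustrative, not the PR's code):

```python
def pad_to_multiple(n, align):
    """Round n up to the next multiple of align, e.g. an intermediate
    dimension padded out to the 128- or 256-element block boundary."""
    return ((n + align - 1) // align) * align
```

For example, a dimension of 2880 pads to 2944 under the 128 alignment but to 3072 under the 256 alignment, which is the trade-off the reviewers discuss below.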
Test Plan:
No eval regression is observed.
Eval on aime25 was run with and without the patch; the score tables in the commit message above show no regression.