CUDA: fix MMQ nwarps for AMD with warp_size==32 #15014

Merged

IMbackK merged 1 commit into ggml-org:master on Aug 1, 2025
Conversation
IMbackK (Collaborator) approved these changes on Aug 1, 2025

Static analysis looks good and I can also confirm this resolves the regression on gfx1030.
PR:
Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | pp512 | 1666.06 ± 2.48 |
master:
Device 0: AMD Radeon RX 6800 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | pp512 | 1363.94 ± 1.09 |
8133/8133 tests passed
Backend ROCm0: OK
Backend 2/2: CPU
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Aug 7, 2025:
…)" This reverts commit 9c35706.
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Oct 6, 2025
blime4 referenced this pull request in blime4/llama.cpp on Feb 5, 2026
It seems that in #14624 the number of warps for AMD was accidentally changed from 8 to 4 for all warp sizes. As a consequence, on AMD GPUs with a warp size of 32 performance regressed because the kernel was using only half the threads. This PR makes it so that a constant 256 threads are used, spread across either 4 or 8 warps depending on the warp size.