HIP: enable WMMA-MMQ INT kernels for RDNA 3 (#17576)
Conversation
ggml/src/ggml-cuda/mma.cuh

```cpp
    static constexpr int ne = I * J / 32;
#elif defined(RDNA3)
    static constexpr int ne = (I == 16 && J == 16) ? I * J / 32 : I * J / 16;
#endif
```

Suggested change:

```cpp
#endif
#endif // defined(RDNA4)
```

Please add comments indicating which `#if`/`#ifdef` each `#endif` is closing.
ggml/src/ggml-cuda/mmq.cu

```cpp
    if (GGML_CUDA_CC_IS_RDNA4(cc) || GGML_CUDA_CC_IS_RDNA3(cc)) {
        return true;
    }
```

Suggested change (the conditional is redundant, presumably because the result is `true` for the remaining cases as well):

```cpp
    return true;
```
ggml/src/ggml-cuda/mmq.cuh

```cpp
    A1.x[0] = 0x01010101;
    A1.x[1] = 0x01010101;
    A1.x[2] = 0x01010101;
    A1.x[3] = 0x01010101;
```

Suggested change:

```cpp
#pragma unroll
    for (int l = 0; l < tile_A::ne; ++l) {
        A1.x[l] = 0x01010101;
    }
```

To my understanding, tile_A has 4 elements for RDNA3 but only 2 for RDNA4. So as it is, this would result in out-of-bounds writes and potential memory trampling on RDNA4.
Performance
In terms of performance, I think this PR would be good to merge. There are some cases around batch size 32 with suboptimal performance, but that batch size is less important than the larger ones, so I think it would be fine to merge the PR as-is and perhaps optimize that use case in a follow-up PR. (Batch sizes 1-8 use the same code in both tests, so changes there are just random noise and can be ignored; I only included them to investigate the scaling.)

(This PR still needs a rebase on top of master.)
Force-pushed from a34b76f to c9ec96c.
The rebase has been done.
ggml/src/ggml-cuda/mma.cuh

```cpp
    );
#endif // defined(RDNA4)

#elif defined(RDNA3)
```

Suggested change (presumably stripping trailing whitespace, which plain text cannot show here):

```cpp
#elif defined(RDNA3)
```

To fix the EditorConfig CI.
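For context, EditorConfig CI failures of this kind are usually driven by rules along these lines (a hypothetical excerpt, not necessarily llama.cpp's actual `.editorconfig`):

```ini
# applies to all files in the repository
[*]
trim_trailing_whitespace = true
insert_final_newline = true
```

With `trim_trailing_whitespace = true`, any line ending in spaces or tabs fails the check, which is consistent with a suggestion whose only visible change is the line itself.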
Force-pushed from 5941226 to 685be0e.
There seem to be some minor issues on RDNA4: hjc4869@df264e1

Thank you for this! Those are great speed-up results. I did get some errors and my build failed, though; could you please take a look?

On GFX1100 + ROCm 6.4.1 it seems like this commit is causing

I believe this is due to the FP16/BF16 MMF kernels not having been enabled yet; once that PR (#17495) gets merged, this failure should no longer occur.

I suppose ROCm builds will just be broken for RDNA 3 until someone finds the time to finish that PR, then?
This reverts commit 668ed76.
* enabled wmma instructions for most quantizations other than q2k
* fixed the last q2_k test case failure
* address comments: fix out of bound write for RDNA4, add comments after #endif
* clean up rebase: fix ne error in half2
* fix the EditorConfig CI
Enabled WMMA-MMQ INT kernels for the RDNA 3 architecture on AMD GPUs, following a similar approach to #17156.

The performance results below were collected with `./build/bin/llama-bench`.

Build command:

```shell
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
    -DGGML_HIP=ON \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_HIP_UMA=OFF \
    -DGGML_HIP_ROCWMMA_FATTN=OFF \
    -DGPU_TARGETS="gfx1100" \
    -DGGML_HIP_GRAPHS=OFF \
    -DLLAMA_CURL=OFF \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --config Release -- -j 32
```
Popular models performance results for AMD Radeon AI PRO W7900 (gfx1100)
Popular models performance results for AMD Strix Halo (gfx1151)
All quantization performance results for AMD Radeon AI PRO W7900 (gfx1100)
All quantization performance results for AMD Strix Halo (gfx1151)