ggml-cuda: gdn use shared mem for HIP #20282
Conversation
ggml-cuda: gdn use shared mem for HIP

This PR optimizes GDN operations for AMD GPUs (HIP) by using shared memory instead of registers, improving performance on HIP/ROCm platforms. Main changes:
- Use shared memory for state data on the HIP platform
- Fix assertion in mamba-base.cpp (n_embd vs d_state)
- Remove obsolete map_developer_role_to_system function
- Enhance chat template tests
Yes, this resolves the excessive register spilling on GCN/CDNA: b8232
hip-gdn-use-shared
As expected. However, this is a pessimization for HIP devices whose register file is large enough for the original kernel: master
hip-gdn-use-shared
We need to avoid doing this on RDNA3 and RDNA4, which have large register files, and maybe also on the other RDNA variants (the safe thing is to apply it for those for now, however).
Also, to no one's surprise:
I tested the hip-gdn-use-shared version. Qwen3.5 35B and 122B both get a nice boost, but Nemotron-3-Nano (MXFP4_MOE quant by noctrex) now fails to load. I have a mixed deck of RTX 4090 + 2x MI100 + 2x MI50, so I built a version with GGML_RPC=ON and related params. Nemotron-3-Nano does not load on MI50/MI100, nor does it load on the RTX 4090. On each card, llama.cpp dies on: /home/le_me/llama.cpp-fix/src/models/mamba-base.cpp:173: GGML_ASSERT(d_inner % (n_group*n_embd) == 0) failed
That is unrelated to this PR. It should be fixed by rebasing the PR onto master.
@IMbackK can you add the appropriate compiler flags for RDNA3/4 to enable register use whenever feasible?
Force-pushed 2d6b28d to f737b42
@am17an the HIP compiler will automatically use all registers available to the target architecture. All we need to do here is replace defined(GGML_USE_HIP) with !defined(RDNA3) || !defined(RDNA4) in the kernel, and adjust the kernel's smem based on CC. Btw, CUDA_SET_SHARED_MEMORY_LIMIT is a no-op on HIP: all AMD devices since 2012 have at least 64 KiB of smem.
@am17an this has several issues:
- defined(RDNA4) and friends can only be used in the kernel, not in host code; we don't know the GPU at compile time there.
- as mentioned, CUDA_SET_SHARED_MEMORY_LIMIT inside a HIP-only block is pointless
- your preprocessor condition in the kernel also matches all CUDA devices, as they are of course neither RDNA3 nor RDNA4
I opened a branch here that fixes these issues: #20366