
ggml-cuda: gdn use shared mem for HIP#20282

Closed
am17an wants to merge 3 commits intoggml-org:masterfrom
am17an:hip-gdn-use-shared

Conversation

@am17an
Contributor

@am17an am17an commented Mar 9, 2026

Make sure to read the contributing guidelines before submitting a PR

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Mar 9, 2026
simonxluo added a commit to simonxluo/llama.cpp that referenced this pull request Mar 9, 2026
ggml-cuda: gdn use shared mem for HIP

This PR optimizes GDN operations for AMD GPU (HIP) by using shared memory
instead of registers, improving performance on HIP/ROCm platforms.

Main changes:
- Use shared memory for state data in HIP platform
- Fix assertion in mamba-base.cpp (n_embd vs d_state)
- Remove obsolete map_developer_role_to_system function
- Enhance chat template tests
simonxluo added a commit to simonxluo/llama.cpp that referenced this pull request Mar 9, 2026
…red"

This reverts commit ef56a76, reversing
changes made to f76565d.
@IMbackK
Collaborator

IMbackK commented Mar 9, 2026

Yes, this resolves the excessive register spill on GCN/CDNA:

master (b8232)
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 | 58.81 ± 0.20 |

hip-gdn-use-shared
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 | 63.65 ± 0.14 |

As expected, however, this is a pessimization for HIP devices whose register file is large enough for the original kernel.

master
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB (24482 MiB free)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | ROCm | 99 | 1 | tg128 | 31.16 ± 0.02 |

hip-gdn-use-shared
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB (24482 MiB free)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | ROCm | 99 | 1 | tg128 | 29.54 ± 0.04 |

We need to avoid doing this on RDNA3 and RDNA4, which have large register files, and maybe also on the other RDNA variants (the safe choice for now, however, is to apply it to those as well).

@IMbackK
Collaborator

IMbackK commented Mar 9, 2026

Also, to no one's surprise:

Backend 1/2: ROCm0
  Device description: AMD Instinct MI100
  Device memory: 32752 MB (32724 MB free)

  GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
  13/13 tests passed
  Backend ROCm0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

@androidi

androidi commented Mar 10, 2026

I tested the hip-gdn-use-shared version. Qwen3.5 35B and 122B both get a nice boost, but Nemotron-3-Nano (MXFP4_MOE quant by noctrex) now fails to load. I have a mixed deck of RTX 4090 + 2x MI100 + 2x MI50, so I've built a version with GGML_RPC=ON and related options. Nemotron-3-Nano does not load on the MI50/MI100, nor does it load on the RTX 4090.

On each card, llama.cpp dies on:

/home/le_me/llama.cpp-fix/src/models/mamba-base.cpp:173: GGML_ASSERT(d_inner % (n_group*n_embd) == 0) failed

@am17an
Contributor Author

am17an commented Mar 10, 2026

That is unrelated to this PR; it should be fixed by rebasing the PR onto master.

@am17an
Contributor Author

am17an commented Mar 10, 2026

@IMbackK can you add the appropriate compiler flags for RDNA3/4 to enable register use whenever feasible?

@am17an am17an force-pushed the hip-gdn-use-shared branch from 2d6b28d to f737b42 Compare March 10, 2026 01:57
@IMbackK
Collaborator

IMbackK commented Mar 10, 2026

@am17an the HIP compiler will use all registers available to the target architecture automatically.

All we need to do here is replace defined(GGML_USE_HIP) with !defined(RDNA3) && !defined(RDNA4) in the kernel, and adjust the kernel's smem usage based on the CC.

BTW, CUDA_SET_SHARED_MEMORY_LIMIT is a no-op on HIP; all AMD devices since 2012 have at least 64 KiB of smem.

@am17an am17an marked this pull request as ready for review March 10, 2026 10:15
Collaborator

@IMbackK IMbackK left a comment


@am17an this has several issues:

  1. defined(RDNA4) and friends can only be used in device code, not in host code; we don't know the GPU at compile time there.
  2. As mentioned, CUDA_SET_SHARED_MEMORY_LIMIT inside a HIP-only block is pointless.
  3. Your preprocessor condition in the kernel also matches all CUDA devices, since they are of course neither RDNA3 nor RDNA4.

I opened a branch here that fixes these issues: #20366

