
ggml-cuda: gdn use shared mem for HIP#20282

Closed
am17an wants to merge 3 commits intoggml-org:masterfrom
am17an:hip-gdn-use-shared

Conversation

@am17an
Contributor

@am17an am17an commented Mar 9, 2026

Make sure to read the contributing guidelines before submitting a PR

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Mar 9, 2026
simonxluo added a commit to simonxluo/llama.cpp that referenced this pull request Mar 9, 2026
ggml-cuda: gdn use shared mem for HIP

This PR optimizes GDN operations for AMD GPU (HIP) by using shared memory
instead of registers, improving performance on HIP/ROCm platforms.

Main changes:
- Use shared memory for state data in HIP platform
- Fix assertion in mamba-base.cpp (n_embd vs d_state)
- Remove obsolete map_developer_role_to_system function
- Enhance chat template tests
simonxluo added a commit to simonxluo/llama.cpp that referenced this pull request Mar 9, 2026
…red"

This reverts commit ef56a76, reversing
changes made to f76565d.
@IMbackK
Collaborator

IMbackK commented Mar 9, 2026

Yes, this resolves the excessive register spill on GCN/CDNA:

master (b8232)
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 | 58.81 ± 0.20 |

hip-gdn-use-shared
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 | 63.65 ± 0.14 |

As expected, however, this is a pessimization for HIP devices whose register file is large enough for the original kernel.

master
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB (24482 MiB free)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | ROCm | 99 | 1 | tg128 | 31.16 ± 0.02 |

hip-gdn-use-shared
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB (24482 MiB free)

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 27B Q4_K - Medium | 15.58 GiB | 26.90 B | ROCm | 99 | 1 | tg128 | 29.54 ± 0.04 |

We need to avoid doing this on RDNA3 and RDNA4, which have large register files, and maybe also on the other RDNA variants (the safe choice for now, however, is to apply it to those as well).

@IMbackK
Collaborator

IMbackK commented Mar 9, 2026

Also, to no one's surprise:

Backend 1/2: ROCm0
  Device description: AMD Instinct MI100
  Device memory: 32752 MB (32724 MB free)

  GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK
  13/13 tests passed
  Backend ROCm0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

@androidi

androidi commented Mar 10, 2026

I tested the hip-gdn-use-shared version. Qwen3.5 35B and 122B both get a nice boost, but Nemotron-3-Nano (MXFP4_MOE quant by noctrex) now fails to load. I have a mixed deck of RTX 4090 + 2x MI100 + 2x MI50, so I've built a version with GGML_RPC=ON and related options. Nemotron-3-Nano does not load on the MI50/MI100, nor does it load on the RTX 4090.

On each card, llama.cpp dies on:

/home/le_me/llama.cpp-fix/src/models/mamba-base.cpp:173: GGML_ASSERT(d_inner % (n_group*n_embd) == 0) failed

@am17an
Contributor Author

am17an commented Mar 10, 2026

That is unrelated to this PR; it should be fixed by rebasing the PR onto master.

@am17an
Contributor Author

am17an commented Mar 10, 2026

@IMbackK can you add the appropriate compiler flags for RDNA3/4 to enable register use whenever feasible?

@am17an am17an force-pushed the hip-gdn-use-shared branch from 2d6b28d to f737b42 Compare March 10, 2026 01:57
@IMbackK
Collaborator

IMbackK commented Mar 10, 2026

@am17an the HIP compiler will use all registers available to the target architecture automatically.

All we need to do here is replace defined(GGML_USE_HIP) with !defined(RDNA3) && !defined(RDNA4) in the kernel, and adjust the kernel's smem usage based on the CC.

BTW, CUDA_SET_SHARED_MEMORY_LIMIT is a no-op on HIP; all AMD devices since 2012 have at least 64 KiB of smem.

@am17an am17an marked this pull request as ready for review March 10, 2026 10:15
Collaborator

@IMbackK IMbackK left a comment


@am17an this has several issues:

  1. defined(RDNA4) and friends can only be used in device code, not in host code; we don't know the GPU at compile time there.
  2. As mentioned, CUDA_SET_SHARED_MEMORY_LIMIT inside a HIP-only block is pointless.
  3. Your preprocessor condition in the kernel also matches all CUDA devices, since they are of course neither RDNA3 nor RDNA4.

I opened a branch here that fixes these issues: #20366

