HIP/ROCm: two crash fixes for TurboQuant KV cache on RDNA #4
Merged
Ooooze merged 2 commits into AtomicBot-ai:feature/turboquant-kv-cache on May 7, 2026
The HIP fattn-vec build list was missing three cross-type instances
(f16 key + turbo2/3/4 value) that were already present in the CUDA
CMakeLists. This caused linker errors of the form:
    undefined reference to void ggml_cuda_flash_attn_ext_vec_case<
        256, (ggml_type)1, (ggml_type)42/43/44>
when building llama-server with GGML_HIP=ON and TurboQuant KV cache
enabled.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Models with head_dim=512 (e.g. Gemma 4 E4B: n_embd=4096, n_head=8)
always use the TILE flash-attention path on AMD/HIP because VEC is
capped at head_dim<=256 and WMMA/MFMA explicitly exclude D=512.
Inside launch_fattn_tile_switch_ncols2<512,512>, the DKQ<=512 block
only had fallback cases for gqa_ratio divisible by 4 or 8, then a
DV<=256 guard for ratio=2/1. For DV=512 with gqa_ratio=2 (Gemma 4:
8 Q-heads / 4 KV-heads) the code fell through to GGML_ABORT.
Fix two things:
1. Dispatch: add ncols2=2 and ncols2=1 fallbacks inside the DKQ<=512
block for the DV>256 case, mirroring what already exists for DV<=256.
2. Kernel configs: add the missing ncols=2 entry for DKQ=DV=512 in all
four config tables (nvidia_fp16, nvidia_fp32, amd, amd_rdna).
Without these entries the device-side static_assert would fire at
compile time for flash_attn_tile<512,512,{1,2},2,*>.
Tested on gfx1150 (Ryzen AI HX 470, RDNA3.5) running Gemma 4 E4B
with -ctk turbo3 -ctv turbo3 and --mtp-head speculative decoding.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2e81dc5 into AtomicBot-ai:feature/turboquant-kv-cache
1 check passed
Thanks for the fix and the detailed writeup — much appreciated! |
Author
You're welcome! Your project is a nice improvement to llama.cpp. Do you
plan to upstream it some day?
Overview
This PR fixes a link-time error and a runtime crash that occur when using MTP models such as the Gemma 4 assistant.
Tested on a Ryzen AI HX 470 (gfx1150, RDNA3.5) running Gemma 4 E4B with:

    ./build/bin/llama-server \
      -m ./models/gemma-4-E4B-it-Q4_K_M.gguf \
      --mtp-head ./models/gemma-4-E4B-it-assistant.Q4_K_M.gguf \
      --spec-type mtp \
      --draft-block-size 3 --draft-max 8 --draft-min 0 \
      -ngl 99 -ngld 99 \
      -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
      -fa on -c 16384 --host 127.0.0.1 --port 8080

OS: Linux Mint 22.3 (based on Ubuntu 24.04) with ROCm 7.2.1 installed.
Fix 1 — HIP linker error: missing fattn-vec template instances
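ggml/src/ggml-hip/CMakeLists.txt was missing three cross-type flash-attention VEC instances (f16 key × turbo2/3/4 value) that were already present in ggml/src/ggml-cuda/CMakeLists.txt. This produced link errors of the following form at the final llama-server link step:

    undefined reference to void ggml_cuda_flash_attn_ext_vec_case<256, (ggml_type)1, (ggml_type)42/43/44>
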
Fix: added the three files to the HIP CMake list.
Fix 2 — Runtime GGML_ABORT in fattn-tile.cuh for head_dim=512
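Gemma 4 E4B has head_dim = 4096 / 8 = 512. For head_dim=512, all fast FA paths are excluded on AMD: VEC is capped at head_dim ≤ 256, and WMMA/MFMA explicitly exclude D=512.
So TILE is always selected. Inside launch_fattn_tile_switch_ncols2<512, 512>, the DKQ ≤ 512 block only handled gqa_ratio % 4 == 0 and gqa_ratio % 8 == 0, then a DV ≤ 256 guard for smaller ratios. For DV=512 with gqa_ratio=2 (Gemma 4: 8 Q-heads / 4 KV-heads) the code fell straight through to GGML_ABORT("fatal error").
Fix:
1. Dispatch: add ncols2=2 and ncols2=1 fallbacks inside the DKQ ≤ 512 block for the DV > 256 case, mirroring what already exists for DV ≤ 256.
2. Kernel configs: add the missing ncols=2 entry for DKQ=DV=512 in all four config tables (nvidia_fp16, nvidia_fp32, amd, amd_rdna).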
What this PR fixes
Before fix 1: linker error, binary not produced.
Before fix 2: crash during first decode step with fattn-tile.cuh:1263: fatal error.
After both fixes: server runs without crash, MTP speculative decoding functional.
Test procedure
Build
Benchmark
Here is a table with all the test results.
Commands used to launch the server:

Baseline - standard llama.cpp from llamacpp-rocm:

    ./llama-server -m ../atomic-llama-cpp-turboquant/models/gemma-4-E4B-it-Q4_K_M.gguf -ngl 99 -ngld 99 -fa on -c 16384 --host 127.0.0.1 --port 8080

KV Cache + MTP-HEAD:

    ./build/bin/llama-server -m ./models/gemma-4-E4B-it-Q4_K_M.gguf --mtp-head ./models/gemma-4-E4B-it-assistant.Q4_K_M.gguf --spec-type mtp --draft-block-size 3 --draft-max 8 --draft-min 0 -ngl 99 -ngld 99 -ctk turbo3 -ctvd turbo3 -fa on -c 16384 --host 127.0.0.1 --port 8080

MTP-HEAD only:

    ./build/bin/llama-server -m ./models/gemma-4-E4B-it-Q4_K_M.gguf --mtp-head ./models/gemma-4-E4B-it-assistant.Q4_K_M.gguf --spec-type mtp --draft-block-size 3 --draft-max 8 --draft-min 0 -ngl 99 -ngld 99 -fa on -c 16384 --host 127.0.0.1 --port 8080

KV Cache only:

    ./build/bin/llama-server -m ./models/gemma-4-E4B-it-Q4_K_M.gguf -ngl 99 -ngld 99 -ctk turbo3 -ctvd turbo3 -fa on -c 16384 --host 127.0.0.1 --port 8080

All tests were run with:
Requirements