feat: ROCm/HIP support for turbo3 KV cache (gfx1100/RDNA3) #5
apollosenvy wants to merge 9 commits into TheTom:feature/turboquant-kv-cache from
Conversation
Complete experiment log:
TheTom#1 4-mag LUT: 15.1 at 8K (BEST, +38%)
TheTom#2 Batched extract: 13.7 (+25%)
TheTom#3 Inline FA block: 13.5 (I-cache pressure)
TheTom#4 Deferred norm: 12.9 (loses ILP)
TheTom#5 2-pair half2: 12.0 (ternary overhead)
TheTom#6 Select chain: 11.9 (branches kill)
TheTom#7 Bit-arithmetic: 11.6 (ALU too heavy)
TheTom#8 FMA branchless: 11.4 (ALU still too heavy)
TheTom#9 Named-reg ternary: 10.3 (branches worst)
TheTom#10 Main (8-LUT): 10.95 (baseline)
TheTom#11 Non-vec FA: 10.2 (wrong kernel)
Ceiling: 24.5 (no dequant)

Apple8 hardware truth:
- 1 divergent constant read < 7 ALU ops (even with fma)
- Branches cost MORE than divergent constant reads
- Array indexing ALWAYS spills on Metal
- 4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
this is seriously impressive work, full ROCm backend in one PR. and thanks for finding the K/V rotation issue in cpy_k/cpy_v. our Metal path handles rotation differently (graph-side Q rotation + inverse rotation after attention, rather than rotating K/V before cache write), so it may not be the same bug, but i want to verify. will test locally on M5 Max this afternoon to check for regressions before merging. would also appreciate it if you could run the PPL gate when you get a chance:
Test Results — M5 Max 128GB, Qwen3.5-35B-A3B Q8_0
Build: Clean compile on Metal.
Good catch on the double-rotation. Pushed a fix: Running PPL gate on our end now (wikitext-2-raw, turbo3 vs FP16 baseline on Qwen3.5-27B Q4_K_M, gfx1100). Will post results when done. Re: the longer-term architecture question (graph vs kernel rotation) — happy to discuss. For now the preprocessor gate is minimal and correct for both paths. If you prefer moving Metal's kernel rotation into the graph to unify the approach, we can help with that in a follow-up.
PPL Gate Results
Model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled Q4_K_M
The +28.8% regression is higher than expected (~1% in your Metal benchmarks). This model is already Q4_K_M weight-quantized, so turbo3 KV quantization stacks on top of existing weight quantization loss. Would be useful to compare with an F16/BF16 weight model to isolate the KV-only contribution. The Metal regression should be zero with the preprocessor gate since the graph-side rotation is now compiled out for Metal builds.
Multi-model PPL Gate Update
Attempted turbo3 PPL tests on all available GGUFs. Results:
Finding: turbo3 currently only works reliably on Qwen3.5 (head_dim=256). Models with head_dim=128 crash during KV cache initialization. This is likely because the WHT rotation group size (128) matches head_dim exactly, causing edge cases in the graph-side cast/rotation code. The Qwen3.5 result (PPL +29%) is higher than expected for turbo3, but the model is already Q4_K_M weight-quantized. An F16/BF16 weight model would show the true turbo3-only regression. Will investigate the head_dim=128 compatibility issue next.
Adaptive WHT + Multi-Head-Dim Support
Pushed
What was broken
Three bugs prevented turbo3 from working on anything other than Qwen3.5 (head_dim=256):
Fix: Adaptive Group Size
WHT group_size is now computed as the largest power of 2 dividing head_dim, capped at 128:
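A minimal sketch of that rule (the helper name here is illustrative, not necessarily the branch's actual function):

```cpp
#include <cassert>

// Largest power of 2 dividing head_dim, capped at 128 (sketch of the
// adaptive group-size rule described above).
static int wht_group_size(int head_dim) {
    int g = head_dim & -head_dim;  // lowest set bit = largest 2^k divisor
    return g > 128 ? 128 : g;
}
```

This yields 128 for head_dim 128 and 256, 64 for head_dim 64, and 32 for head_dim 96 (96 = 32 x 3).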
Sign arrays truncated to first
Benchmark Results (7900 XTX, ROCm 7.1)
PPL Investigation (head_dim=128)
Mistral-Small PPL with turbo3 is catastrophic (~15000 vs 5.16 F16 baseline). Tested both with and without WHT rotation -- PPL is nearly identical either way, confirming the issue is the 3-bit centroid quantization itself, not the rotation. The centroids may need model-family-specific calibration, or head_dim=128 models may need 4-bit (turbo4) for acceptable quality. Q8_0 KV cache on the same model gives PPL 5.17 (matching F16), so the model's attention values are perfectly quantizable -- just not at 3 bits. Next: investigating whether turbo4 (4-bit) resolves the quality gap for head_dim=128 models.
thanks for the thorough work here, especially the adaptive WHT fix and tracking down the dead rotation matrices exhausting the context pool. the multi-model compatibility sweep is exactly the kind of testing that catches real issues. i'm currently tied up with some other validation runs but will get to testing and review once my bench frees up. the head_dim=128 PPL being catastrophic at turbo3 lines up with what buun's been seeing on the CUDA side too — turbo4 path for head_dim=128 is likely the right direction. will follow up once i can give this a proper look.
Root Cause: PolarQuant Reconstruction Model vs Uniform Quantization
Dug deeper into the head_dim=128 PPL catastrophe. It's NOT a bit depth problem:
turbo4 has MORE bits than q4_0 but 1000x worse PPL. The issue is PolarQuant's reconstruction model:

Tried recalibrating centroids for the actual unit-sphere distribution (d=32) -- the current centroids are tuned for N(0,1/128), which has a 2x narrower range than the per-block-normalized distribution. Wider centroids didn't help (PPL got worse, possibly due to CPU/GPU centroid mismatch on layer 0).

The fundamental issue: Mistral-Small's attention heads have non-uniform value distributions where the PolarQuant assumption (values are roughly equi-distributed around the block centroid) breaks down. Some dimensions carry disproportionate signal. Possible fixes:
Also pushed turbo4 CUDA/HIP implementation (works on Qwen3.5-27B, PPL 2.28) and corrected centroids analysis to the branch. Holding push until you review.
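As a back-of-envelope sanity check on the "2x narrower" claim above (my arithmetic, not from the branch): a per-block-normalized vector on the unit sphere in d = 32 has per-component second moment 1/32, while the training assumption N(0, 1/128) has variance 1/128, so the ratio of standard deviations is exactly 2:

```latex
\sigma_{\text{block}} = \frac{1}{\sqrt{32}} \approx 0.177, \qquad
\sigma_{\text{train}} = \frac{1}{\sqrt{128}} \approx 0.0884, \qquad
\frac{\sigma_{\text{block}}}{\sigma_{\text{train}}} = \sqrt{\frac{128}{32}} = 2 .
```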
> apollosenvy (quoting the reconstruction-model analysis above): Possible fixes: Hybrid quantization: Use turbo3/turbo4 for Qwen-family models (where it works), fall back to q4_0/q4_1 for others
Two things here: On the adaptive WHT fix (2c3bb08) — really solid work. The dead rotation matrices exhausting the context pool is something I’d hit as well but hadn’t pushed a fix for on CUDA/HIP yet. CPU dup handler and adaptive group sizing are nice catches. Good to see hd64/96/128 actually running. On the PolarQuant reconstruction analysis — I’d be careful about calling it a fundamental issue just yet. There are still pipeline-level bugs that can produce exactly the kind of catastrophic PPL you’re seeing. Two data points from my side:
I just finished bringing turbo4 back up on Metal and ended up finding several bugs that massively distorted PPL (one SET_ROWS packing issue alone took it from ~679 down to ~6.1). So I’d focus on ruling out pipeline issues first. A few specific things I’d check:
Also, your turbo4 CUDA/HIP result on Qwen (PPL ~2.28) is actually a strong signal that the reconstruction model itself is sound when the pipeline is correct. The interesting question is what differs between that path and the Mistral one beyond head_dim. Happy to take a look once you’ve had a chance to cross-check — no rush.
Please sync to TOT and I'll review and test.
…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3 flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support both 3-arg and 4-arg calls (CUDA allows defaulting width to warpSize; HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit on wave64 platforms; turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files as CUDA CMakeLists; were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16). Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug (fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
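The 3-vs-4 argument shuffle fix can be handled with an arity-dispatch macro that defaults the missing width argument. A portable sketch of the technique, with a stub standing in for the real intrinsic (all names here are illustrative, not the PR's actual shim):

```cpp
#include <cassert>
#include <cstdint>

// Stub standing in for the real __shfl intrinsic: just echoes the value,
// so the dispatch mechanics can be exercised off-device.
static uint32_t shfl_impl(uint32_t mask, uint32_t v, int lane, int width) {
    (void)mask; (void)lane; (void)width;
    return v;
}

// Pick the expansion by argument count: 3 args get a default width of 32
// (warpSize on CUDA, wave32 on RDNA3); 4 args pass width through.
#define SHFL_PICK(_1, _2, _3, _4, NAME, ...) NAME
#define SHFL3(mask, v, lane)        shfl_impl(mask, v, lane, 32)
#define SHFL4(mask, v, lane, width) shfl_impl(mask, v, lane, width)
#define my_shfl_sync(...) SHFL_PICK(__VA_ARGS__, SHFL4, SHFL3, unused)(__VA_ARGS__)
```

Both `my_shfl_sync(mask, x, lane)` and `my_shfl_sync(mask, x, lane, 16)` then compile against the same macro; the wave64 `__ballot` width mismatch is a separate one-line `static_cast<uint32_t>` wrapper.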
Synced to TOT — HIP port on
| Model | KV Type | PPL | vs F16 |
|---|---|---|---|
| Mistral-Small-24B (hd128) | F16 | 5.16 | baseline |
| Mistral-Small-24B (hd128) | turbo3 | 5.28 | +2.4% |
| Mistral-Small-24B (hd128) | q4_0 | 5.26 | +1.9% |
PolarQuant IS sound. The catastrophic PPL was 100% the CPU quantize stub zeroing qs/signs on layer 0. My earlier "fundamental limitation" analysis was wrong — thanks for pushing back on that.
Branch is rocm-turbo3-v2 on fork. Ready for review.
Two-dimensional KV cache compression: static per-layer type selection (vertical, existing TURBO_LAYER_ADAPTIVE) combined with dynamic per-position quality degradation (horizontal, new TURBO_DECAY).

Old tokens get their QJL refinement bits zeroed in-place, reducing effective precision without changing the ggml tensor type. Tier boundaries shift dynamically with context length. Re-quantization piggybacks on SET_ROWS dispatch -- one memset per block per token.

Three presets: conservative (25/40/35), balanced (15/35/50), aggressive (5/25/70). Custom percentages via TURBO_DECAY=H,W,C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Task 1: Config struct + env var parsing
Task 2: ggml TURBO_DECAY op declaration
Task 3: CUDA/HIP/CPU decay kernel (signs zeroing)
Task 4: Wire into cpy_k/cpy_v graph dispatch
Task 5: PPL validation benchmarks
Task 6: Long-context stress test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- turbo3 signs field is 3rd centroid bit, not QJL. Zeroing reduces 8-level to 4-level reconstruction (corrected description)
- turbo4 has two compile-time variants: 4-bit (no QJL, skip decay) vs legacy 3-bit+QJL (zero rnorm). Both documented.
- Boundaries computed in position space, not cell index space (ring buffer ordering is non-monotonic)
- Multi-sequence policy: conservative (cold for ALL sequences). Phase 1 single-sequence only.
- VRAM claim removed (decay doesn't reduce VRAM; it improves quality-at-equal-VRAM)
- Added edge cases: SWA, seq_rm, defrag, serialization, idempotency
- Clarified graph execution model (ggml op, not side-effect)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
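To illustrate the first point, zeroing the 3rd centroid bit folds the 8 possible index values onto 4 (a toy illustration of the level collapse, not the branch's actual block layout):

```cpp
#include <cassert>

// Dropping bit 2 (the "signs" bit) collapses the 3-bit centroid index
// from 8 reconstruction levels to 4: indices 4..7 fold onto 0..3.
static int demote_idx(int idx) { return idx & 0b011; }
```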
Adds the configuration infrastructure for TurboQuant+ temporal decay:
- turbo_decay_config struct (public, in llama_kv_cache) with hot/warm/cold percentage splits and hot_promote flag
- parse_turbo_decay() static function reads TURBO_DECAY env var with named presets (conservative/balanced/aggressive) or custom h,w,c splits
- TURBO_DECAY_HOT_PROMOTE env var for phase 2 q8_0 promotion
- Per-layer last_cold_boundary tracking vector initialized in constructor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the GGML_OP_TURBO_DECAY enum value, ggml_turbo_decay() factory function, and name/symbol table entries. The op records a position range [cold_start, cold_end) in op_params; the backend kernel (Task 3) will zero the signs field of turbo3/turbo4 KV cache blocks in that range. GGML_OP_COUNT static_assert bumped to 98. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements GGML_OP_TURBO_DECAY backend kernels:
- CUDA/HIP: k_turbo3_decay zeros signs[] to collapse 3-bit to 2-bit reconstruction; k_turbo4_decay (legacy only) also zeros rnorm
- CPU: single-threaded memset fallback for both turbo3 and turbo4
- Properly gated behind TURBO4_USE_4BIT for turbo4 (4-bit variant has no signs field to zero)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After writing new KV entries via ggml_set_rows, append a ggml_turbo_decay node that demotes old positions past the cold boundary. Only fires for turbo3/turbo4 caches in single-sequence mode when TURBO_DECAY is enabled. The last_cold_boundary vector is made mutable so const cpy_k/cpy_v methods can track boundary advancement per layer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5-27B Q4_K_M, turbo3 KV, 7900 XTX, ctx=2048, 4 chunks:

PPL:
  no decay:   7.58 (baseline)
  balanced:   8.43 (+11.2%)
  aggressive: 9.81 (+29.4%)

Speed (pp128 / tg128):
  no decay: 528 / 19.07 t/s
  balanced: 530 / 18.62 t/s (within noise)

PPL regression is expected at short context (50-70% of 2K tokens get degraded). At longer contexts the cold tier covers older tokens with lower attention weight, so the quality impact decreases. Zero measurable speed overhead from decay op.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TurboQuant+ Temporal Decay -- Implemented
Pushed to
How it works
Three dynamic tiers with proportional boundaries:
Boundaries shift proportionally as context grows. Demotion is a single
Configuration
TURBO_DECAY=balanced     # 15/35/50 (default)
TURBO_DECAY=conservative # 25/40/35
TURBO_DECAY=aggressive # 5/25/70
TURBO_DECAY=10,30,60     # custom percentages
Benchmark (Qwen3.5-27B Q4_K_M, 7900 XTX)
PPL (ctx=2048, 4 chunks):
Speed: Zero measurable overhead (within noise). The PPL hit at 2K is expected -- at short context, the cold tier covers tokens the model still actively attends to. At longer contexts (16K+), the cold tier covers much older tokens with naturally lower attention weight, so the quality impact should decrease. This is the design tradeoff: sacrifice some short-context quality for better long-context compression efficiency.
Implementation (8 commits)
Phase 2 (hot tier q8_0 promotion) and Phase 3 (attention-guided decay) designed but deferred.
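The proportional boundary math described above can be sketched as follows (struct and function names are assumptions for illustration, not the branch's actual code):

```cpp
#include <cassert>

// Hot/warm/cold split in percent (sums to 100), e.g. balanced = 15/35/50.
struct decay_split { int hot, warm, cold; };

// First position demoted to the cold tier: everything older than the
// most recent (hot + warm)% of cached positions.
static int cold_boundary(decay_split s, int n_past) {
    int keep = n_past * (s.hot + s.warm) / 100;
    return n_past - keep;  // positions [0, boundary) are cold
}
```

With the balanced preset at 2048 cached positions, the oldest 1024 positions (50%) fall in the cold tier, and the boundary advances as the context grows.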
bf908c0 to a8a2f31
Tested on M5 Max 128GB (Metal) and M2 Pro 32GB (Metal). Thanks for the work here, some notes below.
What passes (head_dim=128 models)
PPL and speed are clean for standard models:
Build passes on Metal with no warnings.
Issue 1: TURBO_DECAY crashes on Metal
The temporal decay op has CUDA/HIP/CPU kernels but no Metal kernel. When
Repro:
TURBO_DECAY=balanced ./build/bin/llama-cli \
-m model.gguf -ngl 99 -fa on \
-ctk turbo3 -ctv turbo3 \
-p "Write a 500 word essay about the history of computing" -n 500
This is a crash, not a graceful fallback. The decay op needs either a Metal kernel or a
Issue 2: Boundary V modes 5/6/7 removed
The PR removes layer-adaptive modes 5, 6, and 7 (Boundary V) from the constructor. These were shipped in PR #30 and are part of our current feature set.
Suggestion
Consider splitting this into separate PRs:
Happy to re-review individual pieces. The ROCm work is great and would be straightforward to land on its own.
Split per your suggestion:
Re: modes 5/6/7 — we didn't remove them. Our branch was based on a commit before PR #30. The new PR #31 is rebased on your latest, which includes Boundary V, so modes 5/6/7 are intact.
Summary
Adds complete ROCm/HIP backend support for turbo3 KV cache quantization, enabling 4.6x KV compression on AMD GPUs.
Tested on: AMD Radeon RX 7900 XTX (gfx1100), ROCm 7.1, Qwen3.5-27B Q4_K_M
What's included
need_f16_K/V conversion path in launch_fattn.
Critical bugfix: K/V rotation before cache write
Found and fixed a quality issue: K and V vectors were being quantized to turbo3 without WHT rotation. The graph rotated Q (forward) and inverse-rotated the attention output, but the K/V write path in
cpy_k()/cpy_v() was missing the forward rotation step.
Before fix:
<accepts high pressure "Itself" - ount>& & nsp; a- xii'After fix:
7 x 8 = 56. This is one of the most common multiplication facts.
This fix is in
src/llama-kv-cache.cpp and may also benefit the Metal path if it has the same omission.
Performance
*Generation speed regression is from
AMD_SERIALIZE_KERNEL=3 requirement (see below).
Known limitation
Requires
AMD_SERIALIZE_KERNEL=3 environment variable. Without it, the multi-stream graph scheduler has a race condition between KV cache write and FA read ops. Per-op cudaStreamSynchronize is not sufficient. This is a ROCm runtime issue, not a turbo3 bug.
Files changed (24 files, +595 lines)
New files (5):
- dequantize-turbo.cuh - Device dequant (centroid LUT)
- vecdot-turbo.cuh - MMVQ vec_dot
- turbo-wht.cu/cuh - WHT butterfly kernel
- fattn-vec-instance-turbo3_0-turbo3_0.cu - FA template (D=256)
Modified (19): common.cuh, mmvq.cu, fattn-common.cuh, fattn-vec.cuh, fattn.cu, convert.cu, cpy-utils.cuh, cpy.cu, set-rows.cu, getrows.cu, ggml-cuda.cu, CMakeLists.txt (cuda+hip), ggml-turbo-quant.c, ggml-quants.h, ggml-cpu.c, quants.h, llama-kv-cache.cpp
Test plan
--cache-type-k turbo3 --cache-type-v turbo3