feat: ROCm/HIP support for turbo3 KV cache (gfx1100/RDNA3) #5
apollosenvy wants to merge 9 commits into TheTom:feature/turboquant-kv-cache from
Conversation
Complete experiment log:
TheTom#1 4-mag LUT: 15.1 at 8K (BEST, +38%)
TheTom#2 Batched extract: 13.7 (+25%)
TheTom#3 Inline FA block: 13.5 (I-cache pressure)
TheTom#4 Deferred norm: 12.9 (loses ILP)
TheTom#5 2-pair half2: 12.0 (ternary overhead)
TheTom#6 Select chain: 11.9 (branches kill)
TheTom#7 Bit-arithmetic: 11.6 (ALU too heavy)
TheTom#8 FMA branchless: 11.4 (ALU still too heavy)
TheTom#9 Named-reg ternary: 10.3 (branches worst)
TheTom#10 Main (8-LUT): 10.95 (baseline)
TheTom#11 Non-vec FA: 10.2 (wrong kernel)
Ceiling: 24.5 (no dequant)

Apple8 hardware truth:
- 1 divergent constant read < 7 ALU ops (even with fma)
- Branches cost MORE than divergent constant reads
- Array indexing ALWAYS spills on Metal
- 4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
this is seriously impressive work, full ROCm backend in one PR. and thanks for finding the K/V rotation issue in cpy_k/cpy_v. our Metal path handles rotation differently (graph-side Q rotation + inverse rotation after attention, rather than rotating K/V before cache write), so it may not be the same bug, but i want to verify. will test locally on M5 Max this afternoon to check for regressions before merging. would also appreciate it if you could run the PPL gate when you get a chance:
Test Results — M5 Max 128GB, Qwen3.5-35B-A3B Q8_0
Build: Clean compile on Metal.
Good catch on the double-rotation. Pushed a fix: Running PPL gate on our end now (wikitext-2-raw, turbo3 vs FP16 baseline on Qwen3.5-27B Q4_K_M, gfx1100). Will post results when done. Re: the longer-term architecture question (graph vs kernel rotation) — happy to discuss. For now the preprocessor gate is minimal and correct for both paths. If you prefer moving Metal's kernel rotation into the graph to unify the approach, we can help with that in a follow-up.
PPL Gate Results
Model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled Q4_K_M
The +28.8% regression is higher than expected (~1% in your Metal benchmarks). This model is already Q4_K_M weight-quantized, so turbo3 KV quantization stacks on top of existing weight quantization loss. Would be useful to compare with an F16/BF16 weight model to isolate the KV-only contribution. The Metal regression should be zero with the preprocessor gate since the graph-side rotation is now compiled out for Metal builds.
Multi-model PPL Gate Update
Attempted turbo3 PPL tests on all available GGUFs. Results:
Finding: turbo3 currently only works reliably on Qwen3.5 (head_dim=256). Models with head_dim=128 crash during KV cache initialization. This is likely because the WHT rotation group size (128) matches head_dim exactly, causing edge cases in the graph-side cast/rotation code. The Qwen3.5 result (PPL +29%) is higher than expected for turbo3, but the model is already Q4_K_M weight-quantized. An F16/BF16 weight model would show the true turbo3-only regression. Will investigate the head_dim=128 compatibility issue next.
Adaptive WHT + Multi-Head-Dim Support
Pushed
What was broken
Three bugs prevented turbo3 from working on anything other than Qwen3.5 (head_dim=256):
Fix: Adaptive Group Size
WHT group_size is now computed as the largest power of 2 dividing head_dim, capped at 128:
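A minimal sketch of that rule (the helper name here is illustrative, not necessarily the branch's actual function):

```cpp
#include <cassert>

// Largest power of 2 dividing head_dim, capped at 128 (sketch of the
// adaptive group-size rule described above).
static int wht_group_size(int head_dim) {
    int g = head_dim & -head_dim;  // lowest set bit = largest 2^k divisor
    return g > 128 ? 128 : g;
}
```

This yields 128 for head_dim 128 and 256, 64 for head_dim 64, and 32 for head_dim 96 (96 = 32 x 3).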
Sign arrays truncated to first
Benchmark Results (7900 XTX, ROCm 7.1)
PPL Investigation (head_dim=128)
Mistral-Small PPL with turbo3 is catastrophic (~15000 vs 5.16 F16 baseline). Tested both with and without WHT rotation -- PPL is nearly identical either way, confirming the issue is the 3-bit centroid quantization itself, not the rotation. The centroids may need model-family-specific calibration, or head_dim=128 models may need 4-bit (turbo4) for acceptable quality. Q8_0 KV cache on the same model gives PPL 5.17 (matching F16), so the model's attention values are perfectly quantizable -- just not at 3 bits. Next: investigating whether turbo4 (4-bit) resolves the quality gap for head_dim=128 models.
thanks for the thorough work here, especially the adaptive WHT fix and tracking down the dead rotation matrices exhausting the context pool. the multi-model compatibility sweep is exactly the kind of testing that catches real issues. i'm currently tied up with some other validation runs but will get to testing and review once my bench frees up. the head_dim=128 PPL being catastrophic at turbo3 lines up with what buun's been seeing on the CUDA side too — turbo4 path for head_dim=128 is likely the right direction. will follow up once i can give this a proper look.
Root Cause: PolarQuant Reconstruction Model vs Uniform Quantization
Dug deeper into the head_dim=128 PPL catastrophe. It's NOT a bit depth problem:
turbo4 has MORE bits than q4_0 but 1000x worse PPL. The issue is PolarQuant's reconstruction model:

Tried recalibrating centroids for the actual unit-sphere distribution (d=32) -- the current centroids are tuned for N(0,1/128), which has a 2x narrower range than the per-block-normalized distribution. Wider centroids didn't help (PPL got worse, possibly due to CPU/GPU centroid mismatch on layer 0).

The fundamental issue: Mistral-Small's attention heads have non-uniform value distributions where the PolarQuant assumption (values are roughly equi-distributed around the block centroid) breaks down. Some dimensions carry disproportionate signal. Possible fixes:
Also pushed turbo4 CUDA/HIP implementation (works on Qwen3.5-27B, PPL 2.28) and corrected centroids analysis to the branch. Holding push until you review.
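As a back-of-envelope sanity check on the "2x narrower" claim above (my arithmetic, not from the branch): a per-block-normalized vector on the unit sphere in d = 32 has per-component second moment 1/32, while the training assumption N(0, 1/128) has variance 1/128, so the ratio of standard deviations is exactly 2:

```latex
\sigma_{\text{block}} = \frac{1}{\sqrt{32}} \approx 0.177, \qquad
\sigma_{\text{train}} = \frac{1}{\sqrt{128}} \approx 0.0884, \qquad
\frac{\sigma_{\text{block}}}{\sigma_{\text{train}}} = \sqrt{\frac{128}{32}} = 2 .
```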
> apollosenvy (quoting the reconstruction-model analysis above): Possible fixes: Hybrid quantization: Use turbo3/turbo4 for Qwen-family models (where it works), fall back to q4_0/q4_1 for others
Two things here: On the adaptive WHT fix (2c3bb08) — really solid work. The dead rotation matrices exhausting the context pool is something I’d hit as well but hadn’t pushed a fix for on CUDA/HIP yet. CPU dup handler and adaptive group sizing are nice catches. Good to see hd64/96/128 actually running. On the PolarQuant reconstruction analysis — I’d be careful about calling it a fundamental issue just yet. There are still pipeline-level bugs that can produce exactly the kind of catastrophic PPL you’re seeing. Two data points from my side:
I just finished bringing turbo4 back up on Metal and ended up finding several bugs that massively distorted PPL (one SET_ROWS packing issue alone took it from ~679 down to ~6.1). So I’d focus on ruling out pipeline issues first. A few specific things I’d check:
Also, your turbo4 CUDA/HIP result on Qwen (PPL ~2.28) is actually a strong signal that the reconstruction model itself is sound when the pipeline is correct. The interesting question is what differs between that path and the Mistral one beyond head_dim. Happy to take a look once you’ve had a chance to cross-check — no rush.
Please sync to TOT and I'll review and test.
…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3 flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support both 3-arg and 4-arg calls (CUDA allows defaulting width to warpSize; HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit on wave64 platforms; turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files as CUDA CMakeLists; were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16). Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug (fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
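The 3-vs-4 argument shuffle fix can be handled with an arity-dispatch macro that defaults the missing width argument. A portable sketch of the technique, with a stub standing in for the real intrinsic (all names here are illustrative, not the PR's actual shim):

```cpp
#include <cassert>
#include <cstdint>

// Stub standing in for the real __shfl intrinsic: just echoes the value,
// so the dispatch mechanics can be exercised off-device.
static uint32_t shfl_impl(uint32_t mask, uint32_t v, int lane, int width) {
    (void)mask; (void)lane; (void)width;
    return v;
}

// Pick the expansion by argument count: 3 args get a default width of 32
// (warpSize on CUDA, wave32 on RDNA3); 4 args pass width through.
#define SHFL_PICK(_1, _2, _3, _4, NAME, ...) NAME
#define SHFL3(mask, v, lane)        shfl_impl(mask, v, lane, 32)
#define SHFL4(mask, v, lane, width) shfl_impl(mask, v, lane, width)
#define my_shfl_sync(...) SHFL_PICK(__VA_ARGS__, SHFL4, SHFL3, unused)(__VA_ARGS__)
```

Both `my_shfl_sync(mask, x, lane)` and `my_shfl_sync(mask, x, lane, 16)` then compile against the same macro; the wave64 `__ballot` width mismatch is a separate one-line `static_cast<uint32_t>` wrapper.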
Synced to TOT — HIP port on
| Model | KV Type | PPL | vs F16 |
|---|---|---|---|
| Mistral-Small-24B (hd128) | F16 | 5.16 | baseline |
| Mistral-Small-24B (hd128) | turbo3 | 5.28 | +2.4% |
| Mistral-Small-24B (hd128) | q4_0 | 5.26 | +1.9% |
PolarQuant IS sound. The catastrophic PPL was 100% the CPU quantize stub zeroing qs/signs on layer 0. My earlier "fundamental limitation" analysis was wrong — thanks for pushing back on that.
Branch is rocm-turbo3-v2 on fork. Ready for review.
Two-dimensional KV cache compression: static per-layer type selection (vertical, existing TURBO_LAYER_ADAPTIVE) combined with dynamic per-position quality degradation (horizontal, new TURBO_DECAY).

Old tokens get their QJL refinement bits zeroed in-place, reducing effective precision without changing the ggml tensor type. Tier boundaries shift dynamically with context length. Re-quantization piggybacks on SET_ROWS dispatch -- one memset per block per token.

Three presets: conservative (25/40/35), balanced (15/35/50), aggressive (5/25/70). Custom percentages via TURBO_DECAY=H,W,C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Task 1: Config struct + env var parsing
Task 2: ggml TURBO_DECAY op declaration
Task 3: CUDA/HIP/CPU decay kernel (signs zeroing)
Task 4: Wire into cpy_k/cpy_v graph dispatch
Task 5: PPL validation benchmarks
Task 6: Long-context stress test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- turbo3 signs field is 3rd centroid bit, not QJL. Zeroing reduces 8-level to 4-level reconstruction (corrected description)
- turbo4 has two compile-time variants: 4-bit (no QJL, skip decay) vs legacy 3-bit+QJL (zero rnorm). Both documented.
- Boundaries computed in position space, not cell index space (ring buffer ordering is non-monotonic)
- Multi-sequence policy: conservative (cold for ALL sequences). Phase 1 single-sequence only.
- VRAM claim removed (decay doesn't reduce VRAM; it improves quality-at-equal-VRAM)
- Added edge cases: SWA, seq_rm, defrag, serialization, idempotency
- Clarified graph execution model (ggml op, not side-effect)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
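To illustrate the first point, zeroing the 3rd centroid bit folds the 8 possible index values onto 4 (a toy illustration of the level collapse, not the branch's actual block layout):

```cpp
#include <cassert>

// Dropping bit 2 (the "signs" bit) collapses the 3-bit centroid index
// from 8 reconstruction levels to 4: indices 4..7 fold onto 0..3.
static int demote_idx(int idx) { return idx & 0b011; }
```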
Adds the configuration infrastructure for TurboQuant+ temporal decay:
- turbo_decay_config struct (public, in llama_kv_cache) with hot/warm/cold percentage splits and hot_promote flag
- parse_turbo_decay() static function reads TURBO_DECAY env var with named presets (conservative/balanced/aggressive) or custom h,w,c splits
- TURBO_DECAY_HOT_PROMOTE env var for phase 2 q8_0 promotion
- Per-layer last_cold_boundary tracking vector initialized in constructor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the GGML_OP_TURBO_DECAY enum value, ggml_turbo_decay() factory function, and name/symbol table entries. The op records a position range [cold_start, cold_end) in op_params; the backend kernel (Task 3) will zero the signs field of turbo3/turbo4 KV cache blocks in that range. GGML_OP_COUNT static_assert bumped to 98. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements GGML_OP_TURBO_DECAY backend kernels:
- CUDA/HIP: k_turbo3_decay zeros signs[] to collapse 3-bit to 2-bit reconstruction; k_turbo4_decay (legacy only) also zeros rnorm
- CPU: single-threaded memset fallback for both turbo3 and turbo4
- Properly gated behind TURBO4_USE_4BIT for turbo4 (4-bit variant has no signs field to zero)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After writing new KV entries via ggml_set_rows, append a ggml_turbo_decay node that demotes old positions past the cold boundary. Only fires for turbo3/turbo4 caches in single-sequence mode when TURBO_DECAY is enabled. The last_cold_boundary vector is made mutable so const cpy_k/cpy_v methods can track boundary advancement per layer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5-27B Q4_K_M, turbo3 KV, 7900 XTX, ctx=2048, 4 chunks:

PPL:
  no decay:   7.58 (baseline)
  balanced:   8.43 (+11.2%)
  aggressive: 9.81 (+29.4%)

Speed (pp128 / tg128):
  no decay: 528 / 19.07 t/s
  balanced: 530 / 18.62 t/s (within noise)

PPL regression is expected at short context (50-70% of 2K tokens get degraded). At longer contexts the cold tier covers older tokens with lower attention weight, so the quality impact decreases. Zero measurable speed overhead from decay op.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TurboQuant+ Temporal Decay -- Implemented
Pushed to
How it works
Three dynamic tiers with proportional boundaries:
Boundaries shift proportionally as context grows. Demotion is a single
Configuration
TURBO_DECAY=balanced     # 15/35/50 (default)
TURBO_DECAY=conservative # 25/40/35
TURBO_DECAY=aggressive # 5/25/70
TURBO_DECAY=10,30,60     # custom percentages
Benchmark (Qwen3.5-27B Q4_K_M, 7900 XTX)
PPL (ctx=2048, 4 chunks):
Speed: Zero measurable overhead (within noise). The PPL hit at 2K is expected -- at short context, the cold tier covers tokens the model still actively attends to. At longer contexts (16K+), the cold tier covers much older tokens with naturally lower attention weight, so the quality impact should decrease. This is the design tradeoff: sacrifice some short-context quality for better long-context compression efficiency.
Implementation (8 commits)
Phase 2 (hot tier q8_0 promotion) and Phase 3 (attention-guided decay) designed but deferred.
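The proportional boundary math described above can be sketched as follows (struct and function names are assumptions for illustration, not the branch's actual code):

```cpp
#include <cassert>

// Hot/warm/cold split in percent (sums to 100), e.g. balanced = 15/35/50.
struct decay_split { int hot, warm, cold; };

// First position demoted to the cold tier: everything older than the
// most recent (hot + warm)% of cached positions.
static int cold_boundary(decay_split s, int n_past) {
    int keep = n_past * (s.hot + s.warm) / 100;
    return n_past - keep;  // positions [0, boundary) are cold
}
```

With the balanced preset at 2048 cached positions, the oldest 1024 positions (50%) fall in the cold tier, and the boundary advances as the context grows.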
bf908c0 to a8a2f31
Tested on M5 Max 128GB (Metal) and M2 Pro 32GB (Metal). Thanks for the work here, some notes below.
What passes (head_dim=128 models)
PPL and speed are clean for standard models:
Build passes on Metal with no warnings.
Issue 1: TURBO_DECAY crashes on Metal
The temporal decay op has CUDA/HIP/CPU kernels but no Metal kernel. When
Repro:
TURBO_DECAY=balanced ./build/bin/llama-cli \
-m model.gguf -ngl 99 -fa on \
-ctk turbo3 -ctv turbo3 \
-p "Write a 500 word essay about the history of computing" -n 500
This is a crash, not a graceful fallback. The decay op needs either a Metal kernel or a
Issue 2: Boundary V modes 5/6/7 removed
The PR removes layer-adaptive modes 5, 6, and 7 (Boundary V) from the constructor. These were shipped in PR #30 and are part of our current feature set.
Suggestion
Consider splitting this into separate PRs:
Happy to re-review individual pieces. The ROCm work is great and would be straightforward to land on its own.
Split per your suggestion:
Re: modes 5/6/7 — we didn't remove them. Our branch was based on a commit before PR #30. The new PR #31 is rebased on your latest, which includes Boundary V, so modes 5/6/7 are intact.
Summary
Adds complete ROCm/HIP backend support for turbo3 KV cache quantization, enabling 4.6x KV compression on AMD GPUs.
Tested on: AMD Radeon RX 7900 XTX (gfx1100), ROCm 7.1, Qwen3.5-27B Q4_K_M
What's included
need_f16_K/V conversion path in launch_fattn.
Critical bugfix: K/V rotation before cache write
Found and fixed a quality issue: K and V vectors were being quantized to turbo3 without WHT rotation. The graph rotated Q (forward) and inverse-rotated the attention output, but the K/V write path in
cpy_k()/cpy_v() was missing the forward rotation step.
Before fix:
<accepts high pressure "Itself" - ount>& & nsp; a- xii'After fix:
7 x 8 = 56. This is one of the most common multiplication facts.
This fix is in
src/llama-kv-cache.cpp and may also benefit the Metal path if it has the same omission.
Performance
*Generation speed regression is from
AMD_SERIALIZE_KERNEL=3 requirement (see below).
Known limitation
Requires
AMD_SERIALIZE_KERNEL=3 environment variable. Without it, the multi-stream graph scheduler has a race condition between KV cache write and FA read ops. Per-op cudaStreamSynchronize is not sufficient. This is a ROCm runtime issue, not a turbo3 bug.
Files changed (24 files, +595 lines)
New files (5):
- dequantize-turbo.cuh - Device dequant (centroid LUT)
- vecdot-turbo.cuh - MMVQ vec_dot
- turbo-wht.cu/cuh - WHT butterfly kernel
- fattn-vec-instance-turbo3_0-turbo3_0.cu - FA template (D=256)
Modified (19): common.cuh, mmvq.cu, fattn-common.cuh, fattn-vec.cuh, fattn.cu, convert.cu, cpy-utils.cuh, cpy.cu, set-rows.cu, getrows.cu, ggml-cuda.cu, CMakeLists.txt (cuda+hip), ggml-turbo-quant.c, ggml-quants.h, ggml-cpu.c, quants.h, llama-kv-cache.cpp
Test plan
--cache-type-k turbo3 --cache-type-v turbo3