
feat: ROCm/HIP support for turbo3 KV cache (gfx1100/RDNA3) #5

Open

apollosenvy wants to merge 9 commits into TheTom:feature/turboquant-kv-cache from apollosenvy:rocm-turbo3-pr

Conversation

@apollosenvy

Summary

Adds complete ROCm/HIP backend support for turbo3 KV cache quantization, enabling 4.6x KV compression on AMD GPUs.

Tested on: AMD Radeon RX 7900 XTX (gfx1100), ROCm 7.1, Qwen3.5-27B Q4_K_M

What's included

  • GPU kernels: dequant, vec_dot, Walsh-Hadamard Transform, quantize, GET_ROWS, CPY
  • Flash attention: F16 dequant bridge (turbo3 -> F16 temp buffer -> F16 FA kernel). The native turbo3 FA template compiles but produces broken kernels on HIP/RDNA3, so we route through the existing need_f16_K/V conversion path in launch_fattn.
  • CPU fallback: vec_dot + type traits for graph scheduler fallback paths
  • All dispatch wiring: supports_op for MUL_MAT, GET_ROWS, SET_ROWS, CPY, FLASH_ATTN_EXT, TURBO_WHT

Critical bugfix: K/V rotation before cache write

Found and fixed a quality issue: K and V vectors were being quantized to turbo3 without WHT rotation. The graph rotated Q (forward) and inverse-rotated the attention output, but the K/V write path in cpy_k()/cpy_v() was missing the forward rotation step.

Before fix: <accepts high pressure "Itself" - ount>& & nsp; a- xii'
After fix: 7 x 8 = 56. This is one of the most common multiplication facts.

This fix is in src/llama-kv-cache.cpp and may also benefit the Metal path if it has the same omission.

Performance

| Metric | FP16 KV | turbo3 KV | Ratio |
| --- | --- | --- | --- |
| Prompt | 90 tok/s | 82 tok/s | 91% |
| Generation | 21 tok/s | 12 tok/s | 57%* |
| KV cache (32K ctx) | 2048 MB | 448 MB | 4.6x |

*Generation speed regression is from AMD_SERIALIZE_KERNEL=3 requirement (see below).

Known limitation

Requires AMD_SERIALIZE_KERNEL=3 environment variable. Without it, the multi-stream graph scheduler has a race condition between KV cache write and FA read ops. Per-op cudaStreamSynchronize is not sufficient. This is a ROCm runtime issue, not a turbo3 bug.

Files changed (24 files, +595 lines)

New files (5):

  • dequantize-turbo.cuh - Device dequant (centroid LUT)
  • vecdot-turbo.cuh - MMVQ vec_dot
  • turbo-wht.cu/cuh - WHT butterfly kernel
  • fattn-vec-instance-turbo3_0-turbo3_0.cu - FA template (D=256)

Modified (19): common.cuh, mmvq.cu, fattn-common.cuh, fattn-vec.cuh, fattn.cu, convert.cu, cpy-utils.cuh, cpy.cu, set-rows.cu, getrows.cu, ggml-cuda.cu, CMakeLists.txt (cuda+hip), ggml-turbo-quant.c, ggml-quants.h, ggml-cpu.c, quants.h, llama-kv-cache.cpp

Test plan

  • Standalone GPU kernel test (quantize/dequant round-trip)
  • Server starts with --cache-type-k turbo3 --cache-type-v turbo3
  • KV cache allocates at expected compression ratio
  • Inference produces coherent output
  • Math reasoning correct (7*8=56)
  • Perplexity benchmark vs upstream Metal numbers
  • Multi-turn conversation stability

seanrasch pushed a commit to seanrasch/llama-cpp-turboquant that referenced this pull request Mar 27, 2026
Complete experiment log:
  TheTom#1  4-mag LUT:           15.1 at 8K (BEST, +38%)
  TheTom#2  Batched extract:     13.7 (+25%)
  TheTom#3  Inline FA block:     13.5 (I-cache pressure)
  TheTom#4  Deferred norm:       12.9 (loses ILP)
  TheTom#5  2-pair half2:        12.0 (ternary overhead)
  TheTom#6  Select chain:        11.9 (branches kill)
  TheTom#7  Bit-arithmetic:      11.6 (ALU too heavy)
  TheTom#8  FMA branchless:      11.4 (ALU still too heavy)
  TheTom#9  Named-reg ternary:   10.3 (branches worst)
  TheTom#10 Main (8-LUT):        10.95 (baseline)
  TheTom#11 Non-vec FA:          10.2 (wrong kernel)
  Ceiling:                 24.5 (no dequant)

Apple8 hardware truth:
  1 divergent constant read < 7 ALU ops (even with fma)
  Branches cost MORE than divergent constant reads
  Array indexing ALWAYS spills on Metal
  4 constant addresses is the sweet spot

The 4-mag LUT is the dequant-level ceiling on Apple Silicon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
@TheTom
Owner

TheTom commented Mar 27, 2026

this is seriously impressive work, full ROCm backend in one PR. and thanks for finding the K/V rotation issue in cpy_k/cpy_v. our Metal path handles rotation differently (graph-side Q rotation + inverse rotation after attention, rather than rotating K/V before cache write), so it may not be the same bug, but i want to verify.

will test locally on M5 Max this afternoon to check for regressions before merging. would also appreciate it if you could run the PPL gate when you get a chance:

@TheTom
Owner

TheTom commented Mar 27, 2026

Test Results — M5 Max 128GB, Qwen3.5-35B-A3B Q8_0

Build: Clean compile on Metal.

⚠️ turbo3 — CRITICAL REGRESSION

PPL: 181.60 (expected: 6.18). Completely broken on Metal. Do not merge.

Root cause: The cpy_k()/cpy_v() changes add WHT rotation in the graph before cache write. On Metal, kernel_set_rows_turbo3 already performs WHT rotation during quantization (line ~503 in ggml-metal.metal). This causes K/V to be double-rotated, destroying quality.

The fix is correct for ROCm/HIP where SET_ROWS does not rotate, but it needs to be backend-gated:

// Only add graph-side rotation for backends where SET_ROWS doesn't rotate internally
if (k->type == GGML_TYPE_TURBO3_0 || k->type == GGML_TYPE_TURBO4_0) {
    if (/* backend is CUDA/HIP, not Metal */) {
        k_cur = ggml_turbo_wht(ctx, k_cur, 0);
    }
}

Or alternatively, remove the rotation from the Metal SET_ROWS kernel and do it in the graph for all backends (cleaner long-term, but requires Metal kernel changes).

ROCm code review (code-only, can't run on this hardware)

  • GPU kernels look structurally correct (dequant, vec_dot, WHT, quantize, GET_ROWS, CPY)
  • F16 dequant bridge for FA is a reasonable workaround for HIP FA template issues
  • CPU fallback additions are correct
  • AMD_SERIALIZE_KERNEL=3 requirement is documented

Recommendation

Do not merge as-is. Breaks Metal turbo3 (PPL 6.18 → 181.6). The rotation needs to be backend-gated, or the architecture needs to agree on where rotation happens (graph vs kernel). Happy to help work through the fix.

@apollosenvy
Author

Good catch on the double-rotation. Pushed a fix: cpy_k()/cpy_v() rotation is now wrapped in #if !defined(GGML_USE_METAL) so it only applies on CUDA/HIP/CPU where SET_ROWS doesn't rotate internally.

Running PPL gate on our end now (wikitext-2-raw, turbo3 vs FP16 baseline on Qwen3.5-27B Q4_K_M, gfx1100). Will post results when done.

Re: the longer-term architecture question (graph vs kernel rotation) — happy to discuss. For now the preprocessor gate is minimal and correct for both paths. If you prefer moving Metal's kernel rotation into the graph to unify the approach, we can help with that in a follow-up.

@apollosenvy
Author

PPL Gate Results

Model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled Q4_K_M
Hardware: AMD 7900 XTX (gfx1100), ROCm 7.1
Dataset: wikitext-2-raw-v1, ctx=2048
Build: With #if !defined(GGML_USE_METAL) backend gate applied

| KV Cache | PPL | vs FP16 | Compression |
| --- | --- | --- | --- |
| FP16 | 6.809 ± 0.046 | baseline | 1x |
| Q8_0 | 6.817 ± 0.046 | +0.1% | 2x |
| turbo3 | 8.771 ± 0.064 | +28.8% | 4.6x |

The +28.8% regression is higher than expected (~1% in your Metal benchmarks). This model is already Q4_K_M weight-quantized, so turbo3 KV quantization stacks on top of existing weight quantization loss. Would be useful to compare with an F16/BF16 weight model to isolate the KV-only contribution.

The Metal regression should be zero with the preprocessor gate since the graph-side rotation is now compiled out for Metal builds.

@apollosenvy
Author

Multi-model PPL Gate Update

Attempted turbo3 PPL tests on all available GGUFs. Results:

| Model | Head Dim | Weight Quant | FP16 PPL | turbo3 PPL | Status |
| --- | --- | --- | --- | --- | --- |
| Qwen3.5-27B-Distill | 256 | Q4_K_M | 6.81 | 8.77 (+29%) | WORKS |
| Mistral-Small-24B | 128 | Q4_K_S | 4.99 | - | Crash at init |
| GPT-OSS-20B | 128? | GGUF | 466.9 | 118714 | Runs but terrible |
| TinyLlama 1.1B | 64 | Q8_0 | 14.67 | - | Crash (head_dim < 128) |
| Phi-3 Mini | 96 | Q4 | 5.76 | - | Crash (head_dim < 128) |
| Qwen3-8B | 128 | F16 | 9.65 | - | Crash |

Finding: turbo3 currently only works reliably on Qwen3.5 (head_dim=256). Models with head_dim=128 crash during KV cache initialization. This is likely because the WHT rotation group size (128) matches head_dim exactly, causing edge cases in the graph-side cast/rotation code.

The Qwen3.5 result (PPL +29%) is higher than expected for turbo3, but the model is already Q4_K_M weight-quantized. An F16/BF16 weight model would show the true turbo3-only regression.

Will investigate the head_dim=128 compatibility issue next.

@apollosenvy
Author

Adaptive WHT + Multi-Head-Dim Support

Pushed 2c3bb08: fixes turbo3 crashes on all head_dim models.

What was broken

Three bugs prevented turbo3 from working on anything other than Qwen3.5 (head_dim=256):

  1. Dead dense rotation matrices (128x128 turbo_rotation/turbo_rotation_inv) from the abandoned graph-side matmul approach exhausted the ggml context object pool. Every model crashed at init.

  2. CPU dup handler didn't support turbo3 -> F16 cast. The scheduler routes some ops to CPU even with full GPU offload, hitting GGML_ABORT in the dup switch.

  3. WHT hardcoded to 128-element groups. Models with head_dim < 256 either crashed (128) or skipped rotation entirely (64, 96).

Fix: Adaptive Group Size

WHT group_size now computed as largest power of 2 dividing head_dim, capped at 128:

| head_dim | group_size | groups/head | butterfly stages |
| --- | --- | --- | --- |
| 256 | 128 | 2 | 7 (unchanged) |
| 128 | 128 | 1 | 7 |
| 96 | 32 | 3 | 5 |
| 64 | 64 | 1 | 6 |

Sign arrays truncated to first gs elements. Normalization: 1/sqrt(gs). All backends updated (CUDA/HIP, CPU). Metal unchanged (uses internal SET_ROWS rotation).

Benchmark Results (7900 XTX, ROCm 7.1)

| Model | Head Dim | Status | pp128 t/s | tg32 t/s |
| --- | --- | --- | --- | --- |
| Qwen3.5-27B Q4_K_M | 256 | No regression | 662 | 24.7 |
| Mistral-Small-24B Q4_K_S | 128 | Fixed (was crash) | 600 | 24.2 |
| TinyLlama-1.1B Q8_0 | 64 | Fixed (was crash) | 745 | 130 |

PPL Investigation (head_dim=128)

Mistral-Small PPL with turbo3 is catastrophic (~15000 vs 5.16 F16 baseline). Tested both with and without WHT rotation -- PPL is nearly identical either way, confirming the issue is the 3-bit centroid quantization itself, not the rotation. The centroids may need model-family-specific calibration, or head_dim=128 models may need 4-bit (turbo4) for acceptable quality.

Q8_0 KV cache on the same model gives PPL 5.17 (matching F16), so the model's attention values are perfectly quantizable -- just not at 3 bits.

Next: investigating whether turbo4 (4-bit) resolves the quality gap for head_dim=128 models.

@TheTom
Owner

TheTom commented Mar 28, 2026

thanks for the thorough work here, especially the adaptive WHT fix and tracking down the dead rotation matrices exhausting the context pool. the multi-model compatibility sweep is exactly the kind of testing that catches real issues.

i'm currently tied up with some other validation runs but will get to testing and review once my bench frees up. the head_dim=128 PPL being catastrophic at turbo3 lines up with what buun's been seeing on the CUDA side too — turbo4 path for head_dim=128 is likely the right direction.

will follow up once i can give this a proper look.

@apollosenvy
Author

Root Cause: PolarQuant Reconstruction Model vs Uniform Quantization

Dug deeper into the head_dim=128 PPL catastrophe. It's NOT a bit depth problem:

| KV Type | Bits/val | Mistral-Small PPL | Reconstruction Model |
| --- | --- | --- | --- |
| F16 | 16 | 5.16 | exact |
| q8_0 | 8 | 5.17 | scale * idx |
| q5_0 | 5 | 5.18 | scale * idx + offset |
| q4_0 | 4 | 5.26 | scale * idx |
| q4_1 | 4.5 | 5.21 | scale * idx + min |
| turbo4 | 4.25 | 5574 | norm * centroid[idx] |
| turbo3 | 3.5 | 15518 | norm * centroid[idx] |
turbo4 has MORE bits than q4_0 but 1000x worse PPL. The issue is PolarQuant's reconstruction model: value = norm * centroid[idx]. This assumes all elements in a block share the same scale (the block L2 norm) and the centroids are at fixed positions. Uniform quantization (q4_0) adapts its scale and zero point per block.

Tried recalibrating centroids for the actual unit-sphere distribution (d=32) -- the current centroids are tuned for N(0,1/128) which has 2x narrower range than the per-block-normalized distribution. Wider centroids didn't help (PPL got worse, possibly due to CPU/GPU centroid mismatch on layer 0).

The fundamental issue: Mistral-Small's attention heads have non-uniform value distributions where the PolarQuant assumption (values are roughly equi-distributed around the block centroid) breaks down. Some dimensions carry disproportionate signal.

Possible fixes:

  1. Hybrid quantization: Use turbo3/turbo4 for Qwen-family models (where it works), fall back to q4_0/q4_1 for others
  2. Adaptive reconstruction: Replace norm * centroid[idx] with scale * centroid[idx] + offset (adds 2 bytes per block but adapts like q4)
  3. Per-head calibration: Different centroid sets per attention head

Also pushed turbo4 CUDA/HIP implementation (works on Qwen3.5-27B, PPL 2.28) and corrected centroids analysis to the branch. Holding push until you review.

@TheTom
Owner

TheTom commented Mar 28, 2026

Two things here:

On the adaptive WHT fix (2c3bb08) — really solid work. The dead rotation matrices exhausting the context pool is something I’d hit as well but hadn’t pushed a fix for on CUDA/HIP yet. CPU dup handler and adaptive group sizing are nice catches. Good to see hd64/96/128 actually running.

On the PolarQuant reconstruction analysis — I’d be careful about calling it a fundamental issue just yet. There are still pipeline-level bugs that can produce exactly the kind of catastrophic PPL you’re seeing.

Two data points from my side:

  1. On Qwen3.5-27B, your turbo3 is +28.8% vs q8_0. My runs on the same model family are ~+1%. That’s a huge gap on a model where turbo is known to behave well, which suggests something upstream of reconstruction is off.

  2. buun is seeing turbo4 at ~+1.2% vs q8_0 on head_dim=128 models on CUDA. So the hd128 gap seems real but small (1–3%), not catastrophic.

I just finished bringing turbo4 back up on Metal and ended up finding several bugs that massively distorted PPL (one SET_ROWS packing issue alone took it from ~679 down to ~6.1). So I’d focus on ruling out pipeline issues first.

A few specific things I’d check:

  • SET_ROWS writing the correct turbo4 block layout (shared turbo3/turbo4 paths can easily corrupt data)
  • CPU/GPU centroid mismatch (especially if layer 0 falls back to CPU)
  • turbo4 pre-rotation guard (buun flagged a missing check for TURBO4_0)

Also, your turbo4 CUDA/HIP result on Qwen (PPL ~2.28) is actually a strong signal that the reconstruction model itself is sound when the pipeline is correct. The interesting question is what differs between that path and the Mistral one beyond head_dim.

Happy to take a look once you’ve had a chance to cross-check — no rush.

@TheTom
Owner

TheTom commented Mar 29, 2026

Please sync to TOT and I'll review and test.

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@apollosenvy
Author

Synced to TOT — HIP port on rocm-turbo3-v2

Rebased onto your latest feature/turboquant-kv-cache (172fc85). Clean single commit: HIP/ROCm porting for your warp-cooperative turbo3 kernel + turbo2/turbo3 FA templates.

What was needed for HIP

  1. Vendor header (hip.h): Added cudaMemcpyToSymbol/FromSymbol mappings. Fixed __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support 3-arg calls (CUDA defaults width to warpSize, old HIP macros required 4 args). Added __ballot_sync -> __ballot with uint32_t cast.

  2. HIP CMakeLists: Added turbo3 and turbo2 FA template instances (turbo3-turbo3, turbo3-q8_0, q8_0-turbo3, turbo2-turbo2, turbo2-q8_0, q8_0-turbo2).

PPL Confirmed

You were right about the pipeline bugs. With the CPU quantize stub fix (53f1298):

| Model | KV Type | PPL | vs F16 |
| --- | --- | --- | --- |
| Mistral-Small-24B (hd128) | F16 | 5.16 | baseline |
| Mistral-Small-24B (hd128) | turbo3 | 5.28 | +2.4% |
| Mistral-Small-24B (hd128) | q4_0 | 5.26 | +1.9% |

PolarQuant IS sound. The catastrophic PPL was 100% the CPU quantize stub zeroing qs/signs on layer 0. My earlier "fundamental limitation" analysis was wrong — thanks for pushing back on that.

Branch is rocm-turbo3-v2 on fork. Ready for review.

Tuklus and others added 8 commits March 28, 2026 22:12
Two-dimensional KV cache compression: static per-layer type selection
(vertical, existing TURBO_LAYER_ADAPTIVE) combined with dynamic
per-position quality degradation (horizontal, new TURBO_DECAY).

Old tokens get their QJL refinement bits zeroed in-place, reducing
effective precision without changing the ggml tensor type. Tier
boundaries shift dynamically with context length. Re-quantization
piggybacks on SET_ROWS dispatch -- one memset per block per token.

Three presets: conservative (25/40/35), balanced (15/35/50),
aggressive (5/25/70). Custom percentages via TURBO_DECAY=H,W,C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Task 1: Config struct + env var parsing
Task 2: ggml TURBO_DECAY op declaration
Task 3: CUDA/HIP/CPU decay kernel (signs zeroing)
Task 4: Wire into cpy_k/cpy_v graph dispatch
Task 5: PPL validation benchmarks
Task 6: Long-context stress test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- turbo3 signs field is 3rd centroid bit, not QJL. Zeroing reduces
  8-level to 4-level reconstruction (corrected description)
- turbo4 has two compile-time variants: 4-bit (no QJL, skip decay)
  vs legacy 3-bit+QJL (zero rnorm). Both documented.
- Boundaries computed in position space, not cell index space
  (ring buffer ordering is non-monotonic)
- Multi-sequence policy: conservative (cold for ALL sequences)
  Phase 1 single-sequence only.
- VRAM claim removed (decay doesn't reduce VRAM, it improves
  quality-at-equal-VRAM)
- Added edge cases: SWA, seq_rm, defrag, serialization, idempotency
- Clarified graph execution model (ggml op, not side-effect)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the configuration infrastructure for TurboQuant+ temporal decay:
- turbo_decay_config struct (public, in llama_kv_cache) with hot/warm/cold
  percentage splits and hot_promote flag
- parse_turbo_decay() static function reads TURBO_DECAY env var with named
  presets (conservative/balanced/aggressive) or custom h,w,c splits
- TURBO_DECAY_HOT_PROMOTE env var for phase 2 q8_0 promotion
- Per-layer last_cold_boundary tracking vector initialized in constructor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds the GGML_OP_TURBO_DECAY enum value, ggml_turbo_decay() factory
function, and name/symbol table entries. The op records a position
range [cold_start, cold_end) in op_params; the backend kernel (Task 3)
will zero the signs field of turbo3/turbo4 KV cache blocks in that
range. GGML_OP_COUNT static_assert bumped to 98.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements GGML_OP_TURBO_DECAY backend kernels:
- CUDA/HIP: k_turbo3_decay zeros signs[] to collapse 3-bit to 2-bit
  reconstruction; k_turbo4_decay (legacy only) also zeros rnorm
- CPU: single-threaded memset fallback for both turbo3 and turbo4
- Properly gated behind TURBO4_USE_4BIT for turbo4 (4-bit variant
  has no signs field to zero)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After writing new KV entries via ggml_set_rows, append a
ggml_turbo_decay node that demotes old positions past the cold
boundary. Only fires for turbo3/turbo4 caches in single-sequence
mode when TURBO_DECAY is enabled. The last_cold_boundary vector
is made mutable so const cpy_k/cpy_v methods can track boundary
advancement per layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Qwen3.5-27B Q4_K_M, turbo3 KV, 7900 XTX, ctx=2048, 4 chunks:

PPL:
  no decay:   7.58 (baseline)
  balanced:   8.43 (+11.2%)
  aggressive: 9.81 (+29.4%)

Speed (pp128 / tg128):
  no decay:   528 / 19.07 t/s
  balanced:   530 / 18.62 t/s (within noise)

PPL regression is expected at short context (50-70% of 2K tokens
get degraded). At longer contexts the cold tier covers older tokens
with lower attention weight, so the quality impact decreases.

Zero measurable speed overhead from decay op.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@apollosenvy
Author

TurboQuant+ Temporal Decay -- Implemented

Pushed to rocm-turbo3-v2: position-aware KV cache quality degradation. Old tokens get their signs field zeroed in-place, reducing turbo3 from 8-level to 4-level centroid reconstruction.

How it works

Three dynamic tiers with proportional boundaries:

  • Hot (newest 15%): full quality
  • Warm (middle 35%): base turbo3 quality
  • Cold (oldest 50%): signs zeroed, 4-level reconstruction

Boundaries shift proportionally as context grows. Demotion is a single ggml_turbo_decay op appended to the graph after SET_ROWS. One kernel launch per layer per boundary advance.

Configuration

TURBO_DECAY=balanced          # 15/35/50 (default)
TURBO_DECAY=conservative      # 25/40/35
TURBO_DECAY=aggressive        # 5/25/70
TURBO_DECAY=10,30,60          # custom percentages

Benchmark (Qwen3.5-27B Q4_K_M, 7900 XTX)

PPL (ctx=2048, 4 chunks):

| Mode | PPL | vs No-Decay |
| --- | --- | --- |
| off | 7.58 | baseline |
| balanced | 8.43 | +11.2% |
| aggressive | 9.81 | +29.4% |

Speed: Zero measurable overhead (within noise).

The PPL hit at 2K is expected -- at short context, the cold tier covers tokens the model still actively attends to. At longer contexts (16K+), the cold tier covers much older tokens with naturally lower attention weight, so the quality impact should decrease. This is the design tradeoff: sacrifice some short-context quality for better long-context compression efficiency.

Implementation (8 commits)

  1. Design spec + reviewer feedback fixes
  2. Implementation plan
  3. Config struct + env var parsing
  4. GGML_OP_TURBO_DECAY op
  5. CUDA/HIP/CPU decay kernels
  6. Wire into cpy_k/cpy_v graph
  7. PPL validation
  8. Push

Phase 2 (hot tier q8_0 promotion) and Phase 3 (attention-guided decay) designed but deferred.

@github-actions bot added labels: documentation, Nvidia GPU, ggml (Mar 29, 2026)
@TheTom
Owner

TheTom commented Mar 30, 2026

Tested on M5 Max 128GB (Metal) and M2 Pro 32GB (Metal). Thanks for the work here, some notes below.

What passes (head_dim=128 models)

PPL and speed are clean for standard models:

| Config | PPL (8-chunk) | pp512 t/s | tg128 t/s | Known baseline |
| --- | --- | --- | --- | --- |
| turbo3 | 6.1756 | 2728 | 75.14 | 6.1756 (match) |
| turbo4 | 6.1250 | - | - | 6.1250 (match) |

Build passes on Metal with no warnings.

Issue 1: TURBO_DECAY crashes on Metal

The temporal decay op has CUDA/HIP/CPU kernels but no Metal kernel. When TURBO_DECAY=balanced is set with a GPU-offloaded model, it aborts:

pre-allocated tensor (cache_k_l3 (view) (view)) in a buffer (MTL0) that cannot run the operation (TURBO_DECAY)

Repro:

TURBO_DECAY=balanced ./build/bin/llama-cli \
  -m model.gguf -ngl 99 -fa on \
  -ctk turbo3 -ctv turbo3 \
  -p "Write a 500 word essay about the history of computing" -n 500

This is a crash, not a graceful fallback. The decay op needs either a Metal kernel or a supports_op guard that prevents dispatch to the Metal backend. As-is, any Metal user who sets the env var will hit an abort.

Issue 2: Boundary V modes 5/6/7 removed

The PR removes layer-adaptive modes 5, 6, and 7 (Boundary V) from the constructor. These were shipped in PR #30 and are part of our current feature set. TURBO_LAYER_ADAPTIVE=5 no longer produces a "Boundary V" log line, confirming the code path is gone.

Suggestion

Consider splitting this into separate PRs:

  1. ROCm/HIP kernel support (the core contribution, looks solid)
  2. Temporal decay (experimental feature, needs Metal kernel + more validation)
  3. Non-128 head_dim q8_0 fallback (architectural decision worth discussing separately)

Happy to re-review individual pieces. The ROCm work is great and would be straightforward to land on its own.

@apollosenvy
Author

Split per your suggestion:

  • PR #31 (feat: HIP/ROCm support for turbo3/turbo2 (7900 XTX)): ROCm/HIP port only. Single commit, minimal. Rebased on your latest (includes Boundary V). Also excludes D>=576 fattn-tile instances that exceed HIP's 64KB local memory limit.

  • Temporal decay — Will open as separate PR once Metal kernel or supports_op guard is added. The decay op shouldn't dispatch to backends that don't implement it.

  • head_dim fallback — Your commit 4c4511c14 already handles this. No separate PR needed from our side.

Re: modes 5/6/7 — we didn't remove them. Our branch was based on a commit before PR #30. The new PR #31 is rebased on your latest which includes Boundary V, so modes 5/6/7 are intact.

TheTom added a commit that referenced this pull request Apr 2, 2026