Skip to content

perf: TQ4_1S native kernel 3.5× faster — 240 t/s, less VRAM than q8_0 conversion#57

Merged
TheTom merged 5 commits into
TheTom:feature/turboquant-kv-cachefrom
signalnine:pr/tq4-weight-opts
Apr 7, 2026
Merged

perf: TQ4_1S native kernel 3.5× faster — 240 t/s, less VRAM than q8_0 conversion#57
TheTom merged 5 commits into
TheTom:feature/turboquant-kv-cachefrom
signalnine:pr/tq4-weight-opts

Conversation

@signalnine

@signalnine signalnine commented Apr 6, 2026

Copy link
Copy Markdown

Summary

Autoresearch-discovered optimizations make the native TQ4_1S weight kernel 3.5× faster (68→240 t/s) and 36% faster than the q8_0 load-time conversion (176 t/s) while using 1.7× less VRAM (4.5 vs 7.5 GiB).

The load-time conversion is now opt-in (GGML_TQ_CONVERT_Q8=1). The native kernel is strictly better on both speed and VRAM.

Benchmarks (RTX 5090, Qwen2.5-7B TQ4_1S)

Speed

Path t/s VRAM
Upstream V12 runtime 67 4.5 GiB
q8_0 load-time conversion 176 7.5 GiB
Native optimized (this PR) 240 4.5 GiB

Quality (vs f16 baseline PPL 7.18)

Metric Native TQ4_1S q8_0 conversion q4_0
PPL 7.54 7.55 7.85
Mean KLD 0.056 0.057 0.078
Coherence 4/4 ✅
NIAH 5/5 ✅

Key optimizations (86 automated experiments, 15 kept)

  1. fp16 activation buffer — halves activation bandwidth (the bottleneck)
  2. Shared-memory centroid LUT — eliminates constant memory serialization on divergent lane access (+89%)
  3. Half2 arithmetic + strided block processing — 2× arithmetic density
  4. Vectorized 128-bit loads — uint32×4 weights, int4 activations (+45%)
  5. Register __byte_perm centroid decode — zero-memory centroid lookup
  6. NWARPS 8→4

Also includes

  • Load-time q8_0 conversion now opt-in (was default)
  • Autoresearch harness with coherence testing (server API + factual Q&A gate)

Test plan

  • Speed: 240 t/s (+253% vs upstream, +36% vs q8_0 conversion)
  • PPL: 7.54 (matches q8_0 conversion quality)
  • KLD: 0.056 (lower than q8_0 conversion and q4_0)
  • NIAH: 5/5 (Kamradt varied filler)
  • Coherence: 4/4 (Paris, 4, print, Shakespeare)
  • Ampere (sm_86) testing
  • TQ3_1S regression check

…AM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7× LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer — halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT — eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing — 2× arithmetic density
4. Vectorized 128-bit loads — uint32×4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode — zero-memory centroid lookup
6. NWARPS 8→4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  ← this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)
@signalnine

Copy link
Copy Markdown
Author

TQ3_1S Regression Check — PASS

Metric TQ3_1S TQ4_1S
Speed 102 t/s 240 t/s
PPL 7.43 7.54
Coherence 3/3 ✅ 4/4 ✅
Size 3.71 GiB 4.53 GiB

TQ3_1S uses the unchanged V8 scalar fallback path — no regression. The optimizations (shmem LUT, half2, __byte_perm, vectorized loads) only apply to the TQ4_1S dp4a kernel.

@signalnine signalnine marked this pull request as ready for review April 6, 2026 17:23
@TheTom

TheTom commented Apr 6, 2026

Copy link
Copy Markdown
Owner

Regression Test Results

Mac (M5 Max 128GB, Metal)

Build: clean, zero errors. Qwen2.5-1.5B Q4_K_M:

Config pp512 tg128
Baseline (f16 KV) 10,095 276.3
q8_0/turbo4 KV 9,666 172.9

No regressions. Metal path unaffected by CUDA changes.

Windows (GTX 1080 Ti 11GB, CUDA 12.6, SM 6.1)

Build: clean, zero errors. Qwen2.5-7B-Instruct:

Config pp512 tg128 Previous Delta
Q8_0 baseline 1,145 36.4 950 / 36.1 +20% prefill
Q8_0 + turbo4 KV 1,128 36.3 — / 36.2 match
Config I (TQ4_1S) 587 51.0 956 / 38.4 +33% decode
Config I + turbo4 KV 583 51.4 1110 / 39.0 +32% decode

Config I decode improvement (38.4 → 51.0 tok/s) is the native TQ4_1S kernel optimization showing up on Pascal. Prefill regression on Config I (956 → 587) needs investigation — may be related to the load-time conversion change (now opt-in).

All 4 configs produce valid output. No crashes, no build warnings from the PR's files.

@signalnine

Copy link
Copy Markdown
Author

Prefill regression confirmed — root cause identified

The prefill regression (956→587 on Pascal, 13K→6.5K on 5090) is from disabling the load-time q8_0 conversion. Native TQ4_1S prefill falls through to cuBLAS dequant-to-f16, which requires a full inverse WHT per element — much slower than the q8_0 cuBLAS path that only needs standard q8_0 dequant.

Path PP512 TG128 VRAM
Native TQ4_1S 6,479 226 4.5 GiB
q8_0 conversion 13,312 176 7.5 GiB

The tradeoff is clear: native wins decode (+29%), q8_0 wins prefill (+105%). Need a hybrid that uses native for decode and q8_0 for prefill — or accept the prefill cost for the VRAM savings.

Options:

  1. Default to conversion (old behavior) — best prefill, OK decode, more VRAM
  2. Default to native (current PR) — best decode, poor prefill, less VRAM
  3. Hybrid: convert at load time but also keep native kernel for decode — best of both but 1.7× VRAM (stores both q8_0 AND TQ4_1S... defeats the purpose)
  4. User choice via env var — GGML_TQ_NATIVE=1 for decode-heavy workloads

For now I'll switch the default back to conversion (option 1) since prefill is more commonly the bottleneck, and document the native opt-in for decode-heavy use cases. The +33% decode on Pascal and +29% on 5090 is real and valuable for interactive chat with long context.

@TheTom

TheTom commented Apr 6, 2026

Copy link
Copy Markdown
Owner

Testing on my AMD now for posterity, Asking a community member to verify with his CUDA.

@TheTom

TheTom commented Apr 6, 2026

Copy link
Copy Markdown
Owner

Prefill regression confirmed — root cause identified

The prefill regression (956→587 on Pascal, 13K→6.5K on 5090) is from disabling the load-time q8_0 conversion. Native TQ4_1S prefill falls through to cuBLAS dequant-to-f16, which requires a full inverse WHT per element — much slower than the q8_0 cuBLAS path that only needs standard q8_0 dequant.

Path PP512 TG128 VRAM
Native TQ4_1S 6,479 226 4.5 GiB
q8_0 conversion 13,312 176 7.5 GiB
The tradeoff is clear: native wins decode (+29%), q8_0 wins prefill (+105%). Need a hybrid that uses native for decode and q8_0 for prefill — or accept the prefill cost for the VRAM savings.

Options:

  1. Default to conversion (old behavior) — best prefill, OK decode, more VRAM
  2. Default to native (current PR) — best decode, poor prefill, less VRAM
  3. Hybrid: convert at load time but also keep native kernel for decode — best of both but 1.7× VRAM (stores both q8_0 AND TQ4_1S... defeats the purpose)
  4. User choice via env var — GGML_TQ_NATIVE=1 for decode-heavy workloads

For now I'll switch the default back to conversion (option 1) since prefill is more commonly the bottleneck, and document the native opt-in for decode-heavy use cases. The +33% decode on Pascal and +29% on 5090 is real and valuable for interactive chat with long context.

good analysis on the prefill tradeoff. option 1 is the right call. defaulting to conversion keeps existing behavior, and documenting native opt-in for decode-heavy workloads gives users the choice without surprising anyone.

found a compile issue on AMD/ROCm will patch with HIP guards. wanted to flag it before merge.

+33% decode on pascal is a great result. the autoresearch harness is proving its worth.

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part — the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
@TheTom

TheTom commented Apr 6, 2026

Copy link
Copy Markdown
Owner

heads up from a community tester — turbo-innerq.cuh has a build issue on static builds (BUILD_SHARED_LIBS=OFF). The TURBO_IQ_API macro assumes GGML_BACKEND_BUILD is defined, but it's only set for shared builds. Fix: wrap the block in #ifdef GGML_BACKEND_SHARED matching the pattern from ggml-backend.h. For static builds TURBO_IQ_API should be empty (no dllexport/dllimport).

@signalnine

Copy link
Copy Markdown
Author

Static build fix + multi-token kernel update

Fixed the TURBO_IQ_API static build issue — wrapped in #ifdef GGML_BACKEND_SHARED matching the GGML_BACKEND_API pattern from ggml-backend.h. For BUILD_SHARED_LIBS=OFF, TURBO_IQ_API is now empty (no dllexport/dllimport).

Also added a multi-token TQ4_1S dp4a kernel for ne[1] ≤ 8 (speculative decoding, small batches). Loads weight data once per block, reuses across all tokens. For ne[1] > 8, native path now does runtime TQ4_1S→fp16 dequant + cuBLAS instead of the old per-element cuBLAS fallback.

Default remains load-time q8_0 conversion (option 1, as agreed). GGML_TQ_NATIVE=1 opts into native decode for VRAM savings.

Config pp512 tg128 VRAM
Default (q8_0 conversion) 13,291 175 7.5 GiB
GGML_TQ_NATIVE=1 5,492 226 4.5 GiB

Re: HIP — happy to coordinate on the guards once your patch is ready.

@TheTom

TheTom commented Apr 6, 2026

Copy link
Copy Markdown
Owner

Community testing: dual RTX 4090 results

Tested on dual 4090s with layer split. Found a multi-GPU bug and fix.

Results

Metric Native TQ4_1S q8_0 Conversion Delta
Prefill pp32768 2,318 t/s 3,513 t/s q8_0 wins +51.5%
Decode tg128 38.55 t/s 32.05 t/s Native wins +20.3%

Multi-GPU bug

Native TQ4_1S kernel crashes during decode on multi-GPU (dual 4090s). Prefill works fine (goes through cuBLAS mat_mul). Decode hits the native kernel which uses global static CUDA device pointers (static half * d_act_buf). cudaMalloc allocates on the current device, so:

  1. First call on GPU 0 allocates buffers on GPU 0
  2. Next call on GPU 1 tries to use GPU 0's buffers from GPU 1 — illegal memory access

q8_0 conversion path works for both prefill and decode because it uses standard q8_0 mul_mat_vec.

Fix

Replace static global buffers with per-device pool allocations from ctx.pool() via ggml_cuda_pool_alloc, matching the pattern in mmvq.cu. Changes in mmvq-tq.cu:

  • Remove static d_act_buf / d_q8_1_buf globals
  • Use ggml_cuda_pool_alloc for activation buffers (per-call, per-device)
  • Update kernel launch calls to use q8_1_buf.ptr / act_buf.ptr instead of static pointers

This makes the buffers per-device and per-call, fixing multi-GPU decode.

@signalnine

Copy link
Copy Markdown
Author

Multi-GPU fix applied

Replaced static global d_act_buf / d_q8_1_buf with per-device ggml_cuda_pool_alloc from ctx.pool(id), matching the pattern in mmvq.cu. Buffers are now per-call and per-device — fixes the illegal memory access on dual GPU decode.

No performance regression on single GPU (pp512=5,474, tg128=225 with GGML_TQ_NATIVE=1).

- Multi-token dp4a kernel for ne[1]≤8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1S→fp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
@TheTom

TheTom commented Apr 6, 2026

Copy link
Copy Markdown
Owner

AMD HIP/ROCm testing: RX 9070 XT (RDNA 4, gfx1201)

Tested by community member on Windows 11, HIP SDK 7.1, Qwen2.5-1.5B-Instruct.

Build fix required

__dp4a is NVIDIA-only. Needs to use the vendor-abstracted ggml_cuda_dp4a wrapper from common.cuh (8 replacements in mmvq-tq.cu). Diff:

-        const int s0 = __dp4a(c0, a0, 0);
+        const int s0 = ggml_cuda_dp4a(c0, a0, 0);

Regression results (Qwen2.5-1.5B, RX 9070 XT)

Config Size PPL Prefill pp512 Decode tg128 vs Q8_0
Q8_0 1.76 GiB 11.119 6,917 t/s 167 t/s baseline
Config I (TQ4_1S) 1.28 GiB 11.316 4,733 t/s 101 t/s tg 60%

PPL: No regression (+1.8% vs Q8_0).

Regression vs previous kernel (PR #45 V12)

Kernel Decode Prefill TG vs Q8_0
PR 45 (V12 float) 135 t/s 7,134 t/s 130%
PR 57 (dp4a int8) 101 t/s 4,733 t/s 60%

PR 57's dp4a path is optimized for NVIDIA tensor cores (Ampere+/Blackwell) where it achieves 240 t/s (+3.5x). On RDNA 4, the sudot4 intrinsic has different throughput characteristics and the int8 activation path adds conversion overhead, resulting in a regression vs the V12 float kernel.

May need an architecture-specific fallback to V12 for AMD GPUs, or an #ifdef to select the better path per vendor.

@signalnine

Copy link
Copy Markdown
Author

Fixed — replaced all 8 __dp4a calls with ggml_cuda_dp4a from common.cuh. Pushed.

Re: RDNA4 regression — the dp4a/int8 path was designed around NVIDIA's dp4a throughput (Turing+). On RDNA4, sudot4 has different throughput characteristics and the q8_1 activation quantization adds overhead that the float path doesn't have. Two options:

  1. Arch dispatch: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time, route AMD to the V12 float kernel (pre-rotate to half + shmem centroid LUT), keep dp4a for NVIDIA
  2. Runtime heuristic: benchmark both paths on first call, cache the winner

Option 1 is simpler and probably sufficient. Happy to add the AMD fallback path if you want it in this PR, or we can split it out.

@TheTom

TheTom commented Apr 7, 2026

Copy link
Copy Markdown
Owner

RDNA4 arch dispatch fix — cherry-pick ready

Branch: fix/rdna4-arch-dispatch
Commit: 96e211d

What it does

Routes AMD GPUs to a scalar half-precision TQ4_1S kernel instead of the dp4a int8 path. NVIDIA continues using dp4a.

  • New kernel: mul_mat_tq4_1s_scalar_multi — pre-rotated half activations, shmem centroid LUT (16 entries), scalar dot product. Same pattern as TQ3_1S, no dp4a/__byte_perm.
  • Dispatch: use_dp4a = !GGML_CUDA_CC_IS_AMD(cc) && TQ4_1S. AMD falls through to the scalar path alongside TQ3_1S.
  • LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch to avoid duplicating the switch statements.

Expected RDNA4 result

Restore V12-level decode (~135 t/s, 130% of Q8_0) instead of dp4a regression (101 t/s, 60% of Q8_0).

Verified

  • Metal build: clean, zero errors (mmvq-tq.cu is CUDA-only, no Metal impact)
  • One file changed: ggml/src/ggml-cuda/mmvq-tq.cu (+98, -20)

Cherry-pick

git fetch https://github.com/TheTom/llama-cpp-turboquant.git fix/rdna4-arch-dispatch
git cherry-pick 96e211d

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@signalnine

Copy link
Copy Markdown
Author

Cherry-picked 96e211d — builds clean, no NVIDIA regression (pp512=5,140, tg128=225). Pushed.

Clean approach with the LAUNCH_SCALAR macro. Looking forward to the RDNA4 retest numbers.

@TheTom

TheTom commented Apr 7, 2026

Copy link
Copy Markdown
Owner

Thanks I'll test it out on my local PC.

@TheTom

TheTom commented Apr 7, 2026

Copy link
Copy Markdown
Owner

RDNA4 retest — RX 9070 XT (gfx1201, HIP SDK 7.1) — ALL PASS

With the arch dispatch fix (commit 96e211d cherry-picked by @signalnine):

Speed (Qwen2.5-1.5B-Instruct, r=10)

Config Size Prefill pp512 Decode tg128 vs Q8_0
Base Q8_0 1.76 GiB 12,412 140 baseline
PR57 Q8_0 1.76 GiB 12,365 156 no regression
PR57 Config I (TQ4_1S) 1.28 GiB 8,922 166 +19%

Before vs after fix

Metric dp4a path (before) Scalar half path (after)
Config I decode 101 t/s (60% of Q8_0) 166 t/s (119% of Q8_0)

PPL (wikitext-2-raw, 512 context, 20 chunks)

Config PPL
Q8_0 11.119
Config I 11.337 (+2.0%)

PPL: PASS — no regression.

Output quality

Prompt: "What is the capital of France? Answer briefly."
Output: "The capital of France is Paris."

PASS — coherent, correct.

Note for AMD testers

RDNA 4 has severe cold-start JIT variance. With r=3 or r=5, the first run drags the average down to 24-34 t/s. Always use r=10 on this GPU to get stable warm-state numbers.

@TheTom TheTom merged commit a4e8af4 into TheTom:feature/turboquant-kv-cache Apr 7, 2026
12 of 47 checks passed
@seanrasch

Copy link
Copy Markdown

Ampere (RTX 3080 Ti, SM 86) — native kernel parity with q8_0 convert

Tested this PR (pr/tq4-weight-opts @ 686220987) on RTX 3080 Ti. The 36% speedup you saw on RTX 5090 does not transfer to Ampere — native and q8_0 convert are within noise on SM 86.

Setup

  • Hardware: RTX 3080 Ti (SM 86, 12GB, ~760 GB/s bandwidth)
  • Model: Qwen2.5-7B-Instruct-TQ4_1S (quantized locally from bartowski f16)
  • Build: cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON
  • Tool: llama-bench -p 512 -n 128 -r 3, single GPU via CUDA_VISIBLE_DEVICES=1

Results

Path pp512 (t/s) tg128 (t/s)
Native (this PR) 4704 ± 186 94.05 ± 0.15
q8_0 convert (GGML_TQ_CONVERT_Q8=1) 4664 ± 183 93.98 ± 0.21

Delta: +0.9% pp512, 0% tg128 — both within run-to-run variance.

Interpretation

On 5090 you saw native beat convert 240 → 176 t/s (+36%). On 3080 Ti they're identical. The PR's kernel optimizations (fp16 activation buffer, SMEM centroid LUT, half2 strided blocks, vectorized 128-bit loads, __byte_perm centroid decode, NWARPS 8→4) appear to be Ada/Blackwell-specific wins. On Ampere we're bottlenecked elsewhere — likely memory bandwidth for weight loading, or insufficient SMEM/L2 to realize the LUT gains.

Not a regression — native is at least as fast as convert on Ampere, and the lower-VRAM advantage still holds. Just not the headline 3.5x. The case for merging still stands on Ampere: equal speed, lower memory, simpler code path.

Cross-datapoint on the same build, Qwen3.5-9B TQ4_1S (48 layers, 8:1 GQA): convert actually edges native by ~5% pp512 (3776 vs 3592) — consistent with "native optimizations don't help on Ampere", and may hint that deeper models expose some overhead in the native path specifically on SM 86. Worth a look if you want to chase it.

Happy to run additional configs (ctx sweep, different models) if useful.

@TheTom

TheTom commented Apr 8, 2026

Copy link
Copy Markdown
Owner

@seanrasch thanks for the Ampere validation, that was the last unchecked box on the test plan. really appreciate the extra eyes.

good to confirm no regression on SM 86. the parity result makes sense — the kernel wins (shmem LUT, half2, vectorized loads) are latency optimizations that only show up when you're not bandwidth-bound, which Ampere is at these model sizes.

the 5% prefill gap on deeper models is a useful data point. lines up with what we'd expect from the per-element inverse WHT overhead in the native dequant path. default is conversion-on for prefill so it shouldn't bite anyone in practice.

thanks again for taking the time to test and write it up.

@signalnine signalnine deleted the pr/tq4-weight-opts branch April 8, 2026 01:50
Titaniumtown added a commit to Titaniumtown/llama-cpp-turboquant that referenced this pull request Apr 11, 2026
Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
Titaniumtown added a commit to Titaniumtown/llama-cpp-turboquant that referenced this pull request Apr 11, 2026
Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
Titaniumtown added a commit to Titaniumtown/llama-cpp-turboquant that referenced this pull request Apr 20, 2026
Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
TheTom pushed a commit that referenced this pull request Apr 20, 2026
* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL Δ  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570→1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873→2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263→2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263→2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
TheTom pushed a commit that referenced this pull request Apr 22, 2026
* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL Δ  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570→1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873→2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263→2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263→2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
LogicDaemon pushed a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request May 27, 2026
* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL Δ  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570→1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873→2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263→2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263→2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
LogicDaemon pushed a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request May 27, 2026
* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL Δ  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570→1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873→2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263→2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263→2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
wel97459 pushed a commit to wel97459/llama-cpp-turboquant that referenced this pull request Jun 4, 2026
* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL Δ  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570→1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873→2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263→2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263→2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
wel97459 pushed a commit to wel97459/llama-cpp-turboquant that referenced this pull request Jun 4, 2026
* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL Δ  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570→1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873→2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263→2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263→2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Jun 5, 2026
* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL Δ  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570→1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873→2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263→2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263→2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 7, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 10, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 11, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 13, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 13, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 13, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
LogicDaemon added a commit to LogicDaemon/llama-cpp-turboquant that referenced this pull request Jun 14, 2026
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.

vulkan: TQ4_1s support for model weights (#69)

* vulkan: add TQ4_1S weight compression support

Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).

Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
  subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
  activation pre-rotation (forward RHT on activation, centroid*scale
  dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
  RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)

Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size

Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation

* vulkan: add fused mul_mat_vec kernel for TQ4_1S

Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path.  The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.

Math:
  w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
  sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]

Portability choices:

- Workgroup size is pinned to 32 threads regardless of the
  DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
  the current architecture.  The butterfly operates on 32-element
  blocks with one element per thread; that contract is fixed by the
  quantization format, not by the GPU.  Earlier revisions used
  `gl_WorkGroupSize.x` as the stride unit, which silently skipped
  half the work on Intel drivers that force the subgroup to 16
  (tests passed via NMSE tolerance while real inference output was
  garbage).

- Butterfly implementation is shared memory only.  A subgroup-shuffle
  variant (`subgroupShuffleXor`) was prototyped and measured on Intel
  Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
  explicit shared-memory butterfly, because Mesa emulates subgroup
  shuffles via LDS and ends up doing the same LDS traffic with extra
  driver overhead.  The shared-memory butterfly is correct on every
  device regardless of subgroup-op support, is the fastest path on
  every device we can actually measure, and leaves the
  `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
  across all DMMV_WG_SIZE buckets.

- Reduction is the shared-memory tree reduction (no subgroupAdd), for
  the same reason: on Intel Arc the subgroupAdd is also LDS-backed
  and the hybrid reduction path was measurably slower.  Future
  vendor-specific heuristics can switch to the hybrid or pure-subgroup
  reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
  turn out to beat the LDS roundtrip there; the existing reduction
  modes in `mul_mat_vec_base.glsl` already provide the necessary
  variants.

- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
  per workgroup.  Each thread holds one position of each of the 8
  weight blocks and pairs them with the shared rotated activation.

- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
  TQ4_1S because it is a weight-only format that never reaches the
  coopmat2 matmul or the KV cache flash-attention paths.

Tests:

- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
  NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
  (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
  2048, 5120, 6144}, n in {1..8, 16, 64, 256}).  148 MUL_MAT test
  cases total.

Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):

| Model         | Config  |     Size  | Reduction | PPL ╬Ф  | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B  | I       | 1570тЖТ1082 |   -31.1%  | +4.66% |    53.9% |   107.5% |
| Phi-3.5-mini  | I       | 3873тЖТ2839 |   -26.7%  | +5.36% |    57.6% |    52.8% |
| Llama-3.2-3B  | hybrid  | 3263тЖТ2147 |   -34.2%  | +2.03% |    82.4% |    84.2% |
| Llama-3.2-3B  | premium | 3263тЖТ2577 |   -21.0%  | +0.98% |    71.3% |    67.3% |

Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.

All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8).  The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.

* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse

Splits the dequant+accumulate phase into two sub-loops:

  1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
     scale multiply, reads from weight buffer only).
  2. Read the rotated activation from shared memory ONCE per column,
     then FMA across all rows in a tight register loop.

This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2).  It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.

Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression).  The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)

fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10

The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

fix(hip): bypass pool for FA f16 temp buffers to prevent OOM

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.

Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.

Compared to commit 1 (VEC force), this approach recovers prefill:
  pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
  pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
  Decode: unchanged across all configs

Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).

Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.

Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)

Refs: ggml-org/llama.cpp#20969

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: force VEC FA path for quantized KV on HIP/ROCm

The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.

This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).

Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.

Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.

Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).

Refs: ggml-org/llama.cpp#20969 (community reports)

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vulkan: fix turbo3 build + coopmat FA after April upstream sync

Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:

  * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
    rounding to a runtime SPIR-V patch and dropped the _rte shader
    variants plus rte.glsl itself.
  * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
    support against a base that still had those variants.

After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.

Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.

Changes:

  * Drop the if (float_controls_rte_fp16) / else branches around
    cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
    variant, matching upstream post-1f30ac0ce.
  * Remove the #include "rte.glsl" from copy_to_quant.comp.
  * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
    generator (no MMQ code path for it).
  * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
    and the _cm2 counterpart alongside the other quant types.

Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):

  * Full HIP+Vulkan build is clean with no shader compile errors.
  * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
  * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
    530 cases pass (previously aborted on case 3).
  * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
    still green (no HIP regression).
  * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
    -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
    correctness issue on the Vulkan turbo3 decode path is pre-existing
    and orthogonal to this change.

llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:

  F16     tg128=20.98   turbo3 tg128=20.13   turbo4 tg128=20.17

Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81

fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks

fix nix build: Add spirv-headers to vulkanBuildInputs

ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81

spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a

but not added to the nix build environment

fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu

The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.

The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.

CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.

Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF

Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline

Fixes #83

fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)

The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.

The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.

Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check

turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: fix turbo build and test failures

Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:

1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
   The abort branch's -Werror=switch was missing the three TurboQuant KV
   cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
   after the turbo types were added. Added them next to the TQ* cases so
   the fall-through to GGML_ABORT stays consistent.

2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
   GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
   static_assert was not updated. Bumped the patch version 1->2 and the
   assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.

3. tests/test-quantize-fns.cpp
   a) Heap-buffer-overflow (ASan trace): the scratch buffers
      std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
      vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
      because from_float then writes test_size*sizeof(float) bytes into
      them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
      so every quant's from_float fits.
   b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
      dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
      domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
      in the attention graph, so a round-trip against the original float
      buffer is not meaningful and the existing threshold table does not
      apply.
   c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
      MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
      MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
      3-bit error budgets instead of the default 4-bit ones.

4. ggml-cuda/turbo-quant.cuh
   Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
   calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
   no longer trips -Werror on unused-result.

Verified locally on Arch Linux (CPU build):
  ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
  => 100% tests passed, 0 tests failed out of 41
  (the excluded tokenizer test needs the model-fetch fixture)

test-quantize-fns runs clean under AddressSanitizer as well.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix GGML_OP_COUNT assertion for RPC

When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.

Fix memory explosion on Apple Silicon

Add GitHub Sponsors funding link

vulkan: add turbo3 backend tests

- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
  (f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
  configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
  tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.

vulkan: fix and complete turbo3 KV cache support

feat: add CDNA4 (gfx950/MI355X) support + test results

Add AMD Instinct MI355X (gfx950) architecture support:

Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
  * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8: mfma_i32_16x16x32_i8 (same as CDNA3)
  * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)

MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)

Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

docs: add AMD Instinct MI300X (gfx942) ROCm test results

TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.

Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)

MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).

Tested-by: Andy Luo <andyluo7@users.noreply.github.com>

metal: add TurboFlash attention kernel for turbo3 KV cache decode

Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.

Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
  - Each lane handles DK/32 interleaved dimensions
  - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
  - K scoring via simd_sum dot product
  - turbo3 V unpack with register codebook (8 centroids)
  - Online softmax (m/l/o state) entirely in registers
  - Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
  - Online softmax correction with global max/sum
  - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
  - Eliminates 5 of 7 threadgroup barriers vs naive butterfly

Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).

Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB:  +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)

Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Vulkan compute shader support for turbo3 KV cache

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).

Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.

Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
  shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch

Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix

- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
  loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
  pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)

Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.

Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.

For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.

perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0

Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).

Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
   on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4

Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
  of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
  to catch silent corruption that PPL alone misses.

Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
  Upstream V12 runtime:  67 t/s  (4.5 GiB VRAM)
  q8_0 conversion:     176 t/s  (7.5 GiB VRAM)
  Native optimized:    240 t/s  (4.5 GiB VRAM)  тЖР this PR

Quality (vs f16 baseline):
  PPL:      7.54 (f16: 7.18, q8_0 conv: 7.55)
  Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
  NIAH:     5/5
  Coherence: 4/4 (Paris, 4, print, Shakespeare)

fix: cap map0 kernel shmem for 256-expert MoE models

Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.

Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.

Fixes #58

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: add MoE expert count kernel instantiations + TQ4_1S backend tests

Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
  Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
  with flash attention тАФ needs chunked map0 dispatch (follow-up)

Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size

extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.

Reported by Madreag on RTX 5090 WSL2.

Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.

Enhance Metal operations for TQ weights and concurrency handling for Gemma 4

- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.

fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)

GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.

Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed

Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.

signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).

Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).

Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).

Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fix turbo4 C reference WHT dequant mismatch (#43)

Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)

Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: AMD HIP/ROCm build support for TQ4_1S weight compression

- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: Windows MSVC build compatibility for TQ weight types

- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID

MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup

Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s тЖР BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)

fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add post-unrotate memory barrier for in-layer mixing safety

Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels

Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove unused CENTROIDS_1BIT constant

Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add missing TurboQuant FA template instances for HIP/ROCm build

  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)

Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V

Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)

The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).

Increase turbo3/turbo2 block size from 32 to 128

One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: KV state serialization uses padded tensor width (issue #28 follow-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: Boundary V (experimental) тАФ layer-aware V compression

Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)

Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).

Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.

Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)

GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.

Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).

ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.

Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
  (ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
  (always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance

Tested:
  GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
  Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
  DeepSeek: 143 t/s, PPL 11.61 (unchanged)
  16/16 coherence, NIAH 9-10/11 at 32K

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)

Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.

This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
  head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
  head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)

Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
  k_cur/v_cur per-head via ggml_pad before set_rows, return padded
  views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
  V head_dim from attention output after inverse WHT. All three
  build_attn variants (KV, K-only/MLA, ISWA) updated.

PPL (wikitext-2, 512 context, 8 chunks):
  DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
    Previously: 344,304 with 64-group WHT (catastrophic)
  GLM-4.7 (576тЖТ640):  9.11 vs 14.97 baseline
  Qwen3.5 (128):      6.31 unchanged

Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations

Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
  eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: asymmetric K/V quant support for Metal flash attention

- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
  underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)

4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.

Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.

Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).

Closes #19

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai

fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise

PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)

The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.

All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.

PPL summary (Qwen3.5, head_dim=128, wikitext-2):
  f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel

In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.

Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.

Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.

Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: InnerQ per-channel equalization + turbo2 64-group fallback

InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).

Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).

Math: <Q/s, s*K> = <Q, K> preserves dot products.

Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
  multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
  channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
  tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity

Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.

Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7

The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.

Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.

Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)

Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.

Block format: block_turbo2_0 = 10 bytes per 32 values
  - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
  - qs (8B): 2-bit centroid indices, 4 per byte

Usage: -ctk turbo2 -ctv turbo2

Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration

NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
  4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)

quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.

Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).

WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).

Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)

Users can now mix turbo3 and q8_0 independently for K and V caches:
  -ctk turbo3 -ctv q8_0   # compress K more, keep V at higher quality
  -ctk q8_0   -ctv turbo3 # opposite tradeoff

Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
  (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
  when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
  WHT must not fire. For MLA, V is a view of K so v->type correctly
  reflects K's turbo type.

Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
  q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
  turbo3/q8_0: 9/11 (1 extra miss, normal variance)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: 64-element WHT groups + MLA Q rotation fix (issue #13)

Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.

Key changes:

**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
  from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
  parameter (0=auto) to handle MLA where output dim differs from
  K head dim
- ops.cpp: CPU WHT parameterized on group_size

**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
  SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
  = garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
  and also added to the non-FA attention path.

**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot

Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.

Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
  log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
  ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
  so CPU flash attention can handle turbo3 K/V for models where CUDA FA
  returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy

Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)

Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.

Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
  so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
  K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})

Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0

The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.

Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].

Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.

Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)

k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).

Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
  тАв Shared-memory WHT butterfly (7 stages, no atomics)
  тАв Warp-reduce L2 norm + inter-warp accumulate via smem
  тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
  тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)

Also tighten flash-attention dequant paths (fattn-common.cuh):
  тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
    turbo3 block тАФ load qs/signs once instead of twice per pair
  тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
    signs byte for all 4 elements; unrolled float2 / half2 output pairs

Benchmark (Qwen3.5 35B, RTX 5090, tg128):
  Before: 71.86 t/s (0.75├Ч q8_0)
  After:  94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill

Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel.  The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).

launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true).  Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically.  Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14):  nb11*32*2/14 = ne[0]*sizeof(half). тЬУ

Before (VEC only):                    After (MMA for prefill):
  2K prefill:  5032 t/s (0.73├Ч q8_0)   6734 t/s (0.98├Ч q8_0)
  8K prefill:  3110 t/s (0.46├Ч q8_0)   6613 t/s (0.98├Ч q8_0)
 32K prefill:  1215 t/s (0.19├Ч q8_0)   6168 t/s (0.97├Ч q8_0)

Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).

Decode unchanged (VEC, ~0.64-0.82…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants